Field notes // Apple Silicon kernel lab
No kernel ships until it can prove it's right.
mlx-metal-kernels is an experimental library of hand-written Metal kernels for MLX on Apple Silicon — streaming attention, paged KV-caches, quantized matvec, fused decode blocks. Every optimized backend is built next to a plain MLX reference path and has to match it, within tolerance, before it's allowed anywhere near a benchmark.
01 / Why hand-write kernels at all
MLX gets you most of the way there.
Decode is the part that's left.
MLX already gives Apple Silicon a serious array framework — autograd, lazy evaluation, unified memory, all handled for you. For most workloads that's genuinely enough, and reaching for custom kernels would be premature.
Token-by-token decoding is a different animal. Batches are small, matvecs are tiny, the workload is latency-bound rather than throughput-bound, and almost every gain comes from a memory-layout decision: how a KV-cache pages, how a quantized weight gets packed, how far a streaming softmax can be pushed before precision quietly falls apart. Those are exactly the places a general-purpose op library leaves performance sitting on the table — and exactly where this repo lives, one primitive at a time.
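The streaming softmax mentioned above can be illustrated in a few lines. This is a sketch in plain Python, not Metal or MLX, and it is not taken from the repo: it shows the general online-softmax idea (running max, running normalizer, rescaled accumulator) for a single attention row with scalar values. The function names and the chunk size are assumptions made for illustration.

```python
import math


def softmax_reference(scores):
    """Plain two-pass softmax: subtract the global max, exponentiate, normalize."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def streaming_softmax_weighted_sum(scores, values, chunk=4):
    """Softmax-weighted sum of `values`, computed in one pass over chunks.

    Keeps a running max `m`, a running normalizer `l`, and a running
    weighted sum `acc`. Whenever a new chunk raises the max, the partial
    results accumulated so far are rescaled to the new max.
    """
    m = float("-inf")
    l = 0.0
    acc = 0.0
    for start in range(0, len(scores), chunk):
        s_chunk = scores[start:start + chunk]
        v_chunk = values[start:start + chunk]
        m_new = max(m, max(s_chunk))
        scale = math.exp(m - m_new)  # exp(-inf) is 0.0 on the first chunk
        l *= scale
        acc *= scale
        for s, v in zip(s_chunk, v_chunk):
            p = math.exp(s - m_new)
            l += p
            acc += p * v
        m = m_new
    return acc / l
```

Subtracting the running max keeps large scores from overflowing the exponential. How well a lower-precision accumulator holds up over long sequences is the kind of question the reference comparison exists to answer.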
02 / How a call gets to silicon
Five layers down, one verification path running alongside the whole way.
Every call (attention, decode, a quantized matvec) takes the same route: through the Python ops layer, into a backend router that knows which kernels have actually earned trust on this machine, and down into hand-written Metal. A parallel, plain-MLX reference path shadows every step and feeds the test suite that decides which backends are allowed to be benchmarked.
[Diagram: the call path described above. Solid path = the live call; dashed green = correctness verification.]
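One way to picture the router is as a registry that tracks which backends have passed verification and dispatches only to those. The sketch below is hypothetical: `BackendRegistry`, its method names, and the fall-back-to-reference behavior are assumptions for illustration, not the repo's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class BackendRegistry:
    """Hypothetical registry of backends plus the set that passed verification."""

    backends: Dict[str, Callable] = field(default_factory=dict)
    verified: Set[str] = field(default_factory=set)

    def register(self, name: str, fn: Callable) -> None:
        self.backends[name] = fn

    def mark_verified(self, name: str) -> None:
        # Only a registered backend can be marked as trusted.
        if name not in self.backends:
            raise KeyError(f"unknown backend: {name}")
        self.verified.add(name)

    def route(self, requested: str) -> Callable:
        """Return the requested backend if verified, else the reference path.

        Falling back silently is an assumption of this sketch; raising an
        error instead would be an equally reasonable design.
        """
        if requested in self.verified:
            return self.backends[requested]
        return self.backends["reference"]
```

In this sketch an unverified or unknown backend name never reaches a kernel; the caller gets the reference path instead.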
03 / What's actually in the lab
Six families of primitives, built bottom-up.
- Reference MLX attention
- Baseline streaming attention
- Row-parallel attention
- Tiled K/V attention
- Threadgroup attention v2
- Shape-specialized D=64 / D=128
- Experimental simdgroup_d64

- Contiguous KV-cache update
- Decode attention
- Paged KV-cache + paged decode
- Fused decode blocks from QKV
- GQA / MQA decode composition
- GQA / MQA prefill attention

- RMSNorm
- RoPE
- SwiGLU
- Residual add, RMSNorm + residual
- QKV split, QKV split + RoPE
- QKV + RoPE + cache update

- q4 / q8 dequantization
- q4 / q8 decode matvec
- Parallel + tiled multi-output matvec
- Quantized QKV / output projection
- Quantized decode block, MLP block
- Experimental fused q4/q8 MLP

- Toy transformer-layer decode bench
- Full Llama-like decode-layer experiment
- Multi-layer decode stack
- Llama-like config + weight-layout mapping
- Model-adapter scaffold
- GQA / MQA utilities

- Unified benchmark runner
- Local report generator
- Chip-specific backend registry
- Local autotune cache
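To make the q4 entries in the list above concrete, here is a minimal sketch of symmetric 4-bit quantization with one scale per group and two values packed per byte. The packing order, the group size, and the function names are assumptions for illustration; the repo's actual weight layout may differ.

```python
def quantize_q4(weights, group_size=8):
    """Quantize floats to 4 bits with one float scale per group.

    Each weight maps to an integer in [-7, 7], is biased by 8 so it fits
    an unsigned nibble, and two nibbles are packed into each byte
    (low nibble first).
    """
    assert group_size % 2 == 0 and len(weights) % group_size == 0
    packed = bytearray()
    scales = []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q = [max(-7, min(7, round(w / scale))) + 8 for w in group]
        for lo, hi in zip(q[0::2], q[1::2]):
            packed.append(lo | (hi << 4))
    return bytes(packed), scales


def dequantize_q4(packed, scales, group_size=8):
    """Inverse of quantize_q4: unpack two nibbles per byte and rescale."""
    bytes_per_group = group_size // 2
    out = []
    for i, byte in enumerate(packed):
        scale = scales[i // bytes_per_group]
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out
```

With round-to-nearest, each dequantized weight lands within half a scale step of the original, which gives a reference test a concrete error bound to check.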
04 / The rule that governs everything
Nothing skips this order — not even the fast stuff.
1. Write the reference path
Every operation starts as plain, boring, obviously-correct MLX code. No Metal, no cleverness, just the right answer.
2. Add a correctness-first Metal backend
The first hand-written kernel doesn't have to be fast. It has to match step 1, within dtype-appropriate tolerance.
3. Test it against the reference
Random inputs, every time a backend changes, not a one-off check that gets forgotten once a kernel "works."
4. Only then, optimize
Row-parallel, tiled, threadgroup, simdgroup: each variant ships as an explicit, opt-in experiment, never a silent default.
5. Benchmark locally
Numbers come from this machine, this chip, this MLX version. They are never assumed and never carried over from somewhere else.
6. Pick a backend on purpose
Explicit flags by default. An opt-in autotuner can choose machine-specific defaults, but only for backends that have earned it.
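The ordering above can be sketched end to end in plain Python. Everything here is illustrative: RMSNorm stands in for any operation, the "candidate" only rounds its output to float16 rather than running a Metal kernel, and the tolerance values are assumed, since the project's actual thresholds are not stated here.

```python
import math
import random
import struct

# Assumed (rtol, atol) pairs, for illustration only.
TOLERANCES = {"float32": (1e-5, 1e-6), "float16": (1e-2, 1e-3)}


def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def rmsnorm_reference(x, weight, eps=1e-6):
    """The plain, obviously-correct path."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


def rmsnorm_candidate_fp16(x, weight, eps=1e-6):
    """Stand-in for an optimized backend: same math, float16 output."""
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [to_float16(v * inv * w) for v, w in zip(x, weight)]


def matches_reference(candidate, reference, dtype, trials=50, dim=64, seed=0):
    """Compare candidate and reference on random inputs.

    The fixed seed is a choice of this sketch, made for reproducibility.
    """
    rtol, atol = TOLERANCES[dtype]
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        w = [rng.gauss(1.0, 0.1) for _ in range(dim)]
        for got, want in zip(candidate(x, w), reference(x, w)):
            if abs(got - want) > atol + rtol * abs(want):
                return False
    return True


def eligible_backends(candidates, reference, dtype):
    """Gate: only backends that match the reference may be benchmarked or chosen."""
    return [
        name
        for name, fn in candidates.items()
        if matches_reference(fn, reference, dtype)
    ]
```

The point of the gate is that benchmarking and backend selection only ever see the list returned by `eligible_backends`, so a fast but wrong kernel cannot win.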
05 / What this deliberately isn't
Scope discipline is a feature, not a disclaimer.
The project says no to a long list of things it could chase. On purpose.
- A production inference engine, full stop. This is a kernel lab, not a serving stack.
- GPTQ, AWQ, or SmoothQuant. There is no calibrated, model-quality-preserving quantization here, only correctness-first packaging.
- Hugging Face hub downloads or automatic tokenizer loading. Everything stays local and dependency-light.
- A complete checkpoint-to-served-model pipeline, or production safetensors conversion.
- Any performance claim that doesn't come with a benchmark script and a machine it was measured on.
Every one of those is a real, useful project. None of them is this one — yet.
06 / Sign-off sheet
Twenty items shipped. Two left on the bench.
In progress
07 / Run it yourself
It's a real, installable Python package — go break it.
$ pip install mlx pytest
$ pip install -e .
$ pytest tests -q    # every backend re-checked against its reference path
$ python examples/run_basic.py
import mlx.core as mx
from ops.attention_ops import fast_attention

Q = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)
K = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)
V = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)

O = fast_attention(Q, K, V, causal=True, backend="auto")
No performance numbers are quoted here on purpose — the project's own rule is that a
speedup only counts once it's measured on a specific Apple Silicon machine, with the
benchmark script that produced it sitting right next to the claim. Run the suite, read
docs/, and see what your own chip says.
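A measurement that travels with its machine could look like the sketch below. It is not the project's benchmark runner: the `benchmark` helper and the fields it records are assumptions, and a real record would also carry the chip name and the MLX version.

```python
import platform
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=20):
    """Time `fn(*args)` and return the result alongside machine metadata."""
    for _ in range(warmup):
        fn(*args)  # warm caches and any one-time setup before timing
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "iters": iters,
        "machine": platform.machine(),
        "system": platform.platform(),
        "python": platform.python_version(),
    }
```

Because MLX evaluates lazily, a timed callable has to force evaluation (for example with `mx.eval`) before returning; otherwise the timer mostly measures graph construction rather than the kernel.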