PART: mlx-metal-kernels | REV: v0.19.0 | LANG: Python 91% · Metal 9% | TARGET: Apple Silicon / MLX | STATUS: EXPERIMENTAL

Field notes // Apple Silicon kernel lab

No kernel ships until it can prove it's right.

mlx-metal-kernels is an experimental library of hand-written Metal kernels for MLX on Apple Silicon — streaming attention, paged KV-caches, quantized matvec, fused decode blocks. Every optimized backend is built next to a plain MLX reference path and has to match it, within tolerance, before it's allowed anywhere near a benchmark.

Published · 12 min read

Correctness first: verify, then optimize

01 / Why hand-write kernels at all

MLX gets you most of the way there.
Decode is the part that's left.

MLX already gives Apple Silicon a serious array framework — autograd, lazy evaluation, unified memory, all handled for you. For most workloads that's genuinely enough, and reaching for custom kernels would be premature.

Token-by-token decoding is a different animal. Batches are small, matvecs are tiny, the workload is latency-bound rather than throughput-bound, and almost every gain comes from a memory-layout decision: how a KV-cache pages, how a quantized weight gets packed, how far a streaming softmax can be pushed before precision quietly falls apart. Those are exactly the places a general-purpose op library leaves performance sitting on the table — and exactly where this repo lives, one primitive at a time.
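To make the streaming-softmax point concrete, here is a minimal pure-Python sketch. It is not the repo's kernel. It assumes a single query row with scalar values, and the function names are illustrative. The streaming version walks the scores in chunks and carries a running max, denominator, and numerator, rescaling the accumulators whenever the max grows. That rescaling is what keeps large scores from overflowing: a naive `math.exp(1200.0)` raises `OverflowError`.

```python
import math

def reference_softmax_weighted_sum(scores, values):
    """Plain two-pass softmax(scores) . values for one query row."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / denom

def streaming_softmax_weighted_sum(scores, values, chunk=4):
    """One pass over chunks, carrying a running max, denominator, and numerator."""
    running_max = float("-inf")
    denom = 0.0
    numer = 0.0
    for start in range(0, len(scores), chunk):
        s_chunk = scores[start:start + chunk]
        v_chunk = values[start:start + chunk]
        new_max = max(running_max, max(s_chunk))
        # Rescale everything accumulated so far to the new max.
        scale = math.exp(running_max - new_max)
        denom *= scale
        numer *= scale
        for s, v in zip(s_chunk, v_chunk):
            w = math.exp(s - new_max)
            denom += w
            numer += w * v
        running_max = new_max
    return numer / denom

# Scores this large would overflow a naive exp(); both paths subtract the max first.
scores = [0.5, 1200.0, -3.0, 1199.0, 2.0, 7.5, 1198.5, -40.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ref = reference_softmax_weighted_sum(scores, values)
out = streaming_softmax_weighted_sum(scores, values, chunk=4)
```

In float64 the two paths agree to many digits. The interesting question for a real kernel is how that agreement degrades in float16, which is why the reference path stays around.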

02 / How a call gets to silicon

Five layers down, one verification path running alongside the whole way.

Every call (attention, decode, a quantized matvec) takes the same route: through the Python ops layer, into a backend router that knows which kernels have actually earned trust on this machine, and down into hand-written Metal. A parallel, plain-MLX reference path shadows every step and feeds the test suite, which decides what is allowed to be benchmarked.

[Figure: mlx-metal-kernels architecture. A Python model and ops layer routes through a backend autotuner into hand-written Metal kernels on the Apple GPU, while a parallel pure-MLX reference path verifies correctness and feeds the test suite. Live path: MODEL & GENERATION (config, tokenizer, sampling) → OPS LAYER (attention, decode, quant, norm) → AUTOTUNE ROUTER (picks the verified-fast backend) → METAL KERNELS (kernels/*.metal) → APPLE GPU (unified memory). Verification path: REFERENCE PATH (must match, every time) → TEST SUITE (gates every benchmark).]

solid path = the live call · dashed green = correctness verification
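The routing idea can be sketched in a few lines of plain Python. Everything here is hypothetical: `BackendRouter`, its methods, and the toy sum-of-squares op stand in for the real ops layer and are not the repo's API. The one rule it illustrates is that `"auto"` only ever picks a backend that has matched the reference path, while an explicit name is an opt-in.

```python
import math
from typing import Callable, Dict, List

class BackendRouter:
    """Hypothetical router: "auto" only selects backends that passed verification."""

    def __init__(self, reference: Callable):
        self.reference = reference
        self.backends: Dict[str, Callable] = {}
        self.trusted: List[str] = []

    def register(self, name: str, fn: Callable) -> None:
        self.backends[name] = fn

    def verify(self, name: str, samples, tol: float = 1e-9) -> bool:
        """Shadow the backend with the reference path on sample inputs."""
        fn = self.backends[name]
        ok = all(abs(fn(x) - self.reference(x)) <= tol for x in samples)
        if ok and name not in self.trusted:
            self.trusted.append(name)
        if not ok and name in self.trusted:
            self.trusted.remove(name)
        return ok

    def dispatch(self, x, backend: str = "auto"):
        if backend == "reference":
            return self.reference(x)
        if backend == "auto":
            # Prefer the first trusted backend; otherwise stay on the reference path.
            fn = self.backends[self.trusted[0]] if self.trusted else self.reference
            return fn(x)
        return self.backends[backend](x)  # explicit opt-in, verified or not

def sum_of_squares(xs):        # plain reference path
    return sum(v * v for v in xs)

def fused_sum_of_squares(xs):  # stands in for an optimized kernel
    return math.fsum(v * v for v in xs)

def buggy_sum_of_squares(xs):  # deliberately wrong: drops the last element
    return sum(v * v for v in xs[:-1])

router = BackendRouter(sum_of_squares)
router.register("fused", fused_sum_of_squares)
router.register("buggy", buggy_sum_of_squares)
samples = [[0.5, -1.25, 3.0], [2.0, 2.0, 2.0, 2.0], [1e-3, 7.0]]
```

Calling `router.verify("fused", samples)` promotes that backend; `router.verify("buggy", samples)` does not, so `"auto"` never routes to it.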

03 / What's actually in the lab

Six families of primitives, built bottom-up.

Legend: core = stable default · experimental = explicit opt-in only
PIN 01
Attention
mixed
  • Reference MLX attention
  • Baseline streaming attention
  • Row-parallel attention
  • Tiled K/V attention
  • Threadgroup attention v2
  • Shape-specialized D=64 / D=128
  • Experimental simdgroup_d64
PIN 02
Decode & KV-cache
core
  • Contiguous KV-cache update
  • Decode attention
  • Paged KV-cache + paged decode
  • Fused decode blocks from QKV
  • GQA / MQA decode composition
  • GQA / MQA prefill attention
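For readers new to paged KV-caches, here is a toy sketch of the bookkeeping involved, in plain Python. It assumes fixed-size pages, a free list, and a per-sequence block table. The class and method names are made up for illustration and say nothing about the repo's actual memory layout.

```python
class PagedKVCache:
    """Toy paged KV-cache: fixed-size pages plus a per-sequence block table."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.pages = [[None] * page_size for _ in range(num_pages)]
        self.free_pages = list(range(num_pages))
        self.block_table = {}  # seq_id -> list of page indices, in token order
        self.lengths = {}      # seq_id -> number of tokens stored

    def append(self, seq_id, kv):
        """Store one token's (k, v) entry, grabbing a new page when the last is full."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_table.setdefault(seq_id, [])
        if n % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop(0))
        page = table[n // self.page_size]
        self.pages[page][n % self.page_size] = kv
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Walk the block table to return entries in logical token order."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_table.get(seq_id, [])
        return [
            self.pages[table[i // self.page_size]][i % self.page_size]
            for i in range(n)
        ]

    def release(self, seq_id):
        """Return a finished sequence's pages to the free list."""
        self.free_pages.extend(self.block_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# Two sequences decoding in lockstep end up with interleaved physical pages.
cache = PagedKVCache(num_pages=4, page_size=2)
for t in range(3):
    cache.append("a", ("k_a", t))
    cache.append("b", ("k_b", t))
```

The physical pages for each sequence are not contiguous, yet `gather` still returns tokens in order. A paged decode kernel does the same indirection on the GPU, which is exactly the kind of layout decision the reference path has to pin down first.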
PIN 03
Transformer primitives
core
  • RMSNorm
  • RoPE
  • SwiGLU
  • Residual add, RMSNorm + residual
  • QKV split, QKV split + RoPE
  • QKV + RoPE + cache update
PIN 04
Quantization
mixed
  • q4 / q8 dequantization
  • q4 / q8 decode matvec
  • Parallel + tiled multi-output matvec
  • Quantized QKV / output projection
  • Quantized decode block, MLP block
  • Experimental fused q4/q8 MLP
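As a rough picture of what "q4 dequantization" and "q4 decode matvec" mean, the sketch below packs two 4-bit codes per byte with a per-group scale and minimum, then dequantizes on the fly inside a matvec. The scheme is a generic one chosen for illustration. Group size, nibble order, and the affine form are all assumptions, not the repo's format, and it expects an even group size that divides the weight count.

```python
import random

def quantize_q4(weights, group_size=8):
    """Per-group affine 4-bit quantization; two codes packed per byte."""
    packed, scales, mins = bytearray(), [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15.0 or 1.0  # avoid a zero scale for constant groups
        codes = [min(15, max(0, round((w - lo) / scale))) for w in group]
        for i in range(0, len(codes), 2):
            packed.append(codes[i] | (codes[i + 1] << 4))  # low nibble first
        scales.append(scale)
        mins.append(lo)
    return bytes(packed), scales, mins

def dequant_matvec_q4(packed, scales, mins, x, rows, cols, group_size=8):
    """y = W x for row-major W, dequantizing each weight from its packed nibble."""
    y = []
    for r in range(rows):
        acc = 0.0
        for c in range(cols):
            idx = r * cols + c
            byte = packed[idx // 2]
            code = (byte >> 4) if idx % 2 else (byte & 0x0F)
            g = idx // group_size
            acc += (code * scales[g] + mins[g]) * x[c]
        y.append(acc)
    return y

random.seed(0)
rows, cols = 4, 16
W = [random.uniform(-1.0, 1.0) for _ in range(rows * cols)]
x = [random.uniform(-1.0, 1.0) for _ in range(cols)]
packed, scales, mins = quantize_q4(W)
y_q = dequant_matvec_q4(packed, scales, mins, x, rows, cols)
y_ref = [sum(W[r * cols + c] * x[c] for c in range(cols)) for r in range(rows)]
```

Each dequantized weight is off by at most half a quantization step, so the matvec error is bounded by `0.5 * max(scales) * sum(|x|)`. That bound is the sort of tolerance a quantized backend would be tested against, rather than exact equality.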
PIN 05
Model scaffolding
scaffold
  • Toy transformer-layer decode bench
  • Full Llama-like decode-layer experiment
  • Multi-layer decode stack
  • Llama-like config + weight-layout mapping
  • Model-adapter scaffold
  • GQA / MQA utilities
PIN 06
Benchmarking & autotune
core
  • Unified benchmark runner
  • Local report generator
  • Chip-specific backend registry
  • Local autotune cache

04 / The rule that governs everything

Nothing skips this order — not even the fast stuff.

STEP 1/6

Write the reference path

Every operation starts as plain, boring, obviously-correct MLX code. No Metal, no cleverness — just the right answer.

STEP 2/6

Add a correctness-first Metal backend

The first hand-written kernel doesn't have to be fast. It has to match step one, within dtype-appropriate tolerance.
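"Dtype-appropriate" is worth spelling out. The sketch below uses the standard library's half-precision `struct` format to round values to float16, then checks them against the originals with per-dtype tolerances. The tolerance numbers are illustrative assumptions, not values taken from the repo.

```python
import struct

# Illustrative tolerances only; the repo's actual values are not stated here.
TOLERANCES = {
    "float32": {"rtol": 1e-5, "atol": 1e-6},
    "float16": {"rtol": 1e-3, "atol": 1e-3},
}

def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def allclose(actual, expected, dtype: str) -> bool:
    """Elementwise |a - e| <= atol + rtol * |e| with per-dtype tolerances."""
    tol = TOLERANCES[dtype]
    return all(
        abs(a - e) <= tol["atol"] + tol["rtol"] * abs(e)
        for a, e in zip(actual, expected)
    )

expected = [0.1, 1.0 / 3.0, 2.7182818, 100.3]
half_result = [to_float16(v) for v in expected]  # stands in for a float16 kernel

passes_fp16 = allclose(half_result, expected, "float16")
passes_fp32 = allclose(half_result, expected, "float32")
```

The same output passes under the float16 tolerance and fails under the float32 one. A single global tolerance would either reject correct half-precision kernels or wave through broken single-precision ones.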

STEP 3/6

Test it against the reference

Random inputs, every time a backend changes — not a one-off check that gets forgotten once a kernel "works."
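A randomized check of this kind might look like the following sketch. It uses RMSNorm in plain Python as the operation, a slightly reordered implementation as the stand-in "backend", and seeded generators so any failure can be replayed. The function names, seed count, and shapes are assumptions for illustration.

```python
import math
import random

def rmsnorm_reference(x, weight, eps=1e-6):
    """Plain RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) * w for v, w in zip(x, weight)]

def rmsnorm_candidate(x, weight, eps=1e-6):
    """Stand-in for an optimized backend: one reciprocal, reused per element."""
    inv_rms = 1.0 / math.sqrt(math.fsum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def run_random_checks(candidate, reference, seeds=range(20),
                      dims=(1, 7, 64, 129), rtol=1e-9, atol=1e-12):
    """Return the (seed, dim) cases where candidate and reference disagree."""
    failures = []
    for seed in seeds:
        rng = random.Random(seed)  # seeded so any failure can be replayed
        for dim in dims:
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            w = [rng.uniform(0.5, 1.5) for _ in range(dim)]
            got, want = candidate(x, w), reference(x, w)
            if any(abs(g - r) > atol + rtol * abs(r) for g, r in zip(got, want)):
                failures.append((seed, dim))
    return failures

failures = run_random_checks(rmsnorm_candidate, rmsnorm_reference)
```

Odd sizes such as 7 and 129 are in the list on purpose: kernels that tile or vectorize tend to break at the ragged edge, not in the middle of a power-of-two shape.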

STEP 4/6

Only then, optimize

Row-parallel, tiled, threadgroup, simdgroup — each variant ships as an explicit, opt-in experiment, never a silent default.

STEP 5/6

Benchmark locally

Numbers come from this machine, this chip, this MLX version — never assumed, never carried over from somewhere else.
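A local timing harness in that spirit could be as small as the sketch below: warm up, time several repeats, report the median, and attach the machine details to the number. This is a generic standard-library helper with a hypothetical `benchmark` name, not the repo's unified benchmark runner.

```python
import platform
import statistics
import sys
import time

def benchmark(fn, *args, warmup=3, repeats=10):
    """Time fn(*args) after warmup runs; report the median with machine details."""
    for _ in range(warmup):
        fn(*args)
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "repeats": repeats,
        "machine": platform.machine(),
        "platform": platform.platform(),
        "python": sys.version.split()[0],
    }

# Any callable works; sorting a reversed list is just a placeholder workload.
report = benchmark(sorted, list(range(10_000, 0, -1)))
```

Two caveats apply when the workload is an MLX op. MLX evaluates lazily, so the timed callable has to force the result (for example with `mx.eval`), or the harness measures graph construction instead of the kernel. A real report would also record the MLX version next to the chip.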

STEP 6/6

Pick a backend on purpose

Explicit flags by default. An opt-in autotuner can choose machine-specific defaults — but only for backends that have earned it.
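One way to picture "explicit by default, autotuned by opt-in" is the sketch below. An explicit backend name is always honored, while `"auto"` consults a local JSON cache and only accepts an entry that is still in the verified set. The function names, cache key format, and file layout are all assumptions made for this example.

```python
import json
import os
import tempfile

def choose_backend(requested, op, shape_key, cache_path, verified, default="reference"):
    """Explicit flags win; "auto" consults a local cache limited to verified backends."""
    if requested != "auto":
        return requested
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
        cached = cache.get(f"{op}:{shape_key}")
        if cached in verified:
            return cached
    return default

def record_autotune_result(op, shape_key, timings_ms, cache_path, verified):
    """Store the fastest verified backend for this op and shape in the local cache."""
    eligible = {name: ms for name, ms in timings_ms.items() if name in verified}
    if not eligible:
        return None
    best = min(eligible, key=eligible.get)
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    cache[f"{op}:{shape_key}"] = best
    with open(cache_path, "w") as fh:
        json.dump(cache, fh, indent=2)
    return best

cache_path = os.path.join(tempfile.mkdtemp(), "autotune.json")
verified = {"tiled", "row_parallel"}
# Made-up timings: the fastest entry is unverified, so it cannot win.
timings_ms = {"tiled": 0.8, "simdgroup_d64": 0.5, "row_parallel": 1.1}
best = record_autotune_result("attention", "d64", timings_ms, cache_path, verified)
```

Here `best` is `"tiled"` even though `simdgroup_d64` posted the lower time, because speed only counts after verification. Asking for `"simdgroup_d64"` by name still works, which keeps experiments explicit.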

05 / What this deliberately isn't

Scope discipline is a feature, not a disclaimer.

The project says no to a long list of things it could chase. On purpose.

A production inference engine, full stop — this is a kernel lab, not a serving stack.

GPTQ, AWQ, or SmoothQuant — no calibrated, model-quality-preserving quantization here, only correctness-first packaging.

Hugging Face hub downloads or automatic tokenizer loading — everything stays local and dependency-light.

A complete checkpoint-to-served-model pipeline, or production safetensors conversion.

Any performance claim that doesn't come with a benchmark script and a machine it was measured on.

Every one of those is a real, useful project. None of them is this one — yet.

06 / Sign-off sheet

Twenty items shipped. Two left on the bench.

In progress

Optimized GQA Metal decode attention
Fused q4 MLP kernel

Shipped

Baseline MLX custom Metal attention
Reference correctness path
Row-parallel & tiled K/V attention
RMSNorm, RoPE, SwiGLU, residual
KV-cache update & decode attention
Paged KV-cache & paged decode
Fused decode block helpers
q4/q8 dequant & decode matvec
Parallel & tiled q4/q8 matvec
Quantized decode block
Threadgroup attention v2
Simdgroup attention experiments
Unified benchmark & report suite
Chip-specific autotuning
Toy transformer-layer decode bench
Llama-like model integration scaffold
Quantized MLP block
GQA / MQA support
Checkpoint layout loader scaffold
Real checkpoint adapter scaffold

07 / Run it yourself

It's a real, installable Python package — go break it.

zsh · apple-silicon · mlx-metal-kernels
$ pip install mlx pytest
$ pip install -e .
$ pytest tests -q
# every backend re-checked against its reference path
$ python examples/run_basic.py
attention_demo.py (Python)
import mlx.core as mx
from ops.attention_ops import fast_attention

Q = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)
K = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)
V = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)

O = fast_attention(Q, K, V, causal=True, backend="auto")

No performance numbers are quoted here on purpose — the project's own rule is that a speedup only counts once it's measured on a specific Apple Silicon machine, with the benchmark script that produced it sitting right next to the claim. Run the suite, read docs/, and see what your own chip says.

08 / Repository

github.com/manishklach/mlx-metal-kernels

Open the issues, read the roadmap, or clone it and point your own Mac at the benchmark suite. Forks, correctness bugs, and new backends are exactly what this kind of lab is for.

python · macos · mlx · metal · apple-silicon · gpu-kernels · kv-cache · flashattention · llm-inference · custom-kernels
Field notes on an open-source kernel lab. Not affiliated with Apple, MLX, or Anthropic.