PART mlx-metal-kernels · REV v0.19.0 · LANG Python 91%, Metal 9% · TARGET Apple Silicon / MLX · STATUS: EXPERIMENTAL

Field notes // Apple Silicon kernel lab

No kernel ships until it can prove it's right.

mlx-metal-kernels is an experimental library of hand-written Metal kernels for MLX on Apple Silicon — streaming attention, paged KV-caches, quantized matvec, fused decode blocks. Every optimized backend is built next to a plain MLX reference path and has to match it, within tolerance, before it's allowed anywhere near a benchmark.

Published June 20, 2026 · 12 min read


Correctness first: verify, then optimize

01 / Why hand-write kernels at all

MLX gets you most of the way there.
Decode is the part that's left.

MLX already gives Apple Silicon a serious array framework — autograd, lazy evaluation, unified memory, all handled for you. For most workloads that's genuinely enough, and reaching for custom kernels would be premature.

Token-by-token decoding is a different animal. Batches are small, matvecs are tiny, the workload is latency-bound rather than throughput-bound, and almost every gain comes from a memory-layout decision: how a KV-cache pages, how a quantized weight gets packed, how far a streaming softmax can be pushed before precision quietly falls apart. Those are exactly the places a general-purpose op library leaves performance sitting on the table — and exactly where this repo lives, one primitive at a time.
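
To make the streaming-softmax point concrete, here is a minimal pure-Python sketch of the usual online-softmax recurrence for one attention row. It is an illustration under stated assumptions, not the repo's kernel: the function names and the `chunk` parameter are hypothetical, and a real kernel would run this per row on the GPU in float16 or float32 rather than in Python floats.

```python
import math

def softmax_reference(scores):
    """Two-pass softmax: the plain, obviously-correct version."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def streaming_attention_row(scores, values, chunk=4):
    """One pass over (score, value) pairs, `chunk` at a time.

    Keeps a running max, a running denominator, and a running weighted
    sum, rescaling the old state whenever a larger score appears.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), chunk):
        s_chunk = scores[start:start + chunk]
        v_chunk = values[start:start + chunk]
        new_max = max(running_max, max(s_chunk))
        # Rescale previous state to the new max; exp(-inf) is 0.0 on the first chunk.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_chunk, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum
```

The rescale step is where precision is won or lost: in lower-precision dtypes, the accumulated `acc` and `running_sum` are the values a kernel has to protect.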

02 / How a call gets to silicon

Five layers down, one verification path running alongside the whole way.

Every call, whether attention, decode, or a quantized matvec, takes the same route: through the Python ops layer, into a backend router that knows which kernels have actually earned trust on this machine, and down into hand-written Metal. A parallel, plain-MLX reference path shadows every step and feeds the test suite, which decides what is allowed to be benchmarked at all.

[Diagram: call path from the Python ops layer to Metal. Solid path = the live call; dashed green = correctness verification.]
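
One way to picture the router is the sketch below. Every name in it (`Backend`, `Router`, `trusted`, `mark_trusted`) is hypothetical and only illustrates the rule described above: `"auto"` may pick a backend only if it has earned trust, an explicitly named backend is an opt-in, and the reference path is always available.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Backend:
    name: str
    fn: Callable
    trusted: bool = False  # flipped only after the backend matches the reference

@dataclass
class Router:
    reference: Callable
    backends: Dict[str, Backend] = field(default_factory=dict)

    def register(self, backend: Backend) -> None:
        self.backends[backend.name] = backend

    def mark_trusted(self, name: str) -> None:
        self.backends[name].trusted = True

    def resolve(self, name: str = "auto") -> Callable:
        if name == "reference":
            return self.reference
        if name == "auto":
            # Never silently select a backend that has not earned trust.
            for backend in self.backends.values():
                if backend.trusted:
                    return backend.fn
            return self.reference
        if name not in self.backends:
            raise ValueError(f"unknown backend: {name!r}")
        # An explicit name is an opt-in, so experimental backends are allowed here.
        return self.backends[name].fn
```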

03 / What's actually in the lab

Six families of primitives, built bottom-up.

core: stable default · experimental: explicit opt-in only

PIN 01

Attention

mixed

Reference MLX attention
Baseline streaming attention
Row-parallel attention
Tiled K/V attention
Threadgroup attention v2
Shape-specialized D=64 / D=128
Experimental simdgroup_d64

PIN 02

Decode & KV-cache

core

Contiguous KV-cache update
Decode attention
Paged KV-cache + paged decode
Fused decode blocks from QKV
GQA / MQA decode composition
GQA / MQA prefill attention
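
For readers new to paged KV-caches, the idea can be sketched in a few lines of plain Python. This is an assumption-laden toy, not the repo's layout: the class name, the free-list policy, and storing one opaque `kv` entry per token are all hypothetical. The point is only that a sequence's tokens live in fixed-size pages that need not be contiguous, with a page table translating logical positions to physical slots.

```python
class PagedKVCache:
    """Toy paged cache: fixed-size pages plus a per-sequence page table."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.pages = [[None] * page_size for _ in range(num_pages)]
        self.free = list(range(num_pages))  # physical pages not yet assigned
        self.page_table = {}                # seq_id -> list of physical page ids
        self.lengths = {}                   # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        n = self.lengths.get(seq_id, 0)
        table = self.page_table.setdefault(seq_id, [])
        if n % self.page_size == 0:
            # Current page is full (or this is the first token): grab a new one.
            if not self.free:
                raise MemoryError("no free pages")
            table.append(self.free.pop())
        page = table[n // self.page_size]
        self.pages[page][n % self.page_size] = kv
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Return the sequence's entries in logical order."""
        table = self.page_table.get(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        return [self.pages[table[i // self.page_size]][i % self.page_size]
                for i in range(n)]
```

A paged decode kernel does the same `i // page_size`, `i % page_size` translation while it walks the keys and values, instead of materializing the gathered list first.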

PIN 03

Transformer primitives

core

RMSNorm
RoPE
SwiGLU
Residual add, RMSNorm + residual
QKV split, QKV split + RoPE
QKV + RoPE + cache update
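
As a reference point for what a RoPE primitive computes, here is a small pure-Python sketch. It assumes the interleaved-pair convention (elements `2i` and `2i+1` rotate together) and the common base of 10000; the repo may use a different pairing or base, so treat both as assumptions.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of `x` by position-dependent angles."""
    d = len(x)
    if d % 2:
        raise ValueError("head dimension must be even")
    out = list(x)
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each pair is a plain 2D rotation, the dot product of a rotated query and key depends only on the difference of their positions, which gives a convenient property to test a fused kernel against.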

PIN 04

Quantization

mixed

q4 / q8 dequantization
q4 / q8 decode matvec
Parallel + tiled multi-output matvec
Quantized QKV / output projection
Quantized decode block, MLP block
Experimental fused q4/q8 MLP
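
To show what "packing" means here, the sketch below quantizes weights to 4-bit codes, two per byte, with one scale per group, then dequantizes and runs a matvec. The layout is an assumption chosen for readability (symmetric scaling, low nibble first, `group_size` even and dividing the row length); it is not claimed to match the repo's q4 format.

```python
def quantize_q4(weights, group_size=8):
    """Pack weights into 4-bit codes (two per byte) with one scale per group."""
    packed = bytearray()
    scales = []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        # Signed codes in [-8, 7], stored with a +8 offset as 0..15.
        codes = [max(-8, min(7, round(w / scale))) + 8 for w in group]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))
    return bytes(packed), scales

def dequantize_q4(packed, scales, group_size=8):
    out = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):  # low nibble first
            scale = scales[len(out) // group_size]
            out.append((code - 8) * scale)
    return out

def q4_matvec(rows, x, group_size=8):
    """rows: list of (packed, scales) pairs, one per output element."""
    return [sum(w * xi for w, xi in zip(dequantize_q4(p, s, group_size), x))
            for p, s in rows]
```

A fused decode matvec would skip the intermediate list and unpack nibbles inside the dot-product loop; the dequantize-then-multiply version above is the kind of reference it has to match.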

PIN 05

Model scaffolding

scaffold

Toy transformer-layer decode bench
Full Llama-like decode-layer experiment
Multi-layer decode stack
Llama-like config + weight-layout mapping
Model-adapter scaffold
GQA / MQA utilities
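
The core of any GQA / MQA utility is the mapping from a query head to the KV head it shares. A minimal sketch, with a hypothetical function name:

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head that query head `q_head` reads from."""
    if n_q_heads % n_kv_heads:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size
```

With `n_kv_heads == 1` this is MQA (every query head reads KV head 0), and with `n_kv_heads == n_q_heads` it reduces to ordinary multi-head attention.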

PIN 06

Benchmarking & autotune

core

Unified benchmark runner
Local report generator
Chip-specific backend registry
Local autotune cache

04 / The rule that governs everything

Nothing skips this order — not even the fast stuff.

STEP 1/6

Write the reference path

Every operation starts as plain, boring, obviously-correct MLX code. No Metal, no cleverness — just the right answer.

STEP 2/6

Add a correctness-first Metal backend

The first hand-written kernel doesn't have to be fast. It has to match step one, within dtype-appropriate tolerance.
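
"Dtype-appropriate tolerance" can be as simple as a lookup table feeding an allclose-style check. The numbers below are illustrative assumptions, not the project's actual thresholds, and the function name is hypothetical.

```python
# (rtol, atol) per dtype. Illustrative values only.
TOLERANCES = {
    "float32": (1e-5, 1e-5),
    "float16": (1e-3, 1e-3),
    "bfloat16": (1e-2, 1e-2),
}

def matches_reference(candidate, reference, dtype):
    """True if every element satisfies |c - r| <= atol + rtol * |r|."""
    rtol, atol = TOLERANCES[dtype]
    if len(candidate) != len(reference):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r)
               for c, r in zip(candidate, reference))
```

The same output can pass as float16 and fail as float32, which is the whole point: the bar moves with the precision the kernel claims to deliver.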

STEP 3/6

Test it against the reference

Random inputs, every time a backend changes — not a one-off check that gets forgotten once a kernel "works."
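
A randomized check of this kind might look like the sketch below, shown on RMSNorm in pure Python so it runs anywhere. The harness name, the trial count, and the tolerances are assumptions; in the real suite the candidate would be a Metal backend and the reference the plain MLX path.

```python
import math
import random

def rmsnorm_reference(x, weight, eps=1e-6):
    """Plain, obviously-correct RMSNorm."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) * w for v, w in zip(x, weight)]

def rmsnorm_candidate(x, weight, eps=1e-6):
    """Stand-in for an optimized backend: one reciprocal, then multiplies."""
    inv_rms = 1.0 / math.sqrt(math.fsum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def check_against_reference(candidate, reference, trials=100, seed=0,
                            rtol=1e-9, atol=1e-12):
    """Compare candidate and reference on fresh random inputs each trial."""
    rng = random.Random(seed)  # seeded so a failure can be replayed
    worst = 0.0
    for _ in range(trials):
        n = rng.randint(1, 64)
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        got, want = candidate(x, w), reference(x, w)
        for g, r in zip(got, want):
            err = abs(g - r)
            assert err <= atol + rtol * abs(r), f"mismatch: {g} vs {r}"
            worst = max(worst, err)
    return worst
```

Returning the worst observed error, rather than just pass or fail, makes it easier to notice a kernel drifting toward its tolerance before it crosses it.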

STEP 4/6

Only then, optimize

Row-parallel, tiled, threadgroup, simdgroup — each variant ships as an explicit, opt-in experiment, never a silent default.

STEP 5/6

Benchmark locally

Numbers come from this machine, this chip, this MLX version — never assumed, never carried over from somewhere else.
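
A local benchmark that records where it ran can be sketched with the standard library alone. The function names and the report fields are hypothetical, and this is not the repo's benchmark runner. One caveat for real use: MLX evaluates lazily, so the timed callable has to force the result (for example with `mx.eval`), or the loop will time graph construction instead of the kernel.

```python
import platform
import statistics
import time
from importlib import metadata

def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def benchmark(fn, warmup=3, iters=20):
    """Time `fn` and attach the machine details the numbers belong to."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "iters": iters,
        "machine": platform.machine(),
        "platform": platform.platform(),
        "mlx_version": package_version("mlx"),
    }
```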

STEP 6/6

Pick a backend on purpose

Explicit flags by default. An opt-in autotuner can choose machine-specific defaults — but only for backends that have earned it.
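
The "only for backends that have earned it" rule can be expressed as a filter in front of the tuner. The sketch below is hypothetical throughout (function name, cache key format, JSON file layout): it times only trusted candidates, stores the winner per chip, op, and shape, and falls back to the reference path when nothing qualifies.

```python
import json
import os

def choose_backend(op, shape_key, candidates, trusted, time_fn, cache_path, chip):
    """Pick the fastest trusted backend, caching the choice per chip/op/shape."""
    key = f"{chip}|{op}|{shape_key}"
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    # Reuse a cached choice only if that backend is still trusted.
    if cache.get(key) in trusted:
        return cache[key]
    eligible = [name for name in candidates if name in trusted]
    if not eligible:
        return "reference"
    best = min(eligible, key=time_fn)  # time_fn(name) -> measured seconds
    cache[key] = best
    with open(cache_path, "w") as f:
        json.dump(cache, f, indent=2)
    return best
```

Keying the cache on the chip keeps a choice measured on one machine from being carried over to another.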

05 / What this deliberately isn't

Scope discipline is a feature, not a disclaimer.

The project says no to a long list of things it could chase. On purpose.

A production inference engine, full stop — this is a kernel lab, not a serving stack.

GPTQ, AWQ, or SmoothQuant: there is no calibrated, model-quality-preserving quantization here, only correctness-first weight packing.

Hugging Face hub downloads or automatic tokenizer loading — everything stays local and dependency-light.

A complete checkpoint-to-served-model pipeline, or production safetensors conversion.

Any performance claim that doesn't come with a benchmark script and a machine it was measured on.

Every one of those is a real, useful project. None of them is this one — yet.

06 / Sign-off sheet

Twenty items shipped. Two left on the bench.

In progress

Optimized GQA Metal decode attention

Fused q4 MLP kernel

Shipped

Baseline MLX custom Metal attention

Reference correctness path

Row-parallel & tiled K/V attention

RMSNorm, RoPE, SwiGLU, residual

KV-cache update & decode attention

Paged KV-cache & paged decode

Fused decode block helpers

q4/q8 dequant & decode matvec

Parallel & tiled q4/q8 matvec

Quantized decode block

Threadgroup attention v2

Simdgroup attention experiments

Unified benchmark & report suite

Chip-specific autotuning

Toy transformer-layer decode bench

Llama-like model integration scaffold

Quantized MLP block

GQA / MQA support

Checkpoint layout loader scaffold

Real checkpoint adapter scaffold

07 / Run it yourself

It's a real, installable Python package — go break it.

zsh · apple-silicon · mlx-metal-kernels

$ pip install mlx pytest
$ pip install -e .
$ pytest tests -q
# every backend re-checked against its reference path
$ python examples/run_basic.py

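attention_demo.py · python

import mlx.core as mx
from ops.attention_ops import fast_attention

# Random float16 inputs; Q, K, and V all share one shape.
Q = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)
K = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)
V = mx.random.normal((1, 128, 8, 64)).astype(mx.float16)

# backend="auto" leaves the kernel choice to the library.
O = fast_attention(Q, K, V, causal=True, backend="auto")
mx.eval(O)  # MLX is lazy: force the computation to actually run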

No performance numbers are quoted here on purpose — the project's own rule is that a speedup only counts once it's measured on a specific Apple Silicon machine, with the benchmark script that produced it sitting right next to the claim. Run the suite, read docs/, and see what your own chip says.
