KV Cache, Transformer Memory, and Why TurboQuant Matters

The KV cache is one of the most important objects in transformer inference — the model's running memory of prior tokens, the reason long-context serving is expensive, and the exact place where techniques like TurboQuant unlock major gains. This updated edition adds tiled-attention diagrams, outlier handling, hardware co-design, a method comparison, and a GPU-utilization tie-in.

In This Post

- The KV cache stores attention state for every token at every layer — growing linearly with context length.
- Memory traffic and cache size explode faster than most people expect, becoming the dominant inference bottleneck.
- Dense, regular, tensor-shaped KV data is unusually compression-friendly — if you can fuse decompression with attention.

The KV cache is not some side detail of transformer inference. It is the model's live memory, and increasingly the performance battle is about how efficiently we store it, move it, and read it back during attention — all while the GPU waits.
What the KV Cache Actually Stores
For every token a transformer processes, each layer produces a Key vector and a Value vector. Those vectors are needed later when future tokens attend back over earlier ones, so the model stores them rather than recomputing from scratch.
That stored memory is the KV cache. In compact notation:
[layer][token][head][dim]
Understanding [layer][token][head][dim]
The KV cache is a four-dimensional tensor. Each axis has a clean meaning:
| Dimension | Meaning |
|---|---|
| layer | Which transformer layer produced this representation |
| token | Which position in the sequence |
| head | Which attention head inside that layer |
| dim | Which numerical component inside the head vector |
A toy example: 2 layers, 4 tokens, 2 heads, head-dim = 3 gives:
K[2][4][2][3] → 2 × 4 × 2 × 3 = 48 key values
At real scale (80 layers, 128K tokens, 128 heads, head-dim 128) the numbers become intimidating quickly, which is exactly the subject of "Why the KV Cache Gets Huge So Quickly" below.
What Happens During Attention
When a new token is generated, each head computes a Query vector Q and compares it
against every stored Key:
scores = softmax(Q · Kᵀ / √d)
output = scores · V
The dot products determine which earlier tokens matter most; the softmax-weighted sum of Values is what the head actually outputs. Multiple heads let different subspaces specialise — syntax, long-range references, entity tracking, code structure, and so on.
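To make the per-head computation concrete, here is a minimal NumPy sketch of single-head attention for one new query against a stored K/V cache. The array names and shapes are illustrative, not taken from any particular framework.

```python
import numpy as np

def attend(q, K, V):
    """Single-head attention of one new query against the cached K/V.

    q: (d,)        query vector for the token being generated
    K: (T, d)      cached Key vectors for the T previous tokens
    V: (T, d)      cached Value vectors for the same tokens
    returns: (d,)  softmax-weighted mix of the cached Values
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)      # (T,) dot products, scaled by √d
    scores -= scores.max()           # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()         # softmax over past tokens
    return weights @ V               # weighted sum of Values

# Toy usage: 4 cached tokens, head dimension 3
rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), rng.normal(size=3)
print(attend(q, K, V))
```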
Why the KV Cache Gets Huge So Quickly
Scale the numbers to a large production model:
| Parameter | Example Value |
|---|---|
| Layers | 80 |
| Tokens (context) | 128 K |
| Heads | 128 |
| Head dimension | 128 |
Keys alone: 80 × 128,000 × 128 × 128 ≈ 167 billion elements
At FP16 (2 bytes each): ~334 GB — just keys, one request
That does not include Values (same size again), model weights, activations, or other requests sharing the GPU. This is why the KV cache dominates memory budgets for long-context serving.
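A quick back-of-the-envelope helper makes the scaling explicit. The numbers below reproduce the example configuration from the table and are purely illustrative.

```python
def kv_cache_bytes(layers, tokens, heads, head_dim, bytes_per_value=2):
    """Bytes needed for Keys plus Values at the given precision."""
    elements_per_tensor = layers * tokens * heads * head_dim
    return 2 * elements_per_tensor * bytes_per_value   # ×2 for K and V

# Example configuration from the table above (FP16 = 2 bytes per value)
total = kv_cache_bytes(layers=80, tokens=128_000, heads=128, head_dim=128)
print(f"{total / 1e9:.0f} GB for K+V of a single 128K-token request")   # ~671 GB

# The same request with INT4 KV (0.5 bytes per value)
print(f"{kv_cache_bytes(80, 128_000, 128, 128, 0.5) / 1e9:.0f} GB")     # ~168 GB
```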
How It Is Stored Physically on GPUs
The KV cache is not stored as nested arrays. Internally it becomes flat, contiguous GPU memory because that is what hardware prefers.
offset = (((layer × num_tokens + token) × num_heads + head) × head_dim) + dim
- Contiguous memory allows GPUs to coalesce sequential reads into single transactions.
- Regular structure enables vectorised kernels and prefetching.
- Predictable layout is essential for large-scale DMA from HBM to compute units.
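As a sanity check, here is a tiny sketch showing that the flat offset formula above addresses the same element as 4-D indexing into a NumPy array with the toy shape from earlier. The variable names mirror the formula and are not from any specific runtime.

```python
import numpy as np

num_layers, num_tokens, num_heads, head_dim = 2, 4, 2, 3
K = np.arange(num_layers * num_tokens * num_heads * head_dim, dtype=np.float32)
K4d = K.reshape(num_layers, num_tokens, num_heads, head_dim)   # logical 4-D view

def flat_offset(layer, token, head, dim):
    # Same row-major formula as in the text
    return (((layer * num_tokens + token) * num_heads + head) * head_dim) + dim

layer, token, head, dim = 1, 2, 0, 2
assert K[flat_offset(layer, token, head, dim)] == K4d[layer, token, head, dim]
```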
Why KV Cache Compression Is So Attractive
KV cache is a near-ideal systems target: dense, contiguous, highly regular, and tensor-shaped. That makes it far more compressible than pointer-heavy structures like trees or hash maps.
| Property | Why It Helps Compression |
|---|---|
| Contiguous layout | Sequential access and vectorised processing |
| Fixed-size vectors | Predictable packing and unpacking |
| Floating-point tensors | Natural quantisation target |
| Repeated structure | SIMD and GPU kernel friendly |
| Large blocks | Efficient for DMA and tiled execution |
The Simplest Compression Strategy: Quantization
Most practical KV compression is not ZIP-style. It is simpler and more useful: store the cache in lower numeric precision.
| Format | Bytes per value | vs FP16 |
|---|---|---|
| FP16 | 2 | 1× |
| FP8 | 1 | 2× smaller |
| INT8 | 1 | 2× smaller |
| INT4 | 0.5 | 4× smaller |
| INT2 (experimental) | 0.25 | 8× smaller |
FP16 → INT4 ≈ 4× memory reduction
Attention is surprisingly tolerant of quantisation noise — especially for older, less frequently attended tokens. That has pushed systems toward mixed-precision KV strategies:
Recent tokens   → FP16
Older tokens    → INT8
Very old tokens → INT4
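A sketch of such an age-based policy might look like the following; the thresholds and function name are invented for illustration, and a real system would tune them per workload.

```python
def kv_precision_for_token(token_age, recent=512, old=8192):
    """Pick a storage precision for a cached token based on how long ago it was written.

    token_age: number of decode steps since this token's K/V were produced.
    Thresholds are illustrative, not taken from any published system.
    """
    if token_age < recent:
        return "fp16"   # newest tokens: keep full precision
    if token_age < old:
        return "int8"   # mid-range history: mild quantisation
    return "int4"       # distant history: aggressive quantisation
```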
The Outlier Problem — and How to Handle It
Quantisation looks clean in theory. In practice, KV tensors contain high-magnitude outliers that wreck naive uniform quantisation. These appear most prominently in:
- Attention sinks — the first few tokens (often <BOS> or punctuation) accumulate disproportionately large attention scores and correspondingly large Key magnitudes.
- Specific channels — certain feature dimensions consistently produce outliers across all tokens in a layer.
- Specific heads — some attention heads are intrinsically higher-magnitude than others.
When you apply a single global scale factor to compress an entire tensor, the outlier forces a wide numerical range. Normal values get crammed into just a few quantisation bins, destroying precision where you actually need it.
Per-Channel and Per-Head Quantisation
The fix is to apply separate scale factors per channel, per head, or per token group, so that outlier channels get their own range and normal channels retain fine-grained resolution.
- Per-channel: one scale per feature dimension (channel). Outlier channels are scaled independently. ~1 extra byte per channel per token for the scale — low overhead, large accuracy gain.
- Per-head: one scale per attention head. Coarser than per-channel but cheaper storage. Works especially well when entire heads show consistently different magnitude distributions.
TurboQuant-style systems typically pair aggressive low-bit storage with per-channel or per-group scale factors stored alongside the compressed cache. The scale factors themselves are small (often FP16 or even FP8) and add modest overhead relative to the compression gains.
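To make this concrete, here is a minimal NumPy sketch of symmetric per-channel INT4 quantisation and dequantisation for one tile of Key vectors. It illustrates the general idea rather than the TurboQuant algorithm itself, and the tile shape and scale format are assumptions.

```python
import numpy as np

def quantize_int4_per_channel(K_tile):
    """Symmetric per-channel INT4 quantisation of a (tokens, head_dim) Key tile.

    Returns int8-stored codes in [-8, 7] plus one FP16 scale per channel.
    (Real kernels would pack two 4-bit codes per byte; skipped here for clarity.)
    """
    scales = np.abs(K_tile).max(axis=0) / 7.0           # one scale per channel
    scales = np.maximum(scales, 1e-8).astype(np.float16)
    codes = np.clip(np.round(K_tile / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_int4_per_channel(codes, scales):
    return codes.astype(np.float32) * scales.astype(np.float32)

# One outlier channel no longer ruins the others:
rng = np.random.default_rng(0)
K_tile = rng.normal(scale=0.02, size=(64, 8)).astype(np.float32)
K_tile[:, 3] *= 50.0                                     # simulated outlier channel
codes, scales = quantize_int4_per_channel(K_tile)
err = np.abs(dequantize_int4_per_channel(codes, scales) - K_tile).max(axis=0)
print(err)   # per-channel error stays proportional to each channel's own range
```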
Where FlashAttention and PagedAttention Fit In
These two ideas solve different problems and are often confused. It helps to be precise.
| System | Main Job | Why It Matters |
|---|---|---|
| FlashAttention | Make attention computation IO-efficient with tiled execution | Reduces wasteful HBM traffic by keeping working sets local to fast SRAM |
| PagedAttention | Manage KV cache storage like virtual memory pages | Reduces fragmentation and improves utilisation under high concurrency |
FlashAttention is about compute locality
Naive attention materialises a large intermediate score matrix and repeatedly bounces data through HBM. FlashAttention computes attention in tiles that fit into fast on-chip SRAM.
Q tile
  ↓ load K/V tile
  ↓ compute partial attention in SRAM / registers
  ↓ accumulate result into running softmax
  ↓ advance to next tile
The key insight is that the softmax normalisation can be computed incrementally — you never need the entire row of scores in memory at once.
PagedAttention is about KV cache management
Different requests grow to different lengths, batches change dynamically, and contexts need to be allocated, extended, and sometimes shared (e.g., for prefix caching). PagedAttention treats KV as fixed-size blocks — like OS virtual memory pages — rather than demanding one contiguous buffer per sequence. That reduces fragmentation and improves reuse.
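A toy sketch of the bookkeeping involved: a block table mapping each sequence to fixed-size KV blocks, in the spirit of PagedAttention but greatly simplified. The class, block size, and method names are invented for illustration.

```python
class KVBlockAllocator:
    """Toy PagedAttention-style allocator: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> [block ids]
        self.lengths = {}                            # seq_id -> tokens written

    def append_token(self, seq_id):
        """Reserve space for one more token, grabbing a new block if the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_tokens == 0:          # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: scheduler must wait or evict")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        block = table[length // self.block_tokens]
        return block, length % self.block_tokens     # (physical block, slot within block)

    def free_sequence(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```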
Tiled Attention — A Deep Dive
Tiling is the single most important idea enabling efficient compressed-KV attention. Let's go deeper than the standard summary.
Why tiling exists: the memory-bandwidth wall
An A100 GPU has ~2 TB/s of HBM bandwidth but can perform ~312 TFLOPS of FP16 compute. For a large KV cache, a naive attention pass spends most of its time waiting for memory, not computing. The arithmetic intensity of naive attention is too low — you move a lot of bytes to do very little math.
On-chip SRAM (shared memory) is ~20× faster than HBM but tiny (~192 KB per SM on an A100). Tiling is the strategy that lets a kernel exploit that fast memory despite its small size.
The tiling loop spelled out
for each Q tile (block of query vectors):
running_sum = 0
running_max = -∞
for each KV tile in the sequence:
load compressed K and V tile from HBM ← only this tile
unpack into SRAM / registers ← temporary, local
apply per-channel scale factors
compute partial scores: S = Q · Kᵀ / √d
update online softmax (running_max, running_sum)
accumulate weighted V into output
discard decompressed tile ← never lives in HBM
write final output tile to HBM ← once per Q tile
Tile size: the core engineering constraint
The tile size is not arbitrary. It must satisfy a hard constraint: the temporary decompressed K and V vectors, plus the query vectors and accumulators, must fit in the available SRAM budget of one Streaming Multiprocessor.
The optimal tile size depends on the SRAM budget, the number of active warps, the number of heads, the head dimension, and — crucially with compressed KV — the expansion factor of the decompression step. Unpacking INT4 into FP16 quadruples the byte count relative to the packed form, so tile sizes must be chosen from the decompressed footprint, not the compressed one. That is why compressed-KV kernels require careful re-tuning of tile shapes and often use smaller tiles than their FP16 equivalents.
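A rough budgeting sketch shows why the expansion factor matters. The SRAM figure matches the A100 number quoted above; the rest of the accounting is deliberately simplified (single head, no double buffering, FP32 accumulator only).

```python
def max_kv_tokens_per_tile(sram_bytes=192 * 1024, head_dim=128,
                           q_tile_tokens=64, kv_bytes_per_elem=2):
    """Roughly how many K+V tokens fit in one SM's SRAM alongside a Q tile.

    Deliberately simplified: single head, FP16 Q tile, FP32 output accumulator,
    K and V staged at `kv_bytes_per_elem` after any decompression.
    """
    q_bytes = q_tile_tokens * head_dim * 2          # FP16 query tile
    acc_bytes = q_tile_tokens * head_dim * 4        # FP32 output accumulator
    remaining = sram_bytes - q_bytes - acc_bytes
    per_token = 2 * head_dim * kv_bytes_per_elem    # K and V for one token
    return remaining // per_token

print(max_kv_tokens_per_tile(kv_bytes_per_elem=2))   # K/V staged as FP16 after unpack
print(max_kv_tokens_per_tile(kv_bytes_per_elem=1))   # K/V kept at 1 byte/elem in SRAM
```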
Online softmax: the algorithmic enabler
The reason tiling works across the sequence dimension (not just the batch or head dimension) is the online softmax trick from Milakov & Gimelshein (2018), formalised and popularised by the FlashAttention papers. The key identity is:
softmax(x)_i = exp(x_i - max(x)) / Σ exp(x_j - max(x))
You can update both max(x) and the normalisation constant Σ incrementally as new tiles arrive, without ever storing the full score row. This makes the memory footprint of the attention kernel O(tile_size) rather than O(sequence_length).
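A compact NumPy sketch of the online update, processing one query's score row tile by tile; the tile size and array names are arbitrary, and a real kernel would fuse this with the score computation shown in the tiling loop above.

```python
import numpy as np

def online_softmax_weighted_sum(scores, V, tile=4):
    """Streaming softmax(scores) @ V without materialising the full weight row.

    scores: (T,) attention scores for one query; V: (T, d) cached Values.
    Maintains a running max, running normaliser, and rescaled accumulator.
    """
    m = -np.inf                        # running max of scores seen so far
    denom = 0.0                        # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])         # running un-normalised output
    for start in range(0, len(scores), tile):
        s, v = scores[start:start + tile], V[start:start + tile]
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)           # rescale previous partial results
        w = np.exp(s - m_new)                    # weights for this tile
        denom = denom * correction + w.sum()
        acc = acc * correction + w @ v
        m = m_new
    return acc / denom

# Matches the naive computation:
rng = np.random.default_rng(0)
scores, V = rng.normal(size=10), rng.normal(size=(10, 3))
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
assert np.allclose(online_softmax_weighted_sum(scores, V), ref)
```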
Where TurboQuant Fits
TurboQuant attacks a very specific bottleneck extremely well: KV cache bandwidth and storage during decode-phase inference. It is a systems optimisation, not just a compression trick.
A useful layered view:
FlashAttention → optimises how attention is computed (tiled execution)
PagedAttention → optimises how KV is stored and allocated (block management)
TurboQuant → optimises how many bytes those blocks contain (low-bit KV)
↑ sits on top of both, benefits from both
The appeal is clear. If you can shrink keys and values aggressively while preserving benchmark quality, you unlock the gains catalogued in the advantages table below.
TurboQuant — Advantages and Disadvantages
Advantages
| Advantage | Detail |
|---|---|
| Memory bandwidth reduction | INT4 KV cuts bytes transferred per attention step by ~4×, which directly translates to higher throughput on memory-bandwidth-limited decode workloads. |
| Larger effective batch size | Smaller KV footprint means more concurrent requests fit in HBM, improving GPU utilisation under bursty agentic traffic. |
| Longer context windows | A 4× memory reduction roughly allows 4× longer contexts for the same GPU memory budget, enabling 128 K → 512 K context without additional hardware. |
| Compatible with paging | Compressed KV pages are structurally identical to uncompressed ones — just smaller. PagedAttention systems need minimal changes to page compressed blocks. |
| Benchmark accuracy | Several TurboQuant-family systems report near-lossless quality on standard language benchmarks with INT4 KV, especially with per-channel scales. |
| Workload-targeted | KV cache is a well-scoped optimisation target. Unlike weight quantisation, KV quantisation does not affect prefill compute paths. |
Disadvantages
| Disadvantage | Why It Matters |
|---|---|
| Dequantisation overhead | Every attention tile must unpack bits and apply scale factors. On memory-compute-balanced workloads (short context, high batch), this overhead can negate gains. |
| Kernel complexity | Fused compress-decompress-attention kernels are significantly harder to write, validate, and maintain than standard FlashAttention kernels. Correctness bugs can be subtle. |
| "Zero accuracy loss" is benchmark-scoped | Softmax can amplify small perturbations nonlinearly. Benchmark success ≠ lossless. Adversarial inputs or tasks that depend on precise score ratios may degrade. |
| Outlier sensitivity | Without per-channel scales, a single high-magnitude channel forces a wide quantisation range and crushes precision everywhere else. Proper outlier handling adds metadata storage and complicates kernels further. |
| Prefill not helped | The prefill (prompt processing) phase is typically compute-bound, not bandwidth-bound. TurboQuant only improves the decode phase, where memory bandwidth dominates. |
| Tile size re-tuning required | INT4 → FP16 expansion inside SRAM changes the effective tile size that fits in shared memory. Existing FlashAttention tile configs are not reusable without adjustment. |
| Does not help training | Inference-only: training still faces activations, gradients, and optimizer state. This is a targeted inference win, not a universal memory solution. |
Comparing KV Compression Methods
TurboQuant is not the only approach to KV memory reduction. The landscape includes several complementary or competing strategies, each with different tradeoffs.
| Method | Strategy | Compression | Key Strength | Key Weakness |
|---|---|---|---|---|
| FP8 KV | Lower-precision storage (1 byte) | 2× vs FP16 | Simplest to implement; native hardware support on H100 / Blackwell | Modest gain; outlier sensitivity still present |
| KIVI (INT8) | Per-channel INT8 with residual | 2× vs FP16 | Strong accuracy; low kernel complexity; residual correction for outliers | 2× is modest vs 4× INT4; residual adds memory |
| TurboQuant (INT4) | Low-bit per-group quant + tiled decompress | 4× vs FP16 | Excellent bandwidth savings; fused decompress-attention kernel | Kernel complexity; tile re-tuning needed; outlier handling required |
| KVQuant (NF4/INT2) | Learned non-uniform quantisation (NF4) or aggressive INT2 | 4–8× vs FP16 | Best compression ratio; NF4 adapts to KV distributions | Higher accuracy risk at INT2; complex calibration |
| SnapKV / H2O (Eviction) | Drop low-salience tokens from the cache entirely | Variable (task-dependent) | Can be combined with quantisation; semantic rather than numeric compression | Irreversible; risky for tasks requiring exact token recall |
| GQA / MQA (Architecture) | Share K/V across multiple query heads | 8–32× on cache size | Zero quantisation noise; done at model design time | Requires retraining; not applicable to existing models |
| Learned Summarisation | Compress old context into learned memory objects | Potentially unbounded | Handles arbitrarily long contexts; semantic lossiness can be acceptable | Experimental; significant accuracy risk on long-recall tasks |
Hardware-Software Co-Design
Quantisation used to be a purely software trick applied on top of general-purpose hardware. That assumption is no longer valid. With NVIDIA Blackwell (GB200, B100) and AMD MI300X, low-bit arithmetic is first-class silicon.
Blackwell's FP4 and FP8 engines
Blackwell introduces native FP4 tensor core support. This fundamentally changes the TurboQuant math: instead of INT4 values being unpacked into FP16 before the matrix multiply, the multiply-accumulate can happen directly on FP4 operands. The practical implications:
- No explicit unpack step before GEMM. The dequantisation can be fused into the hardware pipeline rather than burning shader arithmetic. Effective throughput for KV attention roughly doubles versus software INT4.
- FP8 write-back. Writing freshly computed K/V vectors back to cache in FP8 (rather than FP16) halves the store bandwidth. Combined with FP4 reads, the net KV bandwidth saving approaches 8× vs a naive FP16 baseline.
CXL and memory disaggregation
Beyond the GPU die itself, a second architectural shift is underway: Compute Express Link (CXL) allows KV cache pages to live in host DRAM and be directly accessed by the GPU without explicit CPU involvement. Combined with NVLink fabrics, this enables multi-GPU KV sharing — one server's KV cache can be read by a different GPU handling a different request batch.
This turns KV quantisation into a system-level optimisation: compressed INT4 pages are not just smaller in HBM; they are also cheaper to transfer across CXL links, reducing the bandwidth tax on disaggregated inference architectures.
Implications for kernel design
When the hardware natively handles FP4/FP8 arithmetic, the software stack needs to change too:
- Quantisation granularity must align with tensor core tile requirements (typically multiples of 16 or 32 elements per group).
- Scale factors must be stored in formats the hardware can consume directly, not just software-readable FP32.
- Compiler passes (PTX, SASS) must recognise and fuse the decompress-GEMM pattern to avoid extra register traffic.
The Real Tradeoffs Behind TurboQuant-Style Systems
The big story is not "compression good." The real story is that compression shifts cost from memory movement into kernel sophistication, quantisation logic, and runtime scheduling.
| Benefit | What You Pay For It |
|---|---|
| Less memory footprint | More packing and unpacking logic; scale metadata storage |
| Lower bandwidth demand | Dequantisation overhead; outlier handling |
| Larger contexts | More complex scheduling, paging, and tile re-tuning |
| Higher concurrency | More sophisticated runtime and compiler behavior |
| HW-native efficiency (Blackwell) | Precision format must match tensor core requirements |
Why Fully Decompressing Back Into HBM Defeats the Point
One of the deepest practical questions is whether the cache must be fully restored into global memory before attention can use it. If yes, most of the gain disappears:
Bad pipeline: load compressed KV → decompress entire KV → store in HBM → run attention (doubles bandwidth; negates compression)
Better pipeline: load compressed chunk → dequantize tiny tile in SRAM/registers → immediately use in attention math → discard temporary
The decompressed KV never exists as one giant persistent tensor. Small pieces are staged in SRAM or register files, used immediately, then thrown away.
Why SRAM is the critical staging area
HBM: large, ~2 TB/s bandwidth, far from compute
L2 cache: intermediate, ~20 TB/s, shared across SMs
SRAM/shared mem: small (~192 KB/SM), ~80 TB/s, private to SM
Registers: fastest (~80+ TB/s effective), per-thread, extremely limited
The tiled decompress loop
for each KV tile:
load packed low-bit KV from HBM
unpack bitfields
apply per-channel scales / corrections
place temporary vectors in SRAM or registers
run QKᵀ and weighted-V math
discard temporary tile
→ continue to next tile
The tile size is chosen so the temporary decompressed working set fits the SM's SRAM budget. That sizing problem — accounting for INT4 → FP16 expansion — is one of the central engineering challenges in compressed-KV kernels.
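Putting the pieces together, here is a self-contained NumPy sketch of the loop above for a single query: each tile of INT4-coded Keys and Values is dequantised locally, folded into an online softmax, and discarded. Array names, the packing scheme (one code per int8 element rather than two per byte), and the tile size are illustrative assumptions, not a kernel specification.

```python
import numpy as np

def attend_over_quantized_kv(q, K_codes, V_codes, k_scales, v_scales, tile=64):
    """One query attending over INT4-coded K/V, one tile at a time.

    q: (d,) FP32 query; K_codes/V_codes: (T, d) int8 arrays holding 4-bit codes;
    k_scales/v_scales: (d,) per-channel scales. Only one tile is ever dequantised.
    """
    d = q.shape[-1]
    m, denom, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, K_codes.shape[0], tile):
        # "Load + unpack" one tile: dequantise into a small temporary
        K_t = K_codes[start:start + tile].astype(np.float32) * k_scales
        V_t = V_codes[start:start + tile].astype(np.float32) * v_scales
        s = K_t @ q / np.sqrt(d)                 # partial scores for this tile
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)                 # rescale previous partials
        w = np.exp(s - m_new)
        denom = denom * corr + w.sum()
        acc = acc * corr + w @ V_t
        m = m_new
        # K_t / V_t go out of scope here: the decompressed tile is never kept
    return acc / denom
```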
How PagedAttention connects
PagedAttention manages the global cache organisation (which blocks live where), while tiled decompression manages the local compute path (how each block is consumed). They are complementary layers of the same stack.
GPU Utilisation — The Hidden Culprit
Engineers monitoring inference infrastructure often notice that GPU utilisation numbers (SM active %, MFU, or hardware performance counter readings) tell a confusing story: the GPU appears underloaded even during heavy traffic. KV-cache-related bottlenecks are a frequent — and frequently overlooked — root cause.
KV fragmentation causes stall bubbles
In a PagedAttention-style system, the KV cache is divided into fixed-size blocks. As requests arrive, grow, and complete, blocks become allocated and freed in non-sequential patterns. If the block allocator runs out of contiguous free regions — or if the eviction policy is too aggressive — the scheduler stalls waiting for space, leaving SMs idle.
Tools like gpu-low-util-monitor will surface this as a sustained gap between
hardware capability and observed throughput. The SM utilisation counter may read 35–50% while
the workload seems theoretically heavier, precisely because attention steps are blocked waiting
for KV allocation, not because the model is computationally light.
Dequantisation stalls under INT4
A second, subtler pattern appears when INT4 KV kernels are deployed without careful tile tuning. If the decompression step spills from registers into shared memory — or worse, into L2 — the kernel's arithmetic pipeline stalls, dropping SM utilisation even though the GPU is nominally "running attention." This manifests as low MFU (model FLOP utilisation), not as a scheduling issue. The hardware is occupied but not computing useful work.
Diagnosing this pattern means profiling the kernel with a tool such as ncu and checking whether the batch scheduler is padding short sequences with empty KV blocks to maintain alignment.
How compression changes the utilisation profile
Counterintuitively, adding INT4 KV compression can increase measured GPU utilisation — not because the kernel runs faster per step, but because fewer HBM stalls mean more time actually computing. On memory-bandwidth-limited workloads the GPU transitions from "waiting for bytes" to "actually multiplying," which shows up as higher SM occupancy in profiling tools.
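One lightweight way to see this shift is to sample the coarse NVML utilisation counters while a decode workload runs. This sketch uses the pynvml bindings and only shows trend direction: the "GPU util" counter cannot distinguish useful FLOPs from stalled warps, so ncu or an MFU estimate is still needed for the full picture.

```python
import time
import pynvml

def sample_gpu_utilization(seconds=30, interval=0.5, device_index=0):
    """Periodically read NVML's coarse GPU/memory utilisation counters.

    Note: these counters only say a kernel was resident, not that it was doing
    useful math; use ncu or an MFU estimate to distinguish stalls from work.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    try:
        for _ in range(int(seconds / interval)):
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            samples.append((util.gpu, util.memory))   # percent busy: SMs, memory controller
            time.sleep(interval)
    finally:
        pynvml.nvmlShutdown()
    return samples

# Run once with FP16 KV and once with INT4 KV enabled, then compare the averages.
```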
Where This Is Heading Next
Current KV compression is powerful, but still relatively early. Several directions look especially important.
Semantic compression
Not all tokens matter equally. Low-value filler tokens could be compressed far more aggressively than critical reasoning spans, named entities, or code structure. This requires attention-aware importance scoring at write time — expensive, but increasingly tractable.
Attention-aware precision assignment
Frequently attended tokens stay high precision; low-salience tokens get downgraded. This turns the KV cache into an adaptive memory system that mirrors the model's own attention distribution.
Hierarchical KV tiers
Once you think about the KV cache as a tiered memory system, transformer inference starts to look remarkably like OS storage hierarchy design:
SRAM (on-chip) → active tile, FP16, sub-microsecond access
HBM            → hot KV cache, FP8, ~microsecond
DRAM / CXL     → warm KV, INT4, ~tens of microseconds
NVMe SSD       → cold / summarised memory, token-level eviction
A useful mental model: HBM is your desk (immediate work), DRAM is your room (accessible with a few steps), SSD is a storage locker (archived, but retrievable with some latency). The engineering challenge is managing eviction and retrieval policies across these tiers without introducing unacceptable decode latency spikes.
Learned compression
The most ambitious direction is to stop storing every old token literally. Instead, future systems may store compressed semantic state, learned summaries, or dynamically maintained memory objects — analogous to how humans compress past context into high-level abstractions rather than verbatim recall.
Hardware-software co-evolution
As Blackwell and future generations standardise FP4/FP8 tensor cores, quantisation will increasingly be specified in terms of hardware tile formats rather than abstract bit widths. The design space for TurboQuant-style systems will shift from "how do we avoid the unpack penalty" to "which precision format aligns best with the hardware's native data path."
The central question is no longer just whether we can compress the KV cache. It is whether we can compress it without making attention more expensive than the bytes we saved — and whether the silicon itself will eventually make that tradeoff disappear.