Long-Form Technical Deep-Dive · v2

KV Cache, Transformer Memory, and Why TurboQuant Matters

The KV cache is one of the most important objects in transformer inference — the model's running memory of prior tokens, the reason long-context serving is expensive, and the exact place where techniques like TurboQuant unlock major gains. This updated edition adds tiled-attention diagrams, outlier handling, hardware co-design, a method comparison, and a GPU-utilization tie-in.

19 min read · Topics: Transformers · KV Cache · Inference Systems · Quantization
Audience: Engineers, AI infra builders, curious practitioners
v2 — expanded edition with diagrams
Core Object

The KV cache stores attention state for every token at every layer — growing linearly with context length.

Core Problem

Memory traffic and cache size explode faster than most people expect, becoming the dominant inference bottleneck.

Core Opportunity

Dense, regular, tensor-shaped KV data is unusually compression-friendly — if you can fuse decompression with attention.

In This Post

The KV cache is not some side detail of transformer inference. It is the model's live memory, and increasingly the performance battle is about how efficiently we store it, move it, and read it back during attention — all while the GPU waits.

What the KV Cache Actually Stores

For every token a transformer processes, each layer produces a Key vector and a Value vector. Those vectors are needed later when future tokens attend back over earlier ones, so the model stores them rather than recomputing from scratch.

That stored memory is the KV cache. In compact notation:

[layer][token][head][dim]
KV cache = remembered attention state for every token at every layer.

If the model has seen thousands of tokens, every new token may need to compare itself against a very large set of previously stored keys, then blend the corresponding values. That is why KV state becomes so central to inference cost.

Understanding [layer][token][head][dim]

The KV cache is a four-dimensional tensor. Each axis has a clean meaning:

Dimension   Meaning
layer       Which transformer layer produced this representation
token       Which position in the sequence
head        Which attention head inside that layer
dim         Which numerical component inside the head vector

A toy example: 2 layers, 4 tokens, 2 heads, head-dim = 3 gives:

K[2][4][2][3]   → 2 × 4 × 2 × 3 = 48 key values

At real scale (80 layers, 128 K tokens, 128 heads, head-dim 128) the numbers become intimidating quickly — which is exactly the subject of Section 4.

What Happens During Attention

When a new token is generated, each head computes a Query vector Q and compares it against every stored Key:

scores = softmax( Q · Kᵀ / √d )
output  = scores · V

The dot products determine which earlier tokens matter most; the softmax-weighted sum of Values is what the head actually outputs. Multiple heads let different subspaces specialise — syntax, long-range references, entity tracking, code structure, and so on.
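To make the mechanics concrete, here is a minimal single-head attention step in plain Python. The vectors are toy values and there is no batching, masking, or multi-head logic; it is a sketch of the formula above, nothing more.

```python
import math

def attention(q, keys, values):
    """Single-head attention: softmax(q . K^T / sqrt(d)) . V."""
    d = len(q)
    # Dot product of the query against every cached key, scaled by sqrt(d)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Numerically stable softmax over the scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend the cached values by attention weight
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # token 0's key aligns with q, token 1's does not
values = [[1.0, 1.0], [0.0, 0.0]]
out = attention(q, keys, values)     # dominated by token 0's value
```

Token 0 gets roughly two-thirds of the attention weight because its key points in the same direction as the query, so the output leans toward its value vector.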

Why the KV Cache Gets Huge So Quickly

Scale the numbers to a large production model:

Parameter          Example value
Layers             80
Tokens (context)   128 K
Heads              128
Head dimension     128
Keys alone: 80 × 128,000 × 128 × 128 = ~167 billion elements
At FP16 (2 bytes each): ~335 GB  — just keys, one request

That does not include Values (same size again), model weights, activations, or other requests sharing the GPU. This is why the KV cache dominates memory budgets for long-context serving.
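These numbers are easy to check with a few lines of Python, using the same shapes as above:

```python
def kv_elements(layers, tokens, heads, head_dim):
    """Number of elements in one K (or V) cache: [layer][token][head][dim]."""
    return layers * tokens * heads * head_dim

# Toy shape from earlier: 2 layers, 4 tokens, 2 heads, head_dim 3
toy = kv_elements(2, 4, 2, 3)               # 48 key values

# Production-scale example: 80 layers, 128K context, 128 heads, head_dim 128
keys = kv_elements(80, 128_000, 128, 128)   # ~167.8 billion key elements
key_bytes_fp16 = keys * 2                   # FP16: ~335.5 GB, keys only
key_bytes_int4 = keys // 2                  # INT4: ~83.9 GB, a 4x saving
```

Doubling for Values and multiplying by concurrent requests is how a single GPU's HBM evaporates.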

How It Is Stored Physically on GPUs

The KV cache is not stored as nested arrays. Internally it becomes flat, contiguous GPU memory because that is what hardware prefers.

offset = (((layer × num_tokens + token) × num_heads + head) × head_dim) + dim
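One way to convince yourself the formula is right is to compare it against an explicit row-major enumeration of the same four axes (Python, reusing the toy shape from earlier):

```python
def flat_offset(layer, token, head, dim, num_tokens, num_heads, head_dim):
    """Row-major offset into the flat KV buffer, matching the formula above."""
    return ((layer * num_tokens + token) * num_heads + head) * head_dim + dim

# Toy shape [2][4][2][3]; pick layer 1, token 2, head 0, dim 1
L, T, H, D = 2, 4, 2, 3
off = flat_offset(1, 2, 0, 1, T, H, D)

# Cross-check: enumerate the 4-D indices in row-major order
order = [(l, t, h, d) for l in range(L) for t in range(T)
         for h in range(H) for d in range(D)]
assert order[off] == (1, 2, 0, 1)
```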

Why KV Cache Compression Is So Attractive

KV cache is a near-ideal systems target: dense, contiguous, highly regular, and tensor-shaped. That makes it far more compressible than pointer-heavy structures like trees or hash maps.

Property                 Why it helps compression
Contiguous layout        Sequential access and vectorised processing
Fixed-size vectors       Predictable packing and unpacking
Floating-point tensors   Natural quantisation target
Repeated structure       SIMD and GPU kernel friendly
Large blocks             Efficient for DMA and tiled execution

The catch: attention repeatedly reads the KV cache during generation. Compression only wins if decompression stays cheaper than moving the original bytes.

The Simplest Compression Strategy: Quantization

Most practical KV compression is not ZIP-style. It is simpler and more useful: store the cache in lower numeric precision.

Format                Bytes per value   vs FP16
FP16                  2                 baseline
FP8                   1                 2× smaller
INT8                  1                 2× smaller
INT4                  0.5               4× smaller
INT2 (experimental)   0.25              8× smaller
FP16 → INT4  ≈ 4× memory reduction
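A minimal sketch of what "store in lower precision" means for INT4: one shared scale maps floats onto 4-bit integer codes in [-8, 7]. Real kernels also pack two codes per byte and use finer-grained scales; both are omitted here.

```python
def quantize_int4(xs):
    """Symmetric 4-bit quantisation: one scale, integer codes in [-8, 7]."""
    scale = max(abs(x) for x in xs) / 7.0
    codes = [max(-8, min(7, round(x / scale))) for x in xs]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

xs = [0.52, -1.30, 0.88, 1.75]      # a toy slice of a key vector
codes, scale = quantize_int4(xs)    # each code fits in 4 bits
back = dequantize(codes, scale)
err = max(abs(a - b) for a, b in zip(xs, back))
# Worst-case rounding error is half a quantisation step (scale / 2)
assert err <= scale / 2 + 1e-12
```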

Attention is surprisingly tolerant of quantisation noise — especially for older, less frequently attended tokens. That has pushed systems toward mixed-precision KV strategies:

Recent tokens    → FP16
Older tokens     → INT8
Very old tokens  → INT4
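A policy like this is only a few lines of code. The age thresholds below are invented for illustration; real systems tune them per model and workload:

```python
def kv_precision(token_age):
    """Toy age-based precision policy; the thresholds are made up."""
    if token_age < 1024:    # freshest context: keep full precision
        return "fp16"
    if token_age < 8192:    # older context: 2x smaller
        return "int8"
    return "int4"           # very old context: 4x smaller

assert kv_precision(10) == "fp16"
assert kv_precision(4096) == "int8"
assert kv_precision(50_000) == "int4"
```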

The Outlier Problem — and How to Handle It

Quantisation looks clean in theory. In practice, KV tensors contain high-magnitude outliers that wreck naive uniform quantisation. These appear most prominently in Key vectors:

[Figure: Key vector magnitudes across a single token, with outlier channels circled. Most channels fall in a normal range of ≈ 0.5–2.0; outlier channels can be 10–100× larger.]

When you apply a single global scale factor to compress an entire tensor, the outlier forces a wide numerical range. Normal values get crammed into just a few quantisation bins, destroying precision where you actually need it.

Per-Channel and Per-Head Quantisation

The fix is to apply separate scale factors per channel, per head, or per token group, so that outlier channels get their own range and normal channels retain fine-grained resolution.

Per-channel quantisation
One scale per feature dimension (channel). Outlier channels are scaled independently. ~1 extra byte per channel per token for the scale — low overhead, large accuracy gain.
Per-head quantisation
One scale per attention head. Coarser than per-channel but cheaper storage. Works especially well when entire heads show consistently different magnitude distributions.

TurboQuant-style systems typically pair aggressive low-bit storage with per-channel or per-group scale factors stored alongside the compressed cache. The scale factors themselves are small (often FP16 or even FP8) and add modest overhead relative to the compression gains.
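The effect is easy to demonstrate numerically. Below, a single per-tensor scale shared with an outlier channel collapses every value of a normal channel into one quantisation bin, while a per-channel scale preserves it. Toy values, 4-bit symmetric quantisation:

```python
def quantize(xs, scale):
    """Symmetric 4-bit round-trip: quantise at the given scale, dequantise."""
    return [max(-8, min(7, round(x / scale))) * scale for x in xs]

# Two channels of a key vector: one normal (~1.0), one outlier (~50.0)
normal  = [0.9, -1.1, 0.6, 1.2]
outlier = [48.0, 52.0, -50.0, 49.0]

# Per-tensor: one scale shared by both channels; the outlier sets the range
shared = max(abs(x) for x in normal + outlier) / 7.0
err_shared = max(abs(a - b) for a, b in zip(normal, quantize(normal, shared)))

# Per-channel: the normal channel gets its own, much finer scale
own = max(abs(x) for x in normal) / 7.0
err_own = max(abs(a - b) for a, b in zip(normal, quantize(normal, own)))

# With the shared scale, every normal value rounds to the same code (zero)
assert err_shared > 10 * err_own
```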

Watch for attention sinks: token position 0 often behaves like a "dump" for attention weight — its Key vector can have extreme magnitudes. Some systems keep the first few tokens in full FP16 and only compress the rest.

Where FlashAttention and PagedAttention Fit In

These two ideas solve different problems and are often confused. It helps to be precise.

FlashAttention
  Main job: Make attention computation IO-efficient with tiled execution
  Why it matters: Reduces wasteful HBM traffic by keeping working sets local to fast SRAM

PagedAttention
  Main job: Manage KV cache storage like virtual memory pages
  Why it matters: Reduces fragmentation and improves utilisation under high concurrency

FlashAttention is about compute locality

Naive attention materialises a large intermediate score matrix and repeatedly bounces data through HBM. FlashAttention computes attention in tiles that fit into fast on-chip SRAM.

Q tile
↓
load K/V tile
↓
compute partial attention in SRAM / registers
↓
accumulate result into running softmax
↓
advance to next tile

The key insight is that the softmax normalisation can be computed incrementally — you never need the entire row of scores in memory at once.

PagedAttention is about KV cache management

Different requests grow to different lengths, batches change dynamically, and contexts need to be allocated, extended, and sometimes shared (e.g., for prefix caching). PagedAttention treats KV as fixed-size blocks — like OS virtual memory pages — rather than demanding one contiguous buffer per sequence. That reduces fragmentation and improves reuse.

FlashAttention optimises how attention is computed.  PagedAttention optimises how KV cache is stored and managed. TurboQuant-style systems need both: paging to manage fragmentation, tiling to fuse decompression with attention.

Tiled Attention — A Deep Dive

Tiling is the single most important idea enabling efficient compressed-KV attention. Let's go deeper than the standard summary.

Why tiling exists: the memory-bandwidth wall

An A100 GPU has ~2 TB/s of HBM bandwidth but can perform ~312 TFLOPS of FP16 compute. For a large KV cache, a naive attention pass spends most of its time waiting for memory, not computing. The arithmetic intensity of naive attention is too low — you move a lot of bytes to do very little math.

On-chip SRAM (shared memory) is ~20× faster than HBM but tiny (~192 KB per SM on an A100). Tiling is the strategy that lets a kernel exploit that fast memory despite its small size.

[Diagram: tiled attention data flow through the memory hierarchy. Compressed K and V tiles live in HBM (off-chip, ~2 TB/s) alongside the full-precision query. Each tile is loaded into SRAM (~192 KB, ~20× faster), where bitfields are unpacked and scale corrections applied to form a temporary FP16 K/V tile; QKᵀ scores and the weighted V sum are computed, the temporary tile is discarded, and partial sums accumulate across tiles before the output is written to HBM once.]

The tiling loop spelled out

for each Q tile (block of query vectors):
    running_sum  = 0
    running_max  = -∞

    for each KV tile in the sequence:
        load compressed K and V tile from HBM       ← only this tile
        unpack into SRAM / registers                ← temporary, local
        apply per-channel scale factors
        compute partial scores: S = Q · Kᵀ / √d
        update online softmax (running_max, running_sum)
        accumulate weighted V into output
        discard decompressed tile                   ← never lives in HBM

    write final output tile to HBM                  ← once per Q tile
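The loop above can be run end to end in plain Python. Here KV tiles are stored as integer codes with one scale per tile (a stand-in for real per-channel scales and bit-packing), dequantised one tile at a time, and folded into an online softmax. All values are toys; the point is that the tiled result matches a full decompress-then-attend reference exactly.

```python
import math

def dequant(tile, scale):
    """Per-tile dequantisation: integer codes back to floats (temporary)."""
    return [[c * scale for c in vec] for vec in tile]

def tiled_attention(q, k_tiles, v_tiles, scales):
    """One query attended over compressed KV tiles with an online softmax."""
    d = len(q)
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * d
    for k_tile, v_tile, scale in zip(k_tiles, v_tiles, scales):
        keys = dequant(k_tile, scale)   # decompressed copy lives only here
        vals = dequant(v_tile, scale)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        new_max = max(running_max, max(scores))
        corr = math.exp(running_max - new_max)  # exp(-inf) = 0.0 on tile one
        running_sum *= corr
        acc = [a * corr for a in acc]
        for s, v in zip(scores, vals):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max           # tile discarded; continue to the next
    return [a / running_sum for a in acc]

# Toy data: int8-style codes with one scale per tile (illustrative values)
q = [1.0, 0.5]
k_tiles = [[[10, -3], [4, 8]], [[-6, 2]]]
v_tiles = [[[5, 1], [2, 9]], [[7, -4]]]
scales = [0.1, 0.05]
out = tiled_attention(q, k_tiles, v_tiles, scales)

# Reference: decompress everything up front and run plain attention
keys = [k for t, s in zip(k_tiles, scales) for k in dequant(t, s)]
vals = [v for t, s in zip(v_tiles, scales) for v in dequant(t, s)]
scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(2) for k in keys]
m = max(scores)
z = sum(math.exp(s - m) for s in scores)
ref = [sum(math.exp(s - m) / z * v[i] for s, v in zip(scores, vals))
       for i in range(2)]
assert all(abs(a - b) < 1e-9 for a, b in zip(out, ref))
```

The decompressed tile never persists past its loop iteration, which is exactly the property that keeps the real kernels off the HBM round-trip.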

Tile size: the core engineering constraint

The tile size is not arbitrary. It must satisfy a hard constraint: the temporary decompressed K and V vectors, plus the query vectors and accumulators, must fit in the available SRAM budget of one Streaming Multiprocessor.

Tile too large: data spills from registers and SRAM into L2 or HBM, stalling execution and destroying the bandwidth savings.
Tile too small: launch overhead dominates, warp occupancy drops, and compute efficiency falls.

The optimal tile size depends on the SRAM budget, the number of active warps, the number of heads, the head dimension, and — crucially with compressed KV — the expansion factor of the decompression step. An INT4 element grows from half a byte to two bytes when unpacked to FP16, so a tile occupies 4× as many SRAM bytes decompressed as it did in transit, and a kernel that stages both the packed and unpacked copies fits fewer tokens per tile than its FP16 equivalent. That is why compressed-KV kernels require careful re-tuning of tile shapes and often use smaller tiles.
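A back-of-envelope sketch of that sizing constraint. The budget split and the staging strategy below are invented for illustration; real kernels also account for Q tiles, accumulators, occupancy, and bank conflicts.

```python
def tokens_per_tile(sram_bytes, head_dim, staged_bytes_per_elem):
    """How many tokens' K+V vectors fit in SRAM at the staged byte width."""
    per_token = 2 * head_dim * staged_bytes_per_elem   # one K + one V vector
    return int(sram_bytes // per_token)

budget = 96 * 1024   # assume half of a ~192 KB SM budget is left for KV staging

# FP16 kernel: tiles sit in SRAM at 2 bytes per element
fp16_tokens = tokens_per_tile(budget, 128, 2.0)
# INT4 kernel staging the packed tile (0.5 B) plus its FP16 expansion (2 B)
int4_tokens = tokens_per_tile(budget, 128, 2.5)
```

Under these assumptions the compressed kernel fits noticeably fewer tokens per tile, which is why its tile shapes must be re-derived rather than copied from an FP16 config.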

Online softmax: the algorithmic enabler

The reason tiling works across the sequence dimension (not just the batch or head dimension) is the online softmax trick from Milakov & Gimelshein (2018), formalised and popularised by the FlashAttention papers. The key identity is:

softmax(x)_i = exp(x_i - max(x)) / Σ exp(x_j - max(x))

You can update both max(x) and the normalisation constant Σ incrementally as new tiles arrive, without ever storing the full score row. This makes the memory footprint of the attention kernel O(tile_size) rather than O(sequence_length).
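The incremental update can be verified in a few lines; the tile boundaries below are chosen so the running max changes mid-stream and the rescaling path is actually exercised:

```python
import math

def online_softmax_denominator(xs, tile_size):
    """Streaming max(x) and sum(exp(x - max(x))) over fixed-size tiles."""
    m, s = float("-inf"), 0.0
    for i in range(0, len(xs), tile_size):
        tile = xs[i:i + tile_size]
        new_m = max(m, max(tile))
        # Rescale the partial sum whenever the running max improves
        s = s * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in tile)
        m = new_m
    return m, s

# The max arrives in the second tile, forcing a rescale of earlier work
xs = [0.3, 1.0, -2.0, 1.7, 5.0, 0.0, 3.3]
m, s = online_softmax_denominator(xs, tile_size=3)

# Two-pass reference over the full row
ref_m = max(xs)
ref_s = sum(math.exp(x - ref_m) for x in xs)
assert m == ref_m and abs(s - ref_s) < 1e-9
```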

Where TurboQuant Fits

TurboQuant attacks a very specific bottleneck extremely well: KV cache bandwidth and storage during decode-phase inference. It is a systems optimisation, not just a compression trick.

A useful layered view:

FlashAttention      → optimises how attention is computed (tiled execution)
PagedAttention      → optimises how KV is stored and allocated (block management)
TurboQuant          → optimises how many bytes those blocks contain (low-bit KV)
                       ↑ sits on top of both, benefits from both

The appeal is clear. If you can shrink keys and values aggressively while preserving benchmark quality, you unlock:

Larger context windows
Higher concurrency
Less HBM pressure
Better decode throughput
Lower serving cost per token

TurboQuant — Advantages and Disadvantages

Advantages

Memory bandwidth reduction: INT4 KV cuts bytes transferred per attention step by ~4×, which directly translates to higher throughput on memory-bandwidth-limited decode workloads.
Larger effective batch size: Smaller KV footprint means more concurrent requests fit in HBM, improving GPU utilisation under bursty agentic traffic.
Longer context windows: A 4× memory reduction roughly allows 4× longer contexts for the same GPU memory budget, enabling 128 K → 512 K context without additional hardware.
Compatible with paging: Compressed KV pages are structurally identical to uncompressed ones — just smaller. PagedAttention systems need minimal changes to page compressed blocks.
Benchmark accuracy: Several TurboQuant-family systems report near-lossless quality on standard language benchmarks with INT4 KV, especially with per-channel scales.
Workload-targeted: KV cache is a well-scoped optimisation target. Unlike weight quantisation, KV quantisation does not affect prefill compute paths.

Disadvantages

Dequantisation overhead: Every attention tile must unpack bits and apply scale factors. On memory-compute-balanced workloads (short context, high batch), this overhead can negate gains.
Kernel complexity: Fused compress-decompress-attention kernels are significantly harder to write, validate, and maintain than standard FlashAttention kernels. Correctness bugs can be subtle.
"Zero accuracy loss" is benchmark-scoped: Softmax can amplify small perturbations nonlinearly. Benchmark success ≠ lossless. Adversarial inputs or tasks that depend on precise score ratios may degrade.
Outlier sensitivity: Without per-channel scales, a single high-magnitude channel forces a wide quantisation range and crushes precision everywhere else. Proper outlier handling adds metadata storage and complicates kernels further.
Prefill not helped: The prefill (prompt processing) phase is typically compute-bound, not bandwidth-bound. TurboQuant only improves the decode phase, where memory bandwidth dominates.
Tile size re-tuning required: INT4 → FP16 expansion inside SRAM changes the effective tile size that fits in shared memory. Existing FlashAttention tile configs are not reusable without adjustment.
Does not help training: This is inference-only; training still faces activations, gradients, and optimizer state. A targeted inference win, not a universal memory solution.

Comparing KV Compression Methods

TurboQuant is not the only approach to KV memory reduction. The landscape includes several complementary or competing strategies, each with different tradeoffs.

[Diagram: KV compression strategy landscape, plotting compression aggressiveness against accuracy preserved for FP8 KV, INT8 (KIVI), TurboQuant INT4 + scales, KVQuant NF4/INT2, and SnapKV eviction.]
FP8 KV
  Strategy: Lower-precision storage (1 byte)
  Compression: 2× vs FP16
  Key strength: Simplest to implement; native hardware support on H100 / Blackwell
  Key weakness: Modest gain; outlier sensitivity still present

KIVI (INT8)
  Strategy: Per-channel INT8 with residual
  Compression: 2× vs FP16
  Key strength: Strong accuracy; low kernel complexity; residual correction for outliers
  Key weakness: 2× is modest vs 4× INT4; residual adds memory

TurboQuant (INT4)
  Strategy: Low-bit per-group quant + tiled decompress
  Compression: 4× vs FP16
  Key strength: Excellent bandwidth savings; fused decompress-attention kernel
  Key weakness: Kernel complexity; tile re-tuning needed; outlier handling required

KVQuant (NF4/INT2)
  Strategy: Learned non-uniform quantisation (NF4) or aggressive INT2
  Compression: 4–8× vs FP16
  Key strength: Best compression ratio; NF4 adapts to KV distributions
  Key weakness: Higher accuracy risk at INT2; complex calibration

SnapKV / H2O (eviction)
  Strategy: Drop low-salience tokens from the cache entirely
  Compression: Variable (task-dependent)
  Key strength: Can be combined with quantisation; semantic rather than numeric compression
  Key weakness: Irreversible; risky for tasks requiring exact token recall

GQA / MQA (architecture)
  Strategy: Share K/V across multiple query heads
  Compression: 8–32× on cache size
  Key strength: Zero quantisation noise; done at model design time
  Key weakness: Requires retraining; not applicable to existing models

Learned summarisation
  Strategy: Compress old context into learned memory objects
  Compression: Potentially unbounded
  Key strength: Handles arbitrarily long contexts; semantic lossiness can be acceptable
  Key weakness: Experimental; significant accuracy risk on long-recall tasks
Practical combinations: Production systems often stack multiple strategies — e.g., GQA to reduce heads, then INT4 quantisation to reduce per-head precision, then eviction to prune very old tokens. Each layer of optimisation is roughly multiplicative on memory savings.
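The "roughly multiplicative" claim is just arithmetic. A quick sketch, with every factor invented for illustration rather than measured:

```python
# All factors below are illustrative, not measured
gqa_factor   = 8     # e.g. 64 query heads sharing 8 KV head groups
quant_factor = 4     # FP16 -> INT4
evict_factor = 1.25  # e.g. dropping 20% of tokens as low-salience

combined = gqa_factor * quant_factor * evict_factor    # ~40x end to end

baseline_gb = 335.5                     # FP16 keys for the 80-layer example
compressed_gb = baseline_gb / combined  # well under 10 GB for the same context
```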

Hardware-Software Co-Design

Quantisation used to be a purely software trick applied on top of general-purpose hardware. That assumption is no longer valid. With NVIDIA Blackwell (GB200, B100) and AMD MI300X, low-bit arithmetic is first-class silicon.

Blackwell's FP4 and FP8 engines

Blackwell introduces native FP4 tensor core support. This fundamentally changes the TurboQuant math: instead of INT4 values being unpacked into FP16 before the matrix multiply, the multiply-accumulate can happen directly on FP4 operands. The practical implications:

FP4 directly in tensor cores
No explicit unpack step before GEMM. The dequantisation can be fused into the hardware pipeline rather than burning shader arithmetic. Effective throughput for KV attention roughly doubles versus software INT4.
FP8 for KV write-back
Writing freshly computed K/V vectors back to cache in FP8 (rather than FP16) halves the store bandwidth. Combined with FP4 reads, which shrink the dominant read path by 4×, the overall KV bandwidth bill falls to roughly a quarter of a naive FP16 baseline.
Software pipeline (pre-Blackwell), three separate steps:
  load INT4 from HBM → unpack INT4 to FP16 in shader code → FP16 GEMM in tensor cores → write FP16 output

Hardware pipeline (Blackwell), one fused step:
  load FP4 from HBM → FP4 GEMM fused natively in the tensor core → write FP8 back to cache

CXL and memory disaggregation

Beyond the GPU die itself, a second architectural shift is underway: Compute Express Link (CXL) allows KV cache pages to live in host DRAM and be directly accessed by the GPU without explicit CPU involvement. Combined with NVLink fabrics, this enables multi-GPU KV sharing — one server's KV cache can be read by a different GPU handling a different request batch.

This turns KV quantisation into a system-level optimisation: compressed INT4 pages are not just smaller in HBM; they are also cheaper to transfer across CXL links, reducing the bandwidth tax on disaggregated inference architectures.

Implications for kernel design

When the hardware natively handles FP4/FP8 arithmetic, the software stack needs to change too: quantisation formats must match the tensor cores' native operand layouts, dequantisation logic moves out of shader code and into the fused hardware path, and tile shapes must be re-derived around the new operand widths.

Bottom line: On pre-Blackwell hardware, TurboQuant is a software trade — you burn shader cycles to save bandwidth. On Blackwell, the trade nearly disappears because the hardware does both natively. Optimising for one generation's silicon may not carry forward unchanged.

The Real Tradeoffs Behind TurboQuant-Style Systems

The big story is not "compression good." The real story is that compression shifts cost from memory movement into kernel sophistication, quantisation logic, and runtime scheduling.

Benefit                            What you pay for it
Less memory footprint              More packing and unpacking logic; scale metadata storage
Lower bandwidth demand             Dequantisation overhead; outlier handling
Larger contexts                    More complex scheduling, paging, and tile re-tuning
Higher concurrency                 More sophisticated runtime and compiler behavior
HW-native efficiency (Blackwell)   Precision format must match tensor core requirements

Why Fully Decompressing Back Into HBM Defeats the Point

One of the deepest practical questions is whether the cache must be fully restored into global memory before attention can use it. If yes, most of the gain disappears:

Bad pipeline:
load compressed KV → decompress entire KV → store in HBM → run attention
(doubles bandwidth; negates compression)
Better pipeline:
load compressed chunk → dequantize tiny tile in SRAM/registers
→ immediately use in attention math → discard temporary

The decompressed KV never exists as one giant persistent tensor. Small pieces are staged in SRAM or register files, used immediately, then thrown away.

Why SRAM is the critical staging area

HBM:             largest capacity, ~2 TB/s, far from compute
L2 cache:        intermediate capacity and latency, shared across SMs
SRAM/shared mem: small (~192 KB per SM), an order of magnitude or more faster than HBM, private to one SM
Registers:       fastest of all, per-thread, extremely limited
(bandwidth figures are rough and vary by GPU generation)

The tiled decompress loop

for each KV tile:
    load packed low-bit KV from HBM
    unpack bitfields
    apply per-channel scales / corrections
    place temporary vectors in SRAM or registers
    run QKᵀ and weighted-V math
    discard temporary tile
    → continue to next tile

The tile size is chosen so the temporary decompressed working set fits the SM's SRAM budget. That sizing problem — accounting for INT4 → FP16 expansion — is one of the central engineering challenges in compressed-KV kernels.

How PagedAttention connects

PagedAttention manages the global cache organisation (which blocks live where), while tiled decompression manages the local compute path (how each block is consumed). They are complementary layers of the same stack.

Compressed KV in HBM pages → load packed page / tile → dequantize into SRAM / registers → tiled attention → discard tile → output

GPU Utilisation — The Hidden Culprit

Engineers monitoring inference infrastructure often notice that GPU utilisation numbers (SM active %, MFU, or hardware performance counter readings) tell a confusing story: the GPU appears underloaded even during heavy traffic. KV-cache-related bottlenecks are a frequent — and frequently overlooked — root cause.

KV fragmentation causes stall bubbles

In a PagedAttention-style system, the KV cache is divided into fixed-size blocks. As requests arrive, grow, and complete, blocks become allocated and freed in non-sequential patterns. If the block allocator runs out of contiguous free regions — or if the eviction policy is too aggressive — the scheduler stalls waiting for space, leaving SMs idle.

Tools like gpu-low-util-monitor will surface this as a sustained gap between hardware capability and observed throughput. The SM utilisation counter may read 35–50% while the workload seems theoretically heavier, precisely because attention steps are blocked waiting for KV allocation, not because the model is computationally light.

Dequantisation stalls under INT4

A second, subtler pattern appears when INT4 KV kernels are deployed without careful tile tuning. If the decompression step spills from registers into shared memory — or worse, into L2 — the kernel's arithmetic pipeline stalls, dropping SM utilisation even though the GPU is nominally "running attention." This manifests as low MFU (model FLOP utilisation), not as a scheduling issue. The hardware is occupied but not computing useful work.

Diagnostic pattern: if you observe low GPU utilisation during agentic workloads with long contexts, check for (a) KV allocation fragmentation in the block manager, (b) dequantisation kernel register spills via ncu, and (c) whether the batch scheduler is padding short sequences with empty KV blocks to maintain alignment.

How compression changes the utilisation profile

Counterintuitively, adding INT4 KV compression can increase measured GPU utilisation — not because the kernel runs faster per step, but because fewer HBM stalls mean more time actually computing. On memory-bandwidth-limited workloads the GPU transitions from "waiting for bytes" to "actually multiplying," which shows up as higher SM occupancy in profiling tools.

Where This Is Heading Next

Current KV compression is powerful, but still relatively early. Several directions look especially important.

Semantic compression

Not all tokens matter equally. Low-value filler tokens could be compressed far more aggressively than critical reasoning spans, named entities, or code structure. This requires attention-aware importance scoring at write time — expensive, but increasingly tractable.

Attention-aware precision assignment

Frequently attended tokens stay high precision; low-salience tokens get downgraded. This turns the KV cache into an adaptive memory system that mirrors the model's own attention distribution.

Hierarchical KV tiers

Once you think about the KV cache as a tiered memory system, transformer inference starts to look remarkably like OS storage hierarchy design:

SRAM (on-chip)   → active tile, FP16, sub-microsecond access
HBM              → hot KV cache, FP8,  ~microsecond
DRAM / CXL       → warm KV, INT4,  ~tens of microseconds
NVMe SSD         → cold / summarised memory, token-level eviction

A useful mental model: HBM is your desk (immediate work), DRAM is your room (accessible with a few steps), SSD is a storage locker (archived, but retrievable with some latency). The engineering challenge is managing eviction and retrieval policies across these tiers without introducing unacceptable decode latency spikes.

Learned compression

The most ambitious direction is to stop storing every old token literally. Instead, future systems may store compressed semantic state, learned summaries, or dynamically maintained memory objects — analogous to how humans compress past context into high-level abstractions rather than verbatim recall.

Hardware-software co-evolution

As Blackwell and future generations standardise FP4/FP8 tensor cores, quantisation will increasingly be specified in terms of hardware tile formats rather than abstract bit widths. The design space for TurboQuant-style systems will shift from "how do we avoid the unpack penalty" to "which precision format aligns best with the hardware's native data path."

The central question is no longer just whether we can compress the KV cache. It is whether we can compress it without making attention more expensive than the bytes we saved — and whether the silicon itself will eventually make that tradeoff disappear.