
The Next AI Bottleneck Isn't FLOPs. It's Attention Memory.

A full-stack, systems-level deep dive into why KV cache growth is becoming the central constraint in large-model inference — and what the industry is doing to escape it.

Figure: bottleneck pressure (relative) versus context length × concurrency, from 1K to 256K tokens. Static model weights stay flat while the dynamic KV cache grows past the economic crossover.

The Uncomfortable Scaling Wall

For a decade the AI story was seductively simple: more parameters, more data, more compute. Researchers scaled, things got better. But production inference is exposing a different, far more expensive constraint — the cost of remembering context.

In a transformer decoder, every generated token produces two vectors at every layer: a Key and a Value. These are written to the KV cache and held in GPU HBM so future tokens can attend back to the past. This isn't an implementation detail you can ignore — it is the live memory state of the conversation, and it grows continuously.

Weights are large, but weights are static during inference. You load them once into HBM, they stay resident, done. The KV cache is fundamentally different. It grows with every new token, multiplies across every concurrent user, branches across every agentic loop, and must be accessible at memory-bus latency on every decode step.

The key mental model: model weights are the static executable; KV cache is the dynamic working set. You can think of it like RAM vs a compiled binary — you only need one copy of the binary, but RAM grows with every process, thread, and session.
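To make the "dynamic working set" concrete, here is a minimal single-head, single-layer decode loop in NumPy. Everything in it is an illustrative stand-in (random weights, no batching, no positional encoding), not a real model; the point is simply that the cache grows by one row per token and every decode step reads all of it.

import numpy as np

d = 128                                  # head dimension (illustrative)
k_cache = np.zeros((0, d))               # grows by one row per generated token
v_cache = np.zeros((0, d))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_t, w_q, w_k, w_v):
    # One decode step: append this token's K and V, then attend over the whole cache.
    global k_cache, v_cache
    q = x_t @ w_q                                # query for the current token
    k_cache = np.vstack([k_cache, x_t @ w_k])    # cache grows: now [t, d]
    v_cache = np.vstack([v_cache, x_t @ w_v])
    scores = softmax(k_cache @ q / np.sqrt(d))   # reads every cached K...
    return scores @ v_cache                      # ...and every cached V

rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
for _ in range(4):
    out = decode_step(rng.standard_normal(d), w_q, w_k, w_v)
print(k_cache.shape)   # (4, 128): one K row per token, repeated per layer in a real model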

The Numbers Are Brutal

~2.5 GiB
KV cache for a single 8K-token conversation on a 70B GQA model (BF16)
~80 GiB
KV cache for 32 concurrent users at 8K tokens — fills an H100 SXM alone
~320 GiB
KV cache for 32 users at 32K context — KV must now be sharded or offloaded
>640 GiB
Agentic workload at 32K tokens with 64 active reasoning branches

Illustrative estimates: ~80 layers, 8 KV heads, 128 head dim, 2 bytes/value (BF16), standard GQA. Multi-head attention can be 8–16× larger; MQA reduces it further.

Illustrative KV Cache Sizes (70B-class GQA decoder, BF16 KV)

Scenario               | Context        | Concurrency | Est. KV Cache | System implication
Single chatbot turn    | 2K tokens      | 1 user      | ~640 MiB      | Trivial. Weights dominate.
Long chat session      | 8K tokens      | 1 user      | ~2.5 GiB      | Manageable, but multiplies at batch scale.
Production batch       | 8K tokens      | 32 users    | ~80 GiB       | KV alone can fill an H100 80GB. Weights must share space.
Long-context assistant | 32K tokens     | 32 users    | ~320 GiB      | Requires multi-GPU KV sharding or offload.
Agentic workflow       | 32K + branches | 64 branches | >640 GiB      | Branches fork KV. Memory grows super-linearly with reasoning depth.
1M context (frontier)  | 1M tokens      | 8 users     | >2 TiB        | Impossible in a single GPU's HBM. Requires offload, compression, or retrieval.
Figure: KV cache, every token leaves memory behind (per layer, per user, per branch). Every token appends K and V at every layer; decode step N must read all prior tokens across all layers, i.e. O(N × layers) memory reads per token. Total KV size scales with L × N × d.

Three Flavors of Pressure


Capacity Pressure

KV state competes directly with model weights, activations, batching buffers, and runtime overhead inside scarce, expensive HBM. You can't simply buy your way out — an H100 SXM has 80GB, and that number isn't growing as fast as context windows.

Bandwidth Pressure

Even when the KV fits, repeatedly reading it turns decode into a memory-bandwidth bottleneck rather than a tensor-core bottleneck. At 3.35 TB/s (H100 HBM3), you can saturate bandwidth long before you saturate FLOP capacity — leaving those $30K tensor cores idle.


Scheduling Pressure

Serving systems must continuously decide what stays hot, what gets paged, what gets offloaded, and which requests can be batched without wrecking latency SLOs. This is a new, complex optimization problem that didn't exist in the compute-bound era.

Why Standard Attention Creates This Problem

The attention mechanism is one of the most beautiful ideas in modern machine learning. It is also, at scale, one of the most expensive memory operations ever deployed in production.

When a new token is generated, the model asks: "which prior tokens should I weight most heavily?" It answers by computing a dot product between the current query vector and every previously computed key vector, applying softmax, and mixing value vectors by those weights.

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

Where:
  Q = query vector for current token [1 × d_k]
  K = all prior key vectors [N × d_k] ← grows with context
  V = all prior value vectors [N × d_v] ← grows with context
  N = sequence length (can be 1M+ at frontier)
  d_k = attention head dimension (typically 64–128)

Memory read per decode step = 2 × N × d_k × num_kv_heads × num_layers × bytes_per_element

That equation looks elegant. In systems terms, it means: to generate each new token, the inference engine must load all prior K and V vectors from memory. Every. Single. Step. There is no shortcut in vanilla attention — you cannot skip tokens, summarize them on the fly, or avoid the read. The past must be present.

The bandwidth trap: For a 70B-class GQA model at 8K context, decoding a single token requires reading the entire KV cache, roughly ~2.7 GB, from HBM. At H100's 3.35 TB/s HBM bandwidth, that's ~0.8ms of pure memory reads — before any compute. At 32K context the cache is ~10.7 GB, or ~3.2ms per step, capping a single request at ~312 tokens/sec theoretical max from bandwidth alone, regardless of FLOP capacity.
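Sanity-checking those numbers in a few lines, using the same assumptions as elsewhere in this article (80 layers, 8 KV heads, head dim 128, BF16 KV, roughly 3.35 TB/s of HBM bandwidth):

def kv_bytes(ctx, layers=80, kv_heads=8, d_head=128, bytes_per=2):
    return 2 * layers * ctx * kv_heads * d_head * bytes_per   # K and V tensors

HBM_BW = 3.35e12   # bytes/sec, approximate H100 SXM HBM3 bandwidth

for ctx in (8_192, 32_768):
    step_bytes = kv_bytes(ctx)
    step_secs = step_bytes / HBM_BW
    print(f"{ctx:>6} ctx: {step_bytes / 1e9:5.1f} GB KV read per token, "
          f"{step_secs * 1e3:4.2f} ms, <= {1 / step_secs:5.0f} tok/s per request")

# 8K context:  ~2.7 GB per token, ~0.8 ms, at most ~1250 tok/s from bandwidth alone
# 32K context: ~10.7 GB per token, ~3.2 ms, at most ~312 tok/s (ignoring weight reads)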

The Quadratic Problem at Training vs. the Linear Problem at Inference

There's a common confusion here worth untangling. Training attention is O(N²) in compute and memory — every token must attend to every other token, forming an N×N attention matrix. This is why long-context training was hard.

Inference is different. You only generate one token at a time, so attention costs O(N) per decode step, not O(N²). But you pay O(N) in memory reads on every single step. For a 512-token generation at 32K context, that's 512 × 32K reads — effectively O(N × T) total memory traffic, where T is the generation length.

Training Cost

O(N²) compute and memory for full attention matrix. This is why Flash Attention, ring attention, and sequence parallelism matter for training.

Training: O(N²·d) FLOPs per layer

Inference Cost

O(N) memory reads per decode step. Simple, but brutal at scale — every generated token reads the entire history. Memory-bound, not compute-bound.

Decode: O(N·d·T) memory ops (T = generation length)

KV Size Formula — Make It Concrete

Here's the exact formula for how much memory the KV cache consumes:

KV_bytes = 2 × L × N × H_kv × d_head × bytes_per_element

Where:
  2 = K tensor + V tensor
  L = number of transformer layers (e.g., 80 for a 70B model)
  N = sequence length (tokens in context)
  H_kv = number of KV heads (8 for GQA-8, 1 for MQA, 64 for MHA)
  d_head = head dimension (128 typical)
  bytes = 2 for BF16/FP16, 1 for INT8, 0.5 for INT4

Example (70B GQA-8, 32K ctx, BF16):
  = 2 × 80 × 32768 × 8 × 128 × 2 ≈ 10.7 GB
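Putting the same formula in code makes the H_kv lever explicit; the numbers below use the 70B-class assumptions above (80 layers, head dim 128, BF16) and are illustrative only:

def kv_bytes(layers, ctx, kv_heads, d_head=128, bytes_per=2):
    return 2 * layers * ctx * kv_heads * d_head * bytes_per   # K + V

for name, h_kv in [("MHA (64 KV heads)", 64), ("GQA-8", 8), ("MQA-1", 1)]:
    gb = kv_bytes(layers=80, ctx=32_768, kv_heads=h_kv) / 1e9
    print(f"{name}: {gb:.1f} GB KV at 32K context")

# MHA (64 KV heads): 85.9 GB   GQA-8: 10.7 GB   MQA-1: 1.3 GB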

MHA vs. GQA vs. MQA: The Head Sharing Spectrum

One of the most impactful recent changes to production models has been reducing H_kv without sacrificing quality. This is the GQA/MQA story.

Figure: KV head architectures, MHA → GQA → MQA. Multi-Head Attention: H_q = H_kv = 8, KV = 100%. Grouped-Query Attention: H_q = 8, H_kv = 2 (groups of 4), KV = 25%. Multi-Query Attention: H_q = 8, H_kv = 1 (fully shared), KV = 12.5%, with a quality tradeoff.

GQA (used in Llama 3, Mistral, Gemma) is now the dominant production choice — it cuts KV size by 4–8× with minimal quality loss. MQA is faster but shows more quality degradation at long context.
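Mechanically, GQA stores K and V for only H_kv heads and broadcasts each stored head across its group of query heads at attention time. A single-token NumPy sketch (shapes and values are illustrative stand-ins):

import numpy as np

H_q, H_kv, T, d = 8, 2, 16, 128          # 8 query heads share 2 KV heads (groups of 4)
group = H_q // H_kv
rng = np.random.default_rng(0)

q = rng.standard_normal((H_q, d))        # queries for the current token
K = rng.standard_normal((H_kv, T, d))    # cached keys: only H_kv heads are stored
V = rng.standard_normal((H_kv, T, d))    # cached values: 4x smaller than MHA here

K_full = np.repeat(K, group, axis=0)     # broadcast each KV head to its query group
V_full = np.repeat(V, group, axis=0)
scores = np.einsum("hd,htd->ht", q, K_full) / np.sqrt(d)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
out = np.einsum("ht,htd->hd", w, V_full)
print(out.shape)                         # (8, 128): all query heads, a quarter of the KV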

What Engineers Did First: Optimize Within the Paradigm

Before anyone redesigned attention itself, a generation of brilliant inference engineers found ways to make standard attention far more efficient. These optimizations are now table-stakes for any serious serving stack.

FlashAttention: Rewrite the Memory Access Pattern

FlashAttention (Dao et al., 2022, 2023) doesn't change the mathematical output of attention at all. It changes how attention is computed — specifically, how it interacts with the GPU memory hierarchy.

The key insight: instead of computing the full N×N attention matrix and writing it to HBM, FlashAttention tiles the computation into blocks that fit in SRAM (the fast on-chip cache). This dramatically reduces HBM reads and writes, which are the expensive operations.

Standard Attention I/O

Reads Q, K, V from HBM. Writes N×N attention matrix to HBM. Reads it back to apply softmax. Reads again to multiply with V. Total: O(N²) HBM reads/writes.

FlashAttention I/O

Tiles Q, K, V into SRAM blocks. Computes incrementally. Never materializes full N×N matrix in HBM. Total: O(N) HBM reads/writes. 2–4× faster wall-clock on long sequences.

FlashAttention's contribution: It solved the training and prefill memory problem. But it does not fundamentally reduce how much KV state must exist and grow during decode. The KV cache remains, and still scales linearly with context and concurrency.
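The tiling idea itself fits in a few lines: process K and V in blocks while maintaining a running max and running sum, so the softmax is computed online and the full N×N score matrix never exists. This is a simplified single-query illustration of the principle, not the actual fused kernel:

import numpy as np

def attention_tiled(q, K, V, block=64):
    # Online-softmax attention for one query, scanning K/V one tile at a time.
    d = q.shape[0]
    m = -np.inf                     # running max of scores seen so far
    s = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[1])      # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Kb @ q / np.sqrt(d)          # scores for this tile only
        m_new = max(m, scores.max())
        rescale = np.exp(m - m_new)           # correct previous partial results
        p = np.exp(scores - m_new)
        s = s * rescale + p.sum()
        acc = acc * rescale + p @ Vb
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
full = K @ q / np.sqrt(64)
w = np.exp(full - full.max()); w /= w.sum()
print(np.allclose(attention_tiled(q, K, V), w @ V))   # True: same output, tiled I/O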

PagedAttention: vLLM and the OS Insight

The next major insight came not from ML research but from OS design. The vLLM team (Kwon et al., 2023) noticed that traditional KV cache management was like a pre-virtual-memory operating system: memory was allocated in large contiguous chunks, leading to massive fragmentation and waste.

PagedAttention maps non-contiguous physical memory blocks to logical KV pages — just like virtual memory pages in an OS. This allows:

Prefix Sharing

Requests sharing the same system prompt share physical KV pages. One copy in memory serves many users simultaneously.

Copy-on-Write Branching

Agentic branches fork KV pages cheaply, copying only on modification. Tree-structured KV for beam search and reasoning trees.

Dynamic Allocation

No pre-commitment to maximum sequence length. Pages are allocated and freed dynamically, cutting fragmentation from ~60% to near zero.
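A toy sketch of the page-table idea (the names and structure here are invented for illustration and are not vLLM's actual data structures): logical block indices map to reference-counted physical blocks, so a shared prefix lives in memory once and a fork only copies a block when it is written.

BLOCK_TOKENS = 16                            # tokens per KV block (illustrative)

class KVBlockPool:
    # Toy reference-counted pool of physical KV blocks.
    def __init__(self):
        self.blocks, self.refs = {}, {}      # phys_id -> data / refcount
        self.next_id = 0

    def alloc(self, data=None):
        pid = self.next_id
        self.next_id += 1
        self.blocks[pid], self.refs[pid] = data or [], 1
        return pid

    def share(self, pid):                    # prefix sharing: just bump the refcount
        self.refs[pid] += 1
        return pid

    def copy_on_write(self, pid):
        if self.refs[pid] == 1:              # sole owner: write in place
            return pid
        self.refs[pid] -= 1                  # otherwise fork the block first
        return self.alloc(list(self.blocks[pid]))

pool = KVBlockPool()
prefix = [pool.alloc([f"kv{i}"]) for i in range(4)]      # shared system prompt
user_a = [pool.share(b) for b in prefix]                 # no new memory allocated
user_b = [pool.share(b) for b in prefix]
user_a[-1] = pool.copy_on_write(user_a[-1])              # request A diverges here
print(len(pool.blocks))   # 5 physical blocks back 12 logical blocks across 3 sequences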

Speculative Decoding: Amortize the Memory Reads

Speculative decoding runs a small draft model to generate candidate tokens cheaply, then verifies them in parallel with the large model. The key insight for memory: you're amortizing the expensive KV read across multiple tokens per verification step, improving hardware utilization at long context.
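A greedy-verification sketch of the idea; the "models" below are stand-in functions, and real systems verify against the target model's distribution (typically with rejection sampling) rather than exact token matches:

def speculative_step(prefix, draft_model, target_model, k=4):
    # Draft k tokens cheaply, then check them against the large model.
    draft, ctx = [], list(prefix)
    for _ in range(k):                       # k cheap, sequential draft steps
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # In a real system this is one parallel forward pass of the target model,
    # so one pass over weights and KV is amortized across up to k tokens.
    target = [target_model(prefix + draft[:i]) for i in range(k)]

    accepted = []
    for d, t in zip(draft, target):          # keep the agreeing prefix
        if d != t:
            accepted.append(t)               # target's token replaces the miss
            break
        accepted.append(d)
    return accepted

draft  = lambda ctx: ctx[-1] + 1             # toy draft: guess "last token + 1"
target = lambda ctx: ctx[-1] + 1 if len(ctx) < 6 else 0
print(speculative_step([1, 2, 3], draft, target))   # [4, 5, 6, 0]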

A Timeline of Production Attention Optimizations

2017 → Attention Is All You Need
Transformer + KV cache introduced. Elegant, powerful, and unknowingly set the stage for a decade of memory pressure at scale.
2021 → Multi-Query Attention (Shazeer)
Sharing one KV head across all query heads. Cuts KV size by ~8×. First major architectural response to KV cost, at some quality cost.
2022 → FlashAttention v1 (Dao et al.)
IO-aware attention, 2–4× prefill speedup. Rewrites memory access patterns using tiling and SRAM. Now universal in training and prefill.
2023 → PagedAttention / vLLM (Kwon et al.)
Virtual memory for KV cache. Eliminates fragmentation, enables prefix sharing and CoW branching. Became the dominant open-source serving framework.
2023 → GQA (Ainslie et al.)
Grouped-Query Attention: the production sweet spot. K/V head groups of 4–8 cut cache by 4–8× with minimal quality degradation. Now default in Llama 3, Mistral, Gemma, and most frontier models.
2024–2025 → KV Quantization + Speculative Decoding at Scale
INT8/INT4 KV compression, draft models. Further halving or quartering cache size. Speculative decoding amortizes memory reads across tokens. Still additive optimizations on top of the same fundamental architecture.

The ceiling: All of these optimizations are improvements within the standard attention paradigm. They push the wall back. They don't remove it. KV cache still scales linearly with context length and concurrency. At million-token contexts and thousands of concurrent users, even a perfectly optimized transformer hits a physical memory limit. That's why the industry is now looking at changing attention itself.

What "Changing Attention Itself" Really Means

It means escaping the assumption that the model must explicitly retain and attend to every previous token at full resolution. The question is: how do you change this without destroying model quality?

There are multiple paths being explored simultaneously. None are free. Each makes a different trade between accuracy, recall fidelity, hardware compatibility, training stability, and serving cost. The field is genuinely open.

The Landscape: An Honest Comparison

Standard Attention (Transformer, MHA/GQA)
  + Maximum expressiveness
  + Exact recall of any token
  + Well-understood training dynamics
  + Universal hardware support
  − O(N) memory per decode step
  − KV grows unboundedly
  − Bandwidth-bound at long ctx

Linear Attention (RWKV, RetNet, GLA, etc.)
  + O(1) memory at inference
  + Bounded KV state (fixed size)
  + Predictable memory scaling
  ~ Streaming-friendly
  − Approximate: may lose recall
  − Quality gap on retrieval tasks
  − Training harder to stabilize

State Space Models (Mamba, Mamba-2, S4, etc.)
  + O(1) inference memory
  + Streaming / sequential-native
  + Strong on continuous signals
  + Excellent for audio/genomics
  ~ Competitive with transformers at 3B scale
  − Compressed state = lossy recall
  − Quality gap widens at scale for text

Hybrid Architectures (Jamba, Zamba, Griffin, etc.)
  + Best of both worlds
  + Full attention for critical layers
  + SSM/linear for cheap layers
  ~ Reduced KV vs pure transformer
  ~ More complex training
  − Still accumulates some KV
  − Less hardware-optimized than pure paths

Sliding-Window Attention: The Bounded Working Set

Sliding-window attention (used in Mistral's original architecture, Longformer, BigBird) keeps only the last W tokens in the active attention window. This creates a hard bound on KV residency: O(W) instead of O(N).

The tradeoff is clear: long-range dependencies can disappear unless complemented by summary tokens, periodic global-attention layers, or retrieval. Mistral's original 7B leaned on depth: every layer sees only the last W tokens, but stacked layers let information propagate well beyond the window. Other designs interleave occasional full-attention layers with local ones to carry long-range signal explicitly.

Figure: sliding window vs. full attention (bounded active memory). Full attention keeps every prior token in the KV cache: memory O(N), grows forever. Sliding-window attention keeps only the last W tokens at full resolution and compresses or drops older ones: memory O(W), bounded and predictable.
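A minimal sketch of the bounded working set: a fixed-capacity window that evicts the oldest token's K and V as new ones arrive. A deque is used here for clarity; real kernels typically use a ring buffer indexed modulo W.

from collections import deque

class SlidingWindowKV:
    def __init__(self, window):
        self.k = deque(maxlen=window)        # oldest entries fall off automatically
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)                   # token W+1 evicts token 1
        self.v.append(v_t)

    def visible(self):
        return list(self.k), list(self.v)    # what attention can still see: O(W)

cache = SlidingWindowKV(window=3)
for t in range(5):
    cache.append(f"k{t}", f"v{t}")
print(cache.visible()[0])   # ['k2', 'k3', 'k4'], k0 and k1 are gone for good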

State Space Models: Compress History Into a Fixed-Size State

SSMs like the Mamba family take a fundamentally different approach. Instead of caching all prior tokens, the model distills its understanding of the past into a compact, fixed-size state vector that gets updated on each new token.

Think of it like the difference between a stenographer who writes down every word (transformers) and a person who takes notes and integrates new information into an existing mental model (SSMs). The mental model is limited — you won't recall verbatim text from three hours ago — but it's far more memory-efficient.

Transformer Decode

y_t = Attention(q_t, K_{1..t}, V_{1..t})
KV grows: O(t) memory

SSM / Mamba Decode

h_t = A·h_{t-1} + B·x_t
y_t = C·h_t
h is fixed size: O(1) memory

The systems intuition is powerful: SSMs become streaming engines. They read input, update a compact state, and carry it forward. No KV cache to grow. No bandwidth bottleneck from re-reading past tokens. The catch: that compressed state can lose information that full attention would have retained, particularly for tasks requiring verbatim retrieval or long-range needle-in-haystack recall.
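The contrast with the transformer decode loop is easy to see in code: the state h has the same size whether ten tokens or a million have streamed past. The matrices below are random stand-ins, not a trained model, and real selective SSMs like Mamba make A and B input-dependent and structured; this only illustrates the O(1) memory shape of the recurrence.

import numpy as np

d_state, d_in = 16, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((d_state, d_state)) * 0.1   # state transition (stand-in)
B = rng.standard_normal((d_state, d_in))            # input projection (stand-in)
C = rng.standard_normal((d_in, d_state))            # output projection (stand-in)

h = np.zeros(d_state)            # the entire memory of the past: fixed size
for t in range(100_000):         # stream 100K tokens...
    x_t = rng.standard_normal(d_in)
    h = A @ h + B @ x_t          # fold the new token into the state
    y_t = C @ h                  # emit an output from the compressed state
print(h.shape)                   # (16,): unchanged after 100K tokens, no KV cache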

Hierarchical Attention: Memory With Levels

Another path — arguably the most intellectually satisfying — is to build the model's memory with explicit levels, much like a CPU cache hierarchy.

L1: Active Window

Recent N tokens. Full KV resolution. Exact attention. Expensive but necessary for immediate coherence.

L2: Compressed Memory

Older text collapsed into summary vectors, topic embeddings, or compressed latent blocks. Cheaper reads, some fidelity loss.

L3: Cold Storage

Documents, prior conversations, long-term context. Lives off GPU in DRAM or SSD. Fetched only when needed.

Retrieval Path

Semantic search over L3. The model queries what it needs, rather than keeping everything hot.
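A toy sketch of the lookup path (class and method names invented for illustration): check the exact hot window first, fall back to compressed traces, and only then pay for a fetch from cold storage.

class HierarchicalMemory:
    # Toy three-level memory: exact window -> compressed summaries -> cold store.
    def __init__(self, window=8):
        self.window = []          # L1: recent items at full fidelity
        self.summaries = []       # L2: lossy stubs of evicted items
        self.cold_store = {}      # L3: full content, off-GPU, fetched on demand
        self.capacity = window

    def add(self, doc_id, text):
        self.window.append((doc_id, text))
        self.cold_store[doc_id] = text                      # everything lands in L3
        if len(self.window) > self.capacity:                # evict oldest from L1...
            old_id, old_text = self.window.pop(0)
            self.summaries.append((old_id, old_text[:32]))  # ...keep a lossy stub in L2

    def recall(self, doc_id):
        for did, text in self.window:
            if did == doc_id:
                return "L1 (exact)", text
        for did, stub in self.summaries:
            if did == doc_id:                               # stub found: go fetch
                return "retrieved from L3", self.cold_store[did]
        return "miss", None

mem = HierarchicalMemory(window=2)
for i in range(4):
    mem.add(i, f"document {i} full text ...")
print(mem.recall(0))   # evicted from the hot window, recovered via retrieval
print(mem.recall(3))   # still in the exact window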

"The future of attention may not be 'look at everything.' It may be 'remember selectively, compress aggressively, and retrieve precisely.'"

The AI Stack Starts Looking Like a Memory System

If attention changes, the entire hardware and software stack changes with it. This is where the infrastructure opportunity becomes enormous — and where the next decade of systems innovation will live.

Today we talk about "the model" as if intelligence lives in a single artifact. But modern long-context inference is already a distributed runtime, and this is only going to deepen:

Future Inference Memory Hierarchy: not everything belongs in HBM

Inference becomes a memory-hierarchy problem: exact, compressed, warm, cold, and retrieved memory, each tier with different latency, cost, and fidelity.

Tier 1 — HBM (GPU on-device): active KV blocks, current batch, model weights, activations. ~3 TB/s bandwidth, ~80–192 GB capacity.
Tier 2 — DRAM / CXL pooled memory: warm KV cache, compressed summaries, spill buffers, prefill cache. ~300 GB/s, TB scale, lower latency than NVMe.
Tier 3 — NVMe / object storage: cold memory, long-term documents, agent traces, audit logs. GB/s bandwidth, PB scale, high latency.
Tier 4 — Vector index / retrieval system: semantic memory, episodic recall, tool context, multi-agent shared memory. Query-latency access.

CXL: The Memory Fabric That Changes the Equation

CXL (Compute Express Link) is a relatively new interconnect standard, built on the PCIe physical layer, that lets CPUs, GPUs, and accelerators share a coherent memory pool with load/store access at latencies far below NVMe or network-attached storage. For AI inference, this is potentially transformative:

KV Cache Disaggregation

Offload warm KV blocks to CXL-attached DRAM at DRAM-class latency. GPU HBM holds only the hottest blocks; the rest spill to a shared memory pool accessible by all GPUs in the rack.

Elastic Memory Pools

Multiple GPUs can share a CXL memory fabric, dynamically allocating KV capacity to whichever requests need it most. No more per-GPU waste from pre-committed context windows.

Near-Memory Compute

CXL memory modules with embedded processing can compress/decompress KV blocks on the fly, keeping raw throughput high without GPUs burning cycles on memory management.

Prefill/Decode Disaggregation

One of the most important recent architectural shifts in serving systems: separating the prefill phase (processing the input prompt) from the decode phase (generating output tokens). These two phases have completely different hardware profiles:

Prefill Phase

Compute-bound. Processes all prompt tokens in parallel. High FLOP utilization. Benefits from large batch sizes and tensor parallelism. Latency-sensitive (user is waiting).

Decode Phase

Memory-bandwidth-bound. Generates one token at a time. Low FLOP utilization. Bottlenecked by KV reads. Benefits from batching many requests to amortize bandwidth costs.

Disaggregated serving (used in systems like DistServe, Mooncake) routes these phases to different hardware pools, improving overall cluster utilization and throughput dramatically.
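A rough arithmetic-intensity calculation shows why the two phases want different hardware. The model here is deliberately coarse (about 2 FLOPs per parameter per token, weights read once per forward pass, KV traffic ignored) and the numbers are illustrative, not a benchmark:

PARAMS = 70e9                    # 70B parameters
WEIGHT_BYTES = 2 * PARAMS        # BF16 weights read once per forward pass
FLOPS_PER_TOKEN = 2 * PARAMS

def flops_per_byte(tokens_per_pass):
    return (FLOPS_PER_TOKEN * tokens_per_pass) / WEIGHT_BYTES

print(f"prefill, 8K prompt in one pass: {flops_per_byte(8192):7.0f} FLOP/byte")
print(f"decode, one token per pass:     {flops_per_byte(1):7.0f} FLOP/byte")

# An H100 needs roughly 300 FLOP/byte to stay compute-bound
# (on the order of 1000 TFLOP/s BF16 divided by 3.35 TB/s), so prefill easily
# saturates the tensor cores while single-token decode starves on bandwidth.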

The intelligence is moving to the system: when the model stores state in a retrieval index instead of KV cache, intelligence is no longer purely inside the weights. Vector databases, memory managers, rerankers, and tool-routing layers become first-class parts of the intelligence stack. This has profound implications for model evaluation — a bad retrieval system makes a good model look dumb.

What This Means for You

This is not just an academic architecture debate. The memory bottleneck has immediate, practical implications depending on where you sit in the AI stack.

If you're an infra / serving engineer

KV cache residency, fragmentation rates, prefill/decode separation, cache hit rates, and HBM bandwidth utilization are now first-class SLO inputs — as important as GPU utilization. Track them obsessively.

Evaluate PagedAttention-style serving, prefix caching, KV quantization (INT8/INT4 KV is often nearly free in quality; a minimal sketch appears below), and speculative decoding for your workloads now.

Watch CXL memory hardware. It could change your cluster design substantially within 2–3 years.
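As a concrete example of the KV quantization mentioned above, here is a minimal per-token symmetric INT8 sketch. It is illustrative only; production kernels usually quantize per head or per channel and fuse the dequantization into the attention kernel.

import numpy as np

def quantize_kv(x):
    # Symmetric per-row INT8: one FP16 scale per cached token vector.
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((4096, 128)).astype(np.float16)   # 4K tokens, head dim 128
q8, scale = quantize_kv(k_cache)
err = np.abs(dequantize_kv(q8, scale) - k_cache.astype(np.float32)).max()
saved = 1 - (q8.nbytes + scale.nbytes) / k_cache.nbytes
print(f"max abs error {err:.3f}, bytes saved vs FP16 {saved:.0%}")   # small error, ~49% fewer bytes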

If you're fine-tuning or evaluating models

Don't optimize only for benchmark accuracy. Evaluate serving cost at your actual context length distribution. A model that scores 2 points higher on benchmarks but uses 4× the KV memory may be the wrong choice for production.

Test long-context quality under KV quantization — some models degrade more than others at INT8 KV. Test with realistic concurrency, not single-request benchmarks.

If you're a founder or investor

The opportunity is not just bigger models. Memory-aware runtimes, KV compression algorithms, CXL pooled memory hardware, prefix sharing systems, and inference orchestration are all areas where the problem is well-defined and the value is enormous.

Reasoning logs as cold memory: agent traces need to stay auditable, searchable, and replayable even after they leave HBM. That creates a new class of persistent AI memory infrastructure.

The meta-observation: the winners in the next phase of AI infrastructure may not simply be whoever has the best model. They may be whoever combines the best model architecture with the smartest memory orchestration. Training-time improvements and inference-time memory systems are becoming co-designed. The line between "the model" and "the system" is blurring.

Final Thought

The first era of modern AI scaling was compute-centric. Bigger clusters, more FLOPs, faster chips. The next era may be memory-centric. Not memory in the philosophical sense — but literally, physically: where bytes live, how fast they move, and how intelligently they are organized.

Attention made transformers powerful because it gave them the ability to dynamically use any part of their context. That same mechanism is now becoming a scaling wall. KV cache growth, bandwidth saturation, context length explosion, and concurrency are jointly transforming inference into a memory-systems problem.

Changing attention itself — through linear kernels, state compression, retrieval augmentation, hierarchical memory, or hybrid architectures — is not just a model research topic. It is a full-stack infrastructure shift touching GPU design, HBM capacity, CXL fabrics, SmartNICs, KV schedulers, vector indices, and serving runtimes simultaneously.

"The next breakthrough may not come from making models merely bigger. It may come from making memory smarter."

Go Deeper: Architecture

Mamba paper (Gu & Dao, 2023) · FlashAttention-2 · GQA paper · RetNet


Go Deeper: Systems

vLLM / PagedAttention · DistServe · Mooncake · SGLang · TensorRT-LLM


Go Deeper: Hardware

CXL Consortium specs · HBM4 roadmap · Groq LPU architecture · Cerebras wafer-scale