
The Next AI Bottleneck Isn't FLOPs. It's Attention Memory.

A full-stack, systems-level deep dive into why KV cache growth is becoming the central constraint in large-model inference — and what the industry is doing to escape it.

Figure: bottleneck pressure (relative) versus context length × concurrency, from 1K to 256K tokens. Static model weights stay flat while the dynamic KV cache grows past the economic crossover.

The Uncomfortable Scaling Wall

For a decade the AI story was seductively simple: more parameters, more data, more compute. Researchers scaled, things got better. But production inference is exposing a different, far more expensive constraint — the cost of remembering context.

In a transformer decoder, every generated token produces two vectors at every layer: a Key and a Value. These are written to the KV cache and held in GPU HBM so future tokens can attend back to the past. This isn't an implementation detail you can ignore — it is the live memory state of the conversation, and it grows continuously.

Weights are large, but weights are static during inference. You load them once into HBM, they stay resident, done. The KV cache is fundamentally different. It grows with every new token, multiplies across every concurrent user, branches across every agentic loop, and must be accessible at memory-bus latency on every decode step.

The key mental model: model weights are the static executable; KV cache is the dynamic working set. You can think of it like RAM vs a compiled binary — you only need one copy of the binary, but RAM grows with every process, thread, and session.
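To make the "dynamic working set" concrete, here is a minimal single-head, single-layer decode loop in NumPy. Everything in it is an illustrative stand-in (random weights, no batching, no positional encoding), not a real model; the point is simply that the cache grows by one row per token and every decode step reads all of it.

import numpy as np

d = 128                                  # head dimension (illustrative)
k_cache = np.zeros((0, d))               # grows by one row per generated token
v_cache = np.zeros((0, d))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_t, w_q, w_k, w_v):
    # One decode step: append this token's K and V, then attend over the whole cache.
    global k_cache, v_cache
    q = x_t @ w_q                                # query for the current token
    k_cache = np.vstack([k_cache, x_t @ w_k])    # cache grows: now [t, d]
    v_cache = np.vstack([v_cache, x_t @ w_v])
    scores = softmax(k_cache @ q / np.sqrt(d))   # reads every cached K...
    return scores @ v_cache                      # ...and every cached V

rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
for _ in range(4):
    out = decode_step(rng.standard_normal(d), w_q, w_k, w_v)
print(k_cache.shape)   # (4, 128): one K row per token, repeated per layer in a real model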

The Numbers Are Brutal

~2.5 GiB
KV cache for a single 8K-token conversation on a 70B GQA model (BF16)
~80 GiB
KV cache for 32 concurrent users at 8K tokens — fills an H100 SXM alone
~320 GiB
KV cache for 32 users at 32K context — KV must now be sharded or offloaded
>640 GiB
Agentic workload at 32K tokens with 64 active reasoning branches

Illustrative estimates: ~80 layers, 8 KV heads, 128 head dim, 2 bytes/value (BF16), standard GQA. Multi-head attention can be 8–16× larger; MQA reduces it further.

Illustrative KV Cache Sizes (70B-class GQA decoder, BF16 KV)

Scenario               | Context        | Concurrency | Est. KV Cache | System implication
Single chatbot turn    | 2K tokens      | 1 user      | ~640 MiB      | Trivial. Weights dominate.
Long chat session      | 8K tokens      | 1 user      | ~2.5 GiB      | Manageable, but multiplies at batch scale.
Production batch       | 8K tokens      | 32 users    | ~80 GiB       | KV alone can fill an H100 80GB. Weights must share space.
Long-context assistant | 32K tokens     | 32 users    | ~320 GiB      | Requires multi-GPU KV sharding or offload.
Agentic workflow       | 32K + branches | 64 branches | >640 GiB      | Branches fork KV. Memory grows super-linearly with reasoning depth.
1M context (frontier)  | 1M tokens      | 8 users     | >2 TiB        | Impossible in a single GPU's HBM. Requires offload, compression, or retrieval.
Figure: KV cache, every token leaves memory behind (per layer, per user, per branch). Every token appends K and V at every layer; decode step N must read all prior tokens across all layers, i.e. O(N × layers) memory reads per token. Total KV size scales with L × N × d.

Three Flavors of Pressure


Capacity Pressure

KV state competes directly with model weights, activations, batching buffers, and runtime overhead inside scarce, expensive HBM. You can't simply buy your way out — an H100 SXM has 80GB, and that number isn't growing as fast as context windows.

Bandwidth Pressure

Even when the KV fits, repeatedly reading it turns decode into a memory-bandwidth bottleneck rather than a tensor-core bottleneck. At 3.35 TB/s (H100 HBM3), you can saturate bandwidth long before you saturate FLOP capacity — leaving those $30K tensor cores idle.


Scheduling Pressure

Serving systems must continuously decide what stays hot, what gets paged, what gets offloaded, and which requests can be batched without wrecking latency SLOs. This is a new, complex optimization problem that didn't exist in the compute-bound era.

Why Standard Attention Creates This Problem

The attention mechanism is one of the most beautiful ideas in modern machine learning. It is also, at scale, one of the most expensive memory operations ever deployed in production.

When a new token is generated, the model asks: "which prior tokens should I weight most heavily?" It answers by computing a dot product between the current query vector and every previously computed key vector, applying softmax, and mixing value vectors by those weights.

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

Where:
  Q = query vector for current token [1 × d_k]
  K = all prior key vectors [N × d_k] ← grows with context
  V = all prior value vectors [N × d_v] ← grows with context
  N = sequence length (can be 1M+ at frontier)
  d_k = attention head dimension (typically 64–128)

Memory read per decode step = 2 × N × d_k × num_kv_heads × num_layers × bytes_per_element

That equation looks elegant. In systems terms, it means: to generate each new token, the inference engine must load all prior K and V vectors from memory. Every. Single. Step. There is no shortcut in vanilla attention — you cannot skip tokens, summarize them on the fly, or avoid the read. The past must be present.

The bandwidth trap: For a 70B-class GQA model at 8K context, decoding a single token requires reading the entire KV cache, roughly ~2.7 GB, from HBM. At H100's 3.35 TB/s HBM bandwidth, that's ~0.8ms of pure memory reads — before any compute. At 32K context the cache is ~10.7 GB, or ~3.2ms per step, capping a single request at ~312 tokens/sec theoretical max from bandwidth alone, regardless of FLOP capacity.
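Sanity-checking those numbers in a few lines, using the same assumptions as elsewhere in this article (80 layers, 8 KV heads, head dim 128, BF16 KV, roughly 3.35 TB/s of HBM bandwidth):

def kv_bytes(ctx, layers=80, kv_heads=8, d_head=128, bytes_per=2):
    return 2 * layers * ctx * kv_heads * d_head * bytes_per   # K and V tensors

HBM_BW = 3.35e12   # bytes/sec, approximate H100 SXM HBM3 bandwidth

for ctx in (8_192, 32_768):
    step_bytes = kv_bytes(ctx)
    step_secs = step_bytes / HBM_BW
    print(f"{ctx:>6} ctx: {step_bytes / 1e9:5.1f} GB KV read per token, "
          f"{step_secs * 1e3:4.2f} ms, <= {1 / step_secs:5.0f} tok/s per request")

# 8K context:  ~2.7 GB per token, ~0.8 ms, at most ~1250 tok/s from bandwidth alone
# 32K context: ~10.7 GB per token, ~3.2 ms, at most ~312 tok/s (ignoring weight reads)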

The Quadratic Problem at Training vs. the Linear Problem at Inference

There's a common confusion here worth untangling. Training attention is O(N²) in compute and memory — every token must attend to every other token, forming an N×N attention matrix. This is why long-context training was hard.

Inference is different. You only generate one token at a time, so attention costs O(N) per decode step, not O(N²). But you pay O(N) in memory reads on every single step. For a 512-token generation at 32K context, that's 512 × 32K reads — effectively O(N × T) total memory traffic, where T is the generation length.

Training Cost

O(N²) compute and memory for full attention matrix. This is why Flash Attention, ring attention, and sequence parallelism matter for training.

Training: O(N²·d) FLOPs per layer

Inference Cost

O(N) memory reads per decode step. Simple, but brutal at scale — every generated token reads the entire history. Memory-bound, not compute-bound.

Decode: O(N·d·T) memory ops (T = generation length)

KV Size Formula — Make It Concrete

Here's the exact formula for how much memory the KV cache consumes:

KV_bytes = 2 × L × N × H_kv × d_head × bytes_per_element

Where:
  2 = K tensor + V tensor
  L = number of transformer layers (e.g., 80 for a 70B model)
  N = sequence length (tokens in context)
  H_kv = number of KV heads (8 for GQA-8, 1 for MQA, 64 for MHA)
  d_head = head dimension (128 typical)
  bytes = 2 for BF16/FP16, 1 for INT8, 0.5 for INT4

Example (70B GQA-8, 32K ctx, BF16):
  = 2 × 80 × 32768 × 8 × 128 × 2 ≈ 10.7 GB
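Putting the same formula in code makes the H_kv lever explicit; the numbers below use the 70B-class assumptions above (80 layers, head dim 128, BF16) and are illustrative only:

def kv_bytes(layers, ctx, kv_heads, d_head=128, bytes_per=2):
    return 2 * layers * ctx * kv_heads * d_head * bytes_per   # K + V

for name, h_kv in [("MHA (64 KV heads)", 64), ("GQA-8", 8), ("MQA-1", 1)]:
    gb = kv_bytes(layers=80, ctx=32_768, kv_heads=h_kv) / 1e9
    print(f"{name}: {gb:.1f} GB KV at 32K context")

# MHA (64 KV heads): 85.9 GB   GQA-8: 10.7 GB   MQA-1: 1.3 GB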

MHA vs. GQA vs. MQA: The Head Sharing Spectrum

One of the most impactful recent changes to production models has been reducing H_kv without sacrificing quality. This is the GQA/MQA story.

Figure: KV head architectures, MHA → GQA → MQA. Multi-Head Attention: H_q = H_kv = 8, KV = 100%. Grouped-Query Attention: H_q = 8, H_kv = 2 (groups of 4), KV = 25%. Multi-Query Attention: H_q = 8, H_kv = 1 (fully shared), KV = 12.5%, with a quality tradeoff.

GQA (used in Llama 3, Mistral, Gemma) is now the dominant production choice — it cuts KV size by 4–8× with minimal quality loss. MQA is faster but shows more quality degradation at long context.
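Mechanically, GQA stores K and V for only H_kv heads and broadcasts each stored head across its group of query heads at attention time. A single-token NumPy sketch (shapes and values are illustrative stand-ins):

import numpy as np

H_q, H_kv, T, d = 8, 2, 16, 128          # 8 query heads share 2 KV heads (groups of 4)
group = H_q // H_kv
rng = np.random.default_rng(0)

q = rng.standard_normal((H_q, d))        # queries for the current token
K = rng.standard_normal((H_kv, T, d))    # cached keys: only H_kv heads are stored
V = rng.standard_normal((H_kv, T, d))    # cached values: 4x smaller than MHA here

K_full = np.repeat(K, group, axis=0)     # broadcast each KV head to its query group
V_full = np.repeat(V, group, axis=0)
scores = np.einsum("hd,htd->ht", q, K_full) / np.sqrt(d)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
out = np.einsum("ht,htd->hd", w, V_full)
print(out.shape)                         # (8, 128): all query heads, a quarter of the KV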

What Engineers Did First: Optimize Within the Paradigm

Before anyone redesigned attention itself, a generation of brilliant inference engineers found ways to make standard attention far more efficient. These optimizations are now table-stakes for any serious serving stack.

FlashAttention: Rewrite the Memory Access Pattern

FlashAttention (Dao et al., 2022, 2023) doesn't change the mathematical output of attention at all. It changes how attention is computed — specifically, how it interacts with the GPU memory hierarchy.

The key insight: instead of computing the full N×N attention matrix and writing it to HBM, FlashAttention tiles the computation into blocks that fit in SRAM (the fast on-chip cache). This dramatically reduces HBM reads and writes, which are the expensive operations.

Standard Attention I/O

Reads Q, K, V from HBM. Writes N×N attention matrix to HBM. Reads it back to apply softmax. Reads again to multiply with V. Total: O(N²) HBM reads/writes.

FlashAttention I/O

Tiles Q, K, V into SRAM blocks. Computes incrementally. Never materializes full N×N matrix in HBM. Total: O(N) HBM reads/writes. 2–4× faster wall-clock on long sequences.

FlashAttention's contribution: It solved the training and prefill memory problem. But it does not fundamentally reduce how much KV state must exist and grow during decode. The KV cache remains, and still scales linearly with context and concurrency.
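The tiling idea itself fits in a few lines: process K and V in blocks while maintaining a running max and running sum, so the softmax is computed online and the full N×N score matrix never exists. This is a simplified single-query illustration of the principle, not the actual fused kernel:

import numpy as np

def attention_tiled(q, K, V, block=64):
    # Online-softmax attention for one query, scanning K/V one tile at a time.
    d = q.shape[0]
    m = -np.inf                     # running max of scores seen so far
    s = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[1])      # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Kb @ q / np.sqrt(d)          # scores for this tile only
        m_new = max(m, scores.max())
        rescale = np.exp(m - m_new)           # correct previous partial results
        p = np.exp(scores - m_new)
        s = s * rescale + p.sum()
        acc = acc * rescale + p @ Vb
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
full = K @ q / np.sqrt(64)
w = np.exp(full - full.max()); w /= w.sum()
print(np.allclose(attention_tiled(q, K, V), w @ V))   # True: same output, tiled I/O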

PagedAttention: vLLM and the OS Insight

The next major insight came not from ML research but from OS design. The vLLM team (Kwon et al., 2023) noticed that traditional KV cache management was like a pre-virtual-memory operating system: memory was allocated in large contiguous chunks, leading to massive fragmentation and waste.

PagedAttention maps non-contiguous physical memory blocks to logical KV pages — just like virtual memory pages in an OS. This allows:

Prefix Sharing

Requests sharing the same system prompt share physical KV pages. One copy in memory serves many users simultaneously.

Copy-on-Write Branching

Agentic branches fork KV pages cheaply, copying only on modification. Tree-structured KV for beam search and reasoning trees.

Dynamic Allocation

No pre-commitment to maximum sequence length. Pages are allocated and freed dynamically, cutting fragmentation from ~60% to near zero.
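A toy sketch of the page-table idea (the names and structure here are invented for illustration and are not vLLM's actual data structures): logical block indices map to reference-counted physical blocks, so a shared prefix lives in memory once and a fork only copies a block when it is written.

BLOCK_TOKENS = 16                            # tokens per KV block (illustrative)

class KVBlockPool:
    # Toy reference-counted pool of physical KV blocks.
    def __init__(self):
        self.blocks, self.refs = {}, {}      # phys_id -> data / refcount
        self.next_id = 0

    def alloc(self, data=None):
        pid = self.next_id
        self.next_id += 1
        self.blocks[pid], self.refs[pid] = data or [], 1
        return pid

    def share(self, pid):                    # prefix sharing: just bump the refcount
        self.refs[pid] += 1
        return pid

    def copy_on_write(self, pid):
        if self.refs[pid] == 1:              # sole owner: write in place
            return pid
        self.refs[pid] -= 1                  # otherwise fork the block first
        return self.alloc(list(self.blocks[pid]))

pool = KVBlockPool()
prefix = [pool.alloc([f"kv{i}"]) for i in range(4)]      # shared system prompt
user_a = [pool.share(b) for b in prefix]                 # no new memory allocated
user_b = [pool.share(b) for b in prefix]
user_a[-1] = pool.copy_on_write(user_a[-1])              # request A diverges here
print(len(pool.blocks))   # 5 physical blocks back 12 logical blocks across 3 sequences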

Speculative Decoding: Amortize the Memory Reads

Speculative decoding runs a small draft model to generate candidate tokens cheaply, then verifies them in parallel with the large model. The key insight for memory: you're amortizing the expensive KV read across multiple tokens per verification step, improving hardware utilization at long context.
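A greedy-verification sketch of the idea; the "models" below are stand-in functions, and real systems verify against the target model's distribution (typically with rejection sampling) rather than exact token matches:

def speculative_step(prefix, draft_model, target_model, k=4):
    # Draft k tokens cheaply, then check them against the large model.
    draft, ctx = [], list(prefix)
    for _ in range(k):                       # k cheap, sequential draft steps
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # In a real system this is one parallel forward pass of the target model,
    # so one pass over weights and KV is amortized across up to k tokens.
    target = [target_model(prefix + draft[:i]) for i in range(k)]

    accepted = []
    for d, t in zip(draft, target):          # keep the agreeing prefix
        if d != t:
            accepted.append(t)               # target's token replaces the miss
            break
        accepted.append(d)
    return accepted

draft  = lambda ctx: ctx[-1] + 1             # toy draft: guess "last token + 1"
target = lambda ctx: ctx[-1] + 1 if len(ctx) < 6 else 0
print(speculative_step([1, 2, 3], draft, target))   # [4, 5, 6, 0]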

A Timeline of Production Attention Optimizations

2017 → Attention Is All You Need
Transformer + KV cache introduced. Elegant, powerful, and unknowingly set the stage for a decade of memory pressure at scale.
2021 → Multi-Query Attention (Shazeer)
Sharing one KV head across all query heads. Cuts KV size by ~8×. First major architectural response to KV cost, at some quality cost.
2022 → FlashAttention v1 (Dao et al.)
IO-aware attention, 2–4× prefill speedup. Rewrites memory access patterns using tiling and SRAM. Now universal in training and prefill.
2023 → PagedAttention / vLLM (Kwon et al.)
Virtual memory for KV cache. Eliminates fragmentation, enables prefix sharing and CoW branching. Became the dominant open-source serving framework.
2023 → GQA (Ainslie et al.)
Grouped-Query Attention: the production sweet spot. K/V head groups of 4–8 cut cache by 4–8× with minimal quality degradation. Now default in Llama 3, Mistral, Gemma, and most frontier models.
2024–2025 → KV Quantization + Speculative Decoding at Scale
INT8/INT4 KV compression, draft models. Further halving or quartering cache size. Speculative decoding amortizes memory reads across tokens. Still additive optimizations on top of the same fundamental architecture.

The ceiling: All of these optimizations are improvements within the standard attention paradigm. They push the wall back. They don't remove it. KV cache still scales linearly with context length and concurrency. At million-token contexts and thousands of concurrent users, even a perfectly optimized transformer hits a physical memory limit. That's why the industry is now looking at changing attention itself.

What "Changing Attention Itself" Really Means

It means escaping the assumption that the model must explicitly retain and attend to every previous token at full resolution. The question is: how do you change this without destroying model quality?

There are multiple paths being explored simultaneously. None are free. Each makes a different trade between accuracy, recall fidelity, hardware compatibility, training stability, and serving cost. The field is genuinely open.

The Landscape: An Honest Comparison

Standard Attention (Transformer, MHA/GQA)
  + Maximum expressiveness
  + Exact recall of any token
  + Well-understood training dynamics
  + Universal hardware support
  − O(N) memory per decode step
  − KV grows unboundedly
  − Bandwidth-bound at long ctx

Linear Attention (RWKV, RetNet, GLA, etc.)
  + O(1) memory at inference
  + Bounded KV state (fixed size)
  + Predictable memory scaling
  ~ Streaming-friendly
  − Approximate: may lose recall
  − Quality gap on retrieval tasks
  − Training harder to stabilize

State Space Models (Mamba, Mamba-2, S4, etc.)
  + O(1) inference memory
  + Streaming / sequential-native
  + Strong on continuous signals
  + Excellent for audio/genomics
  ~ Competitive with transformers at 3B scale
  − Compressed state = lossy recall
  − Quality gap widens at scale for text

Hybrid Architectures (Jamba, Zamba, Griffin, etc.)
  + Best of both worlds
  + Full attention for critical layers
  + SSM/linear for cheap layers
  ~ Reduced KV vs pure transformer
  ~ More complex training
  − Still accumulates some KV
  − Less hardware-optimized than pure paths

Sliding-Window Attention: The Bounded Working Set

Sliding-window attention (used in Mistral's original architecture, Longformer, BigBird) keeps only the last W tokens in the active attention window. This creates a hard bound on KV residency: O(W) instead of O(N).

The tradeoff is clear: long-range dependencies can disappear unless complemented by summary tokens, periodic global-attention layers, or retrieval. Mistral's original 7B leaned on depth: every layer sees only the last W tokens, but stacked layers let information propagate well beyond the window. Other designs interleave occasional full-attention layers with local ones to carry long-range signal explicitly.

Figure: sliding window vs. full attention (bounded active memory). Full attention keeps every prior token in the KV cache: memory O(N), grows forever. Sliding-window attention keeps only the last W tokens at full resolution and compresses or drops older ones: memory O(W), bounded and predictable.
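A minimal sketch of the bounded working set: a fixed-capacity window that evicts the oldest token's K and V as new ones arrive. A deque is used here for clarity; real kernels typically use a ring buffer indexed modulo W.

from collections import deque

class SlidingWindowKV:
    def __init__(self, window):
        self.k = deque(maxlen=window)        # oldest entries fall off automatically
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)                   # token W+1 evicts token 1
        self.v.append(v_t)

    def visible(self):
        return list(self.k), list(self.v)    # what attention can still see: O(W)

cache = SlidingWindowKV(window=3)
for t in range(5):
    cache.append(f"k{t}", f"v{t}")
print(cache.visible()[0])   # ['k2', 'k3', 'k4'], k0 and k1 are gone for good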

State Space Models: Compress History Into a Fixed-Size State

SSMs like the Mamba family take a fundamentally different approach. Instead of caching all prior tokens, the model distills its understanding of the past into a compact, fixed-size state vector that gets updated on each new token.

Think of it like the difference between a stenographer who writes down every word (transformers) and a person who takes notes and integrates new information into an existing mental model (SSMs). The mental model is limited — you won't recall verbatim text from three hours ago — but it's far more memory-efficient.

Transformer Decode

y_t = Attention(q_t, K_{1..t}, V_{1..t})
KV grows: O(t) memory

SSM / Mamba Decode

h_t = A·h_{t-1} + B·x_t
y_t = C·h_t
h is fixed size: O(1) memory

The systems intuition is powerful: SSMs become streaming engines. They read input, update a compact state, and carry it forward. No KV cache to grow. No bandwidth bottleneck from re-reading past tokens. The catch: that compressed state can lose information that full attention would have retained, particularly for tasks requiring verbatim retrieval or long-range needle-in-haystack recall.
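The contrast with the transformer decode loop is easy to see in code: the state h has the same size whether ten tokens or a million have streamed past. The matrices below are random stand-ins, not a trained model, and real selective SSMs like Mamba make A and B input-dependent and structured; this only illustrates the O(1) memory shape of the recurrence.

import numpy as np

d_state, d_in = 16, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((d_state, d_state)) * 0.1   # state transition (stand-in)
B = rng.standard_normal((d_state, d_in))            # input projection (stand-in)
C = rng.standard_normal((d_in, d_state))            # output projection (stand-in)

h = np.zeros(d_state)            # the entire memory of the past: fixed size
for t in range(100_000):         # stream 100K tokens...
    x_t = rng.standard_normal(d_in)
    h = A @ h + B @ x_t          # fold the new token into the state
    y_t = C @ h                  # emit an output from the compressed state
print(h.shape)                   # (16,): unchanged after 100K tokens, no KV cache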

Hierarchical Attention: Memory With Levels

Another path — arguably the most intellectually satisfying — is to build the model's memory with explicit levels, much like a CPU cache hierarchy.

L1: Active Window

Recent N tokens. Full KV resolution. Exact attention. Expensive but necessary for immediate coherence.

L2: Compressed Memory

Older text collapsed into summary vectors, topic embeddings, or compressed latent blocks. Cheaper reads, some fidelity loss.

L3: Cold Storage

Documents, prior conversations, long-term context. Lives off GPU in DRAM or SSD. Fetched only when needed.

Retrieval Path

Semantic search over L3. The model queries what it needs, rather than keeping everything hot.
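A toy sketch of the lookup path (class and method names invented for illustration): check the exact hot window first, fall back to compressed traces, and only then pay for a fetch from cold storage.

class HierarchicalMemory:
    # Toy three-level memory: exact window -> compressed summaries -> cold store.
    def __init__(self, window=8):
        self.window = []          # L1: recent items at full fidelity
        self.summaries = []       # L2: lossy stubs of evicted items
        self.cold_store = {}      # L3: full content, off-GPU, fetched on demand
        self.capacity = window

    def add(self, doc_id, text):
        self.window.append((doc_id, text))
        self.cold_store[doc_id] = text                      # everything lands in L3
        if len(self.window) > self.capacity:                # evict oldest from L1...
            old_id, old_text = self.window.pop(0)
            self.summaries.append((old_id, old_text[:32]))  # ...keep a lossy stub in L2

    def recall(self, doc_id):
        for did, text in self.window:
            if did == doc_id:
                return "L1 (exact)", text
        for did, stub in self.summaries:
            if did == doc_id:                               # stub found: go fetch
                return "retrieved from L3", self.cold_store[did]
        return "miss", None

mem = HierarchicalMemory(window=2)
for i in range(4):
    mem.add(i, f"document {i} full text ...")
print(mem.recall(0))   # evicted from the hot window, recovered via retrieval
print(mem.recall(3))   # still in the exact window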

"The future of attention may not be 'look at everything.' It may be 'remember selectively, compress aggressively, and retrieve precisely.'"

The AI Stack Starts Looking Like a Memory System

If attention changes, the entire hardware and software stack changes with it. This is where the infrastructure opportunity becomes enormous — and where the next decade of systems innovation will live.

Today we talk about "the model" as if intelligence lives in a single artifact. But modern long-context inference is already a distributed runtime, and this is only going to deepen:

Future Inference Memory Hierarchy: not everything belongs in HBM

Inference becomes a memory-hierarchy problem: exact, compressed, warm, cold, and retrieved memory, each tier with different latency, cost, and fidelity.

Tier 1 — HBM (GPU on-device): active KV blocks, current batch, model weights, activations. ~3 TB/s bandwidth, ~80–192 GB capacity.
Tier 2 — DRAM / CXL pooled memory: warm KV cache, compressed summaries, spill buffers, prefill cache. ~300 GB/s, TB scale, lower latency than NVMe.
Tier 3 — NVMe / object storage: cold memory, long-term documents, agent traces, audit logs. GB/s bandwidth, PB scale, high latency.
Tier 4 — Vector index / retrieval system: semantic memory, episodic recall, tool context, multi-agent shared memory. Query-latency access.

CXL: The Memory Fabric That Changes the Equation

CXL (Compute Express Link) is a relatively new interconnect standard, built on the PCIe physical layer, that lets CPUs, GPUs, and accelerators share a coherent memory pool with load/store access at latencies far below NVMe or network-attached storage. For AI inference, this is potentially transformative:

KV Cache Disaggregation

Offload warm KV blocks to CXL-attached DRAM at DRAM-class latency. GPU HBM holds only the hottest blocks; the rest spill to a shared memory pool accessible by all GPUs in the rack.

Elastic Memory Pools

Multiple GPUs can share a CXL memory fabric, dynamically allocating KV capacity to whichever requests need it most. No more per-GPU waste from pre-committed context windows.

Near-Memory Compute

CXL memory modules with embedded processing can compress/decompress KV blocks on the fly, keeping raw throughput high without GPUs burning cycles on memory management.

Prefill/Decode Disaggregation

One of the most important recent architectural shifts in serving systems: separating the prefill phase (processing the input prompt) from the decode phase (generating output tokens). These two phases have completely different hardware profiles:

Prefill Phase

Compute-bound. Processes all prompt tokens in parallel. High FLOP utilization. Benefits from large batch sizes and tensor parallelism. Latency-sensitive (user is waiting).

Decode Phase

Memory-bandwidth-bound. Generates one token at a time. Low FLOP utilization. Bottlenecked by KV reads. Benefits from batching many requests to amortize bandwidth costs.

Disaggregated serving (used in systems like DistServe, Mooncake) routes these phases to different hardware pools, improving overall cluster utilization and throughput dramatically.
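A rough arithmetic-intensity calculation shows why the two phases want different hardware. The model here is deliberately coarse (about 2 FLOPs per parameter per token, weights read once per forward pass, KV traffic ignored) and the numbers are illustrative, not a benchmark:

PARAMS = 70e9                    # 70B parameters
WEIGHT_BYTES = 2 * PARAMS        # BF16 weights read once per forward pass
FLOPS_PER_TOKEN = 2 * PARAMS

def flops_per_byte(tokens_per_pass):
    return (FLOPS_PER_TOKEN * tokens_per_pass) / WEIGHT_BYTES

print(f"prefill, 8K prompt in one pass: {flops_per_byte(8192):7.0f} FLOP/byte")
print(f"decode, one token per pass:     {flops_per_byte(1):7.0f} FLOP/byte")

# An H100 needs roughly 300 FLOP/byte to stay compute-bound
# (on the order of 1000 TFLOP/s BF16 divided by 3.35 TB/s), so prefill easily
# saturates the tensor cores while single-token decode starves on bandwidth.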

The intelligence is moving to the system: when the model stores state in a retrieval index instead of KV cache, intelligence is no longer purely inside the weights. Vector databases, memory managers, rerankers, and tool-routing layers become first-class parts of the intelligence stack. This has profound implications for model evaluation — a bad retrieval system makes a good model look dumb.

What This Means for You

This is not just an academic architecture debate. The memory bottleneck has immediate, practical implications depending on where you sit in the AI stack.

If you're an infra / serving engineer

KV cache residency, fragmentation rates, prefill/decode separation, cache hit rates, and HBM bandwidth utilization are now first-class SLO inputs — as important as GPU utilization. Track them obsessively.

Evaluate PagedAttention-style serving, prefix caching, KV quantization (INT8/INT4 KV is often nearly free in quality; a minimal sketch appears below), and speculative decoding for your workloads now.

Watch CXL memory hardware. It could change your cluster design substantially within 2–3 years.
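As a concrete example of the KV quantization mentioned above, here is a minimal per-token symmetric INT8 sketch. It is illustrative only; production kernels usually quantize per head or per channel and fuse the dequantization into the attention kernel.

import numpy as np

def quantize_kv(x):
    # Symmetric per-row INT8: one FP16 scale per cached token vector.
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((4096, 128)).astype(np.float16)   # 4K tokens, head dim 128
q8, scale = quantize_kv(k_cache)
err = np.abs(dequantize_kv(q8, scale) - k_cache.astype(np.float32)).max()
saved = 1 - (q8.nbytes + scale.nbytes) / k_cache.nbytes
print(f"max abs error {err:.3f}, bytes saved vs FP16 {saved:.0%}")   # small error, ~49% fewer bytes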

If you're fine-tuning or evaluating models

Don't optimize only for benchmark accuracy. Evaluate serving cost at your actual context length distribution. A model that scores 2 points higher on benchmarks but uses 4× the KV memory may be the wrong choice for production.

Test long-context quality under KV quantization — some models degrade more than others at INT8 KV. Test with realistic concurrency, not single-request benchmarks.

If you're a founder or investor

The opportunity is not just bigger models. Memory-aware runtimes, KV compression algorithms, CXL pooled memory hardware, prefix sharing systems, and inference orchestration are all areas where the problem is well-defined and the value is enormous.

Reasoning logs as cold memory: agent traces need to stay auditable, searchable, and replayable even after they leave HBM. That creates a new class of persistent AI memory infrastructure.

The meta-observation: the winners in the next phase of AI infrastructure may not simply be whoever has the best model. They may be whoever combines the best model architecture with the smartest memory orchestration. Training-time improvements and inference-time memory systems are becoming co-designed. The line between "the model" and "the system" is blurring.

Final Thought

The first era of modern AI scaling was compute-centric. Bigger clusters, more FLOPs, faster chips. The next era may be memory-centric. Not memory in the philosophical sense — but literally, physically: where bytes live, how fast they move, and how intelligently they are organized.

Attention made transformers powerful because it gave them the ability to dynamically use any part of their context. That same mechanism is now becoming a scaling wall. KV cache growth, bandwidth saturation, context length explosion, and concurrency are jointly transforming inference into a memory-systems problem.

Changing attention itself — through linear kernels, state compression, retrieval augmentation, hierarchical memory, or hybrid architectures — is not just a model research topic. It is a full-stack infrastructure shift touching GPU design, HBM capacity, CXL fabrics, SmartNICs, KV schedulers, vector indices, and serving runtimes simultaneously.

"The next breakthrough may not come from making models merely bigger. It may come from making memory smarter."

Go Deeper: Architecture

Mamba paper (Gu & Dao, 2023) · FlashAttention-2 · GQA paper · RetNet


Go Deeper: Systems

vLLM / PagedAttention · DistServe · Mooncake · SGLang · TensorRT-LLM


Go Deeper: Hardware

CXL Consortium specs · HBM4 roadmap · Groq LPU architecture · Cerebras wafer-scale