Here's a number that concentrates the mind: a fleet of Llama-70B instances serving 100 concurrent users at 128k-token context generates over a terabyte of KV cache. Not model weights. Not activations. Just the attention key-value state — the temporary memory the model builds while it thinks.
And that terabyte has nowhere good to live.
High-bandwidth memory on the GPU is too small. DRAM is cheap but slow. The interconnects shuttling data between them are becoming the true throughput ceiling of modern inference. And yet nearly every production serving stack treats KV cache as a detail — a side effect of the forward pass — rather than as the primary infrastructure concern it has become.
That mismatch is the subject of this piece.
CXL exposes memory. A KV infrastructure layer should expose intent.
(the core thesis)
What KV Cache Actually Is
Before the architecture argument, a quick grounding. In a transformer doing autoregressive decoding, each new token must attend to every token that came before it. Without caching, that means recomputing the attention keys and values for the entire history on every single step, a per-step cost that scales as O(n²) in sequence length.
KV cache solves this by saving those keys and values after the first pass. Each new decode step reads from cache and only runs the attention computation for the new token.
```
// Without KV cache: recompute everything, every step
token_N_output = attention(all_tokens_0_to_N)               // O(N²) per step

// With KV cache: load past, compute present
kv_store[session] = cache_keys_and_values(tokens_0_to_N-1)
token_N_output = attention(kv_store[session], new_token)    // O(N) per step
```
The throughput win is dramatic. But so is the cost: the cache is a live, growing, session-bound memory object that persists for the entire life of a conversation — and that object needs somewhere to live.
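To make the caching step concrete, here is a minimal single-head, single-layer sketch in plain Python. It is a toy under stated assumptions: `project` is a hypothetical stand-in for the learned key and value projections, and a real model repeats this across many layers and heads. It shows that both decode paths produce the same outputs while the cached path does far fewer projections.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def project(token, factor):
    """Hypothetical stand-in for a learned K or V projection."""
    return [factor * x for x in token]

def decode_without_cache(tokens):
    """Recompute K and V for the whole history at every step."""
    outputs, projections = [], 0
    for n in range(1, len(tokens) + 1):
        history = tokens[:n]
        keys = [project(t, 0.5) for t in history]
        values = [project(t, 2.0) for t in history]
        projections += 2 * n
        outputs.append(attend(tokens[n - 1], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project each token once, append to the cache, attend over the cache."""
    keys, values, outputs, projections = [], [], [], 0
    for token in tokens:
        keys.append(project(token, 0.5))
        values.append(project(token, 2.0))
        projections += 2
        outputs.append(attend(token, keys, values))
    return outputs, projections

TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
uncached_out, uncached_work = decode_without_cache(TOKENS)
cached_out, cached_work = decode_with_cache(TOKENS)
print(uncached_work, cached_work)  # 20 projections vs 8 for four tokens
```

The `keys` and `values` lists in `decode_with_cache` are the KV cache: they only ever grow for the life of the session, which is exactly the memory object the rest of this piece is about.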
The Scale Problem Is Nonlinear
Here's what makes this urgent rather than merely interesting. KV memory doesn't just track model size: it scales with the product of context length, concurrency, and session persistence, so growth in any one of them multiplies the others.
| Model | Context | Sessions | KV Memory | vs. Weights |
|---|---|---|---|---|
| Llama-8B | 32k | 100 | ~70 GB | ≈ weights |
| Llama-70B | 32k | 100 | ~400 GB | ≈ weights |
| Llama-70B | 128k | 100 | >1 TB | 4× weights |
| Persistent agent | multi-session | many users | infra-scale | unbounded |
In the 128k case, KV cache is four times the size of the model it belongs to. Any HBM-first strategy collapses here. And with agentic systems that accumulate memory across sessions, there's no natural upper bound at all.
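The arithmetic behind figures like these can be sketched in a few lines. The formula is the standard one (a key and a value vector per layer, per KV head, per token), but the shape below is an assumption: 80 layers, 8 KV heads with grouped-query attention, head dimension 128, and 16-bit values, roughly in line with publicly documented 70B-class configurations. Totals move a lot with KV precision and KV head count, so results from this sketch need not match the table's rounded figures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVShape:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumption: fp16/bf16 cache entries

    def bytes_per_token(self) -> int:
        # Factor 2: one key vector and one value vector per layer per KV head.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_value

def kv_bytes(shape: KVShape, context_tokens: int, sessions: int) -> int:
    """Total KV cache size if every session fills its whole context window."""
    return shape.bytes_per_token() * context_tokens * sessions

# Assumed 70B-class shape (not taken from the table above).
ASSUMED_70B = KVShape(layers=80, kv_heads=8, head_dim=128)

total = kv_bytes(ASSUMED_70B, context_tokens=128 * 1024, sessions=100)
print(f"{ASSUMED_70B.bytes_per_token() / 1024:.0f} KiB per token")
print(f"{total / 1e12:.2f} TB for 100 sessions at 128k context")
```

Under these assumptions the 128k case lands well above a terabyte, and every factor in the product is one that deployments are actively pushing upward.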
Why Generic Memory Systems Fail
The instinct is to throw more memory at the problem. And yes — CXL-attached memory pools, DRAM expansion, NVMe spilling — these help. But they treat KV cache as anonymous bytes, and KV cache is not anonymous bytes.
Each of these failure modes is tractable in isolation. The problem is that fixing one without understanding the others produces a system that is fast at the wrong thing. What's needed is infrastructure that understands the semantics of KV cache — not just its bytes.
A Reference Architecture
The goal is not to replace HBM. Active decode always needs the fastest possible memory for its hot working set. The goal is to surround HBM with a semantically-aware memory hierarchy that keeps it from drowning in cold, reusable, compressible KV state.
[Diagram: memory tiers, in order: Hot, Appliance, Warm, Cold]
The key insight is the KV appliance layer sitting between HBM and the slower tiers. This isn't a general memory controller — it's a component that understands transformer session structure, prefix boundaries, layer granularity, and decode urgency. It does the work that keeps HBM clean.
What the Appliance Layer Does
Prefix deduplication. System prompts, shared RAG context, and template prefixes can be stored once and referenced by pointer across thousands of sessions. A single shared prefix block serving 10,000 requests replaces 10,000 per-session copies in HBM with one.
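One way to picture prefix deduplication is a content-addressed store with reference counts. This is a minimal sketch, not a description of any existing system: `PrefixStore`, `acquire`, `release`, and the `compute_kv` callback are hypothetical names, and the KV payload is a placeholder `bytes` object rather than real tensors.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PrefixBlock:
    kv_payload: bytes   # stand-in for the stored KV tensors
    ref_count: int = 0

class PrefixStore:
    """Stores one KV block per distinct prefix and counts its users."""

    def __init__(self):
        self._blocks = {}

    @staticmethod
    def key_for(prefix_token_ids):
        # Hash the token ids so identical prefixes map to the same block.
        raw = ",".join(str(t) for t in prefix_token_ids).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def acquire(self, prefix_token_ids, compute_kv):
        """Return a handle to the shared block, computing KV only on first use."""
        key = self.key_for(prefix_token_ids)
        block = self._blocks.get(key)
        if block is None:
            block = PrefixBlock(kv_payload=compute_kv(prefix_token_ids))
            self._blocks[key] = block
        block.ref_count += 1
        return key

    def release(self, key):
        """Drop one reference and free the block when nobody uses it."""
        block = self._blocks[key]
        block.ref_count -= 1
        if block.ref_count == 0:
            del self._blocks[key]

    def stored_blocks(self):
        return len(self._blocks)
```

A production key would also need to cover anything else that changes the KV values, such as the model version and the precision of the cache; the sketch hashes token ids only to keep the idea visible.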
Semantic-aware compression. KV tensors have structure. Attention heads differ in volatility. Lower layers are more compressible than upper layers. A KV-aware compressor can apply different strategies per-head and per-layer rather than treating the tensor as flat bytes.
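As a rough illustration of per-layer treatment, the sketch below picks a bit width from the layer index and applies plain symmetric uniform quantization. Both choices are assumptions made for the example: the 4-bit/8-bit split in `bits_for_layer` is a hypothetical policy, and uniform quantization is a stand-in for whatever scheme a real KV-aware compressor would use.

```python
def bits_for_layer(layer, num_layers):
    """Hypothetical policy: fewer bits for the lower half of the stack."""
    return 4 if layer < num_layers // 2 else 8

def quantize(values, bits):
    """Symmetric uniform quantization to signed integer codes."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = peak / levels
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def compress_layer(values, layer, num_layers):
    """Compress one layer's KV values with the bit width chosen for that layer."""
    bits = bits_for_layer(layer, num_layers)
    codes, scale = quantize(values, bits)
    return {"bits": bits, "scale": scale, "codes": codes}

sample = [0.1, -0.7, 0.35, 1.0]
low = compress_layer(sample, layer=0, num_layers=32)
high = compress_layer(sample, layer=31, num_layers=32)
print(low["bits"], high["bits"])  # 4 8
```

The same dispatch point could just as well key on attention head, which is where a generic byte-level compressor has no information to work with.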
Predictive prefetch. Unlike generic paging, this layer knows which sessions are in active decode — and can prefetch the next KV blocks from slower tiers before the GPU asks for them, hiding latency behind computation.
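The prefetch decision can be sketched as a small planning function. Everything here is illustrative: `Session`, `plan_prefetch`, and the fixed `lookahead` window are hypothetical, and a real scheduler would also weigh bandwidth and deadlines.

```python
from dataclasses import dataclass

@dataclass
class Session:
    session_id: str
    decoding: bool      # True while the session is in active decode
    next_block: int     # index of the next KV block the decode loop will read
    total_blocks: int

def plan_prefetch(sessions, resident, lookahead=2):
    """Return (session_id, block) pairs to pull into fast memory ahead of demand.

    `resident` is the set of (session_id, block) pairs already in the fast tier.
    """
    plan = []
    for s in sessions:
        if not s.decoding:
            continue  # idle sessions get no prefetch bandwidth
        last = min(s.next_block + lookahead, s.total_blocks)
        for block in range(s.next_block, last):
            if (s.session_id, block) not in resident:
                plan.append((s.session_id, block))
    return plan

sessions = [
    Session("a", decoding=True, next_block=3, total_blocks=10),
    Session("b", decoding=False, next_block=0, total_blocks=10),
    Session("c", decoding=True, next_block=4, total_blocks=5),
]
print(plan_prefetch(sessions, resident={("a", 3)}))
# [('a', 4), ('c', 4)]
```

The point of the sketch is the input, not the loop: a generic pager sees page faults, while this function sees which sessions are decoding and where they are headed.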
Urgency-aware eviction. Eviction decisions are made with decode priority in mind. A session mid-generation is never evicted. A completed session with reusable prefix is archived, not deleted.
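A minimal sketch of such a policy follows. The first two rules come from the paragraph above; the `DROP` case for sessions with nothing reusable, the longest-idle-first ordering, and all names are assumptions added for the example.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    KEEP = "keep"
    ARCHIVE = "archive"  # move to a colder tier and keep for reuse
    DROP = "drop"        # assumption: nothing reusable, so free it

@dataclass
class SessionState:
    session_id: str
    decoding: bool
    has_reusable_prefix: bool
    idle_seconds: float

def eviction_action(s):
    if s.decoding:
        return Action.KEEP      # a session mid-generation is never evicted
    if s.has_reusable_prefix:
        return Action.ARCHIVE   # archived, not deleted
    return Action.DROP

def pick_victims(sessions, blocks_needed, blocks_per_session=1):
    """Choose sessions to move out of the fast tier, longest idle first."""
    candidates = [s for s in sessions if eviction_action(s) is not Action.KEEP]
    candidates.sort(key=lambda s: s.idle_seconds, reverse=True)
    victims, freed = [], 0
    for s in candidates:
        if freed >= blocks_needed:
            break
        victims.append((s.session_id, eviction_action(s)))
        freed += blocks_per_session
    return victims
```

Note that `pick_victims` can return fewer blocks than requested when every remaining session is decoding. In this sketch that is deliberate: the caller has to find capacity elsewhere rather than stall an active generation.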
The ASIC Opportunity
Software-only implementations of the above are bottlenecked by the host CPU's ability to manage metadata, schedule transfers, and run compression. A KV-aware ASIC changes this.
```
┌─────────────────────────────────────────────────┐
│             KV Infrastructure ASIC              │
├─────────────────────────────────────────────────┤
│  PCIe / CXL / NVLink Front End                  │  ← fabric interface
│  High-Bandwidth DMA Engines                     │  ← async movement
│  KV Compression / Decompression Units           │  ← structure-aware
│  KV Metadata SRAM (session, layer, head index)  │  ← fast lookup
│  Prefix Dedup Engine                            │  ← hash + ref-count
│  Prefetch Scheduler (decode urgency model)      │  ← prediction engine
│  QoS + Per-Session Isolation + Telemetry        │  ← observability
├─────────────────────────────────────────────────┤
│  Memory Controllers: HBM │ DDR │ CXL │ NVMe     │
└─────────────────────────────────────────────────┘
```
How This Differs From Existing Work
Several systems are moving in this direction. It's worth being precise about where they sit relative to this proposal.
| System | Primary Focus | What's Different Here |
|---|---|---|
| vLLM PagedAttention | Efficient KV paging within a single runtime | This proposal lifts KV orchestration into infrastructure-wide, cross-process, cross-host scope. |
| LMCache | KV sharing and caching across requests | Extends into dedicated memory fabrics, hardware compression offload, and ASIC-level coordination. |
| Mooncake | Distributed KV transport between inference nodes | Adds metadata-aware placement, semantic eviction, and ASIC-backed movement primitives. |
| NVIDIA Dynamo | Inference request orchestration | Focus here is the memory topology layer beneath orchestration — placement and movement, not routing. |
| TensorRT-LLM | Inference kernel optimization | Orthogonal: this is infrastructure-layer, not kernel-layer. |
The pattern across the field is that each system solves one layer of the problem well. What's argued here is that the layers need to be composed into a coherent stack with a single semantic contract: KV blocks have identity, context, priority, and reuse potential. Generic memory systems cannot express that contract. Dedicated infrastructure can.
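To show what that contract might look like as data, here is one possible descriptor that carries the four properties named above. It is a sketch only: the field names, the `Priority` levels, and the `preferred_tier` rule are all hypothetical and not drawn from any of the systems in the table.

```python
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    COLD = 0
    WARM = 1
    ACTIVE_DECODE = 2

@dataclass(frozen=True)
class KVBlockDescriptor:
    block_id: str          # identity
    session_id: str        # context: which session owns the block
    layer: int             # context: transformer layer
    token_start: int       # context: first token covered (inclusive)
    token_end: int         # context: last token covered (exclusive)
    priority: Priority     # priority: how urgently decode needs it
    shared_prefix: bool    # reuse potential: shared across sessions?
    ref_count: int = 1     # reuse potential: current number of users

def preferred_tier(desc: KVBlockDescriptor) -> str:
    """Hypothetical placement rule driven by the descriptor, not by raw addresses."""
    if desc.priority is Priority.ACTIVE_DECODE:
        return "hot"
    if desc.shared_prefix or desc.priority is Priority.WARM:
        return "warm"
    return "cold"

block = KVBlockDescriptor(
    block_id="blk-0", session_id="s1", layer=12,
    token_start=0, token_end=256,
    priority=Priority.COLD, shared_prefix=True, ref_count=40,
)
print(preferred_tier(block))  # warm
```

None of these fields can be recovered from a physical address, which is the sense in which a generic memory system cannot express the contract.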
What This Does Not Solve
Credibility requires honesty about limitations. Dedicated KV infrastructure is not a category-ending answer.
Why This Layer May Matter More Than the Next GPU
The training era taught the industry a clean lesson: more FLOPS, better models. The scaling laws held, the case for spending was obvious, and infrastructure investment concentrated around compute.
The inference era is messier. FLOPS are no longer the limiter for many deployed workloads. What limits throughput is memory residency, data movement cost, prefix reuse efficiency, and scheduling intelligence. These are infrastructure problems — and infrastructure that doesn't understand the workload can't solve them.
KV cache is, in some sense, the working memory of deployed AI. Right now, that memory is treated like anonymous heap. The argument here is that it deserves its own stack: from the silicon that moves it, to the metadata layer that tracks it, to the policies that decide its fate.
The next major AI infrastructure layer may not be another accelerator. It may be the memory orchestration layer that prevents accelerators from drowning in their own context.
Open Questions Worth Sitting With
These aren't rhetorical. They're genuine unresolved design questions that will determine the shape of this infrastructure over the next several years.
- Should KV become a first-class primitive in inference runtimes and operating systems — analogous to how virtual memory is managed by the kernel today?
- Who owns garbage collection of distributed KV state across multi-host, multi-session serving clusters — and what are the consistency guarantees?
- Can prefix reuse become cluster-wide, or does locality always win over deduplication efficiency?
- How much KV compression is acceptable before task-specific quality collapses — and can this threshold be modeled automatically?
- Should KV be exposed through a semantic API (session, prefix, head, layer) rather than raw memory addresses — and what would that API look like?
- As context windows grow toward millions of tokens, does the FLOPS-vs-memory tradeoff eventually invert completely?