MAN\SH AI / Writings

AI Memory Infrastructure · 15 min read

Memory Systems · Inference Infrastructure · Architecture

The Memory Crisis Slowing Every AI Inference System

We spent the training era obsessing over FLOPS. The inference era has a different bottleneck — and it's not compute. It's what we do with KV cache.

Deep Technical · KV Infrastructure V3

Here's a number that concentrates the mind: a fleet of Llama-70B instances serving 100 concurrent users at 128k-token context generates over a terabyte of KV cache. Not model weights. Not activations. Just the attention key-value state — the temporary memory the model builds while it thinks.

And that terabyte has nowhere good to live.

High-bandwidth memory on the GPU is too small. DRAM is cheap but slow. The interconnects shuttling data between them are becoming the true throughput ceiling of modern inference. And yet nearly every production serving stack treats KV cache as a detail — a side effect of the forward pass — rather than as the primary infrastructure concern it has become.

That mismatch is the subject of this piece.

CXL exposes memory. A KV infrastructure layer should expose intent.

— the core thesis

What KV Cache Actually Is

Before the architecture argument, a quick grounding. In a transformer doing autoregressive decoding, each new token must attend to every token that came before it. Without caching, that means recomputing the attention keys and values for the entire history on every single step: each step reruns attention over all n tokens, at a cost that scales as O(n²) in sequence length.

KV cache solves this by saving those keys and values after the first pass. Each new decode step reads from cache and only runs the attention computation for the new token. That step still has to read every cached key and value, so its cost is O(n) rather than constant, and the work shifts from compute to memory traffic.

Conceptual: without vs. with KV cache
// Without KV cache: recompute everything, every step
token_N_output = attention(all_tokens_0_to_N)   // O(N²) per step

// With KV cache: load past, compute present
kv_store[session] = cache_keys_and_values(tokens_0_to_N-1)
token_N_output = attention(kv_store[session], new_token)  // O(N) per step: one query over N cached keys

The throughput win is dramatic. But so is the cost: the cache is a live, growing, session-bound memory object that persists for the entire life of a conversation — and that object needs somewhere to live.
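To make the caching step concrete, here is a minimal single-head sketch in plain Python. The 2-dimensional projection matrices `W_Q`, `W_K`, `W_V` and the helper names are hypothetical toy values chosen for illustration, not anything from a real model. It shows that both decode paths produce the same outputs while the cached path performs far fewer key/value projections.

```python
import math

# Hypothetical toy projection weights (2-dim model), for illustration only.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[0.3, 0.9], [0.8, 0.4]]

def project(token, matrix):
    """Toy linear projection: one dot product per matrix row."""
    return [sum(t * m for t, m in zip(token, row)) for row in matrix]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode_without_cache(tokens):
    """Recompute K and V for the whole history at every step."""
    outputs, projections = [], 0
    for n in range(1, len(tokens) + 1):
        history = tokens[:n]
        keys = [project(t, W_K) for t in history]
        values = [project(t, W_V) for t in history]
        projections += 2 * n
        outputs.append(attend(project(history[-1], W_Q), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Append one K and one V per step; reuse everything already cached."""
    keys, values, outputs, projections = [], [], [], 0
    for token in tokens:
        keys.append(project(token, W_K))
        values.append(project(token, W_V))
        projections += 2
        outputs.append(attend(project(token, W_Q), keys, values))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
uncached, uncached_work = decode_without_cache(tokens)
cached, cached_work = decode_with_cache(tokens)
```

For these four tokens the uncached path performs 20 key/value projections and the cached path performs 8. Note that `attend` still walks every cached key on each step, which is why the cached path is bound by memory reads rather than by compute.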

The Scale Problem Is Nonlinear

Here's what makes this urgent rather than merely interesting. KV memory isn't set by model size alone: it is linear in context length and linear in concurrency, so it grows with their product, and session persistence keeps it resident.

| Model | Context | Sessions | KV memory | KV vs. FP16 weights |
| --- | --- | --- | --- | --- |
| Llama-8B | 32k | 100 | ~70 GB | ~4× (weights ~16 GB) |
| Llama-70B | 32k | 100 | ~400 GB | ~3× (weights ~140 GB) |
| Llama-70B | 128k | 100 | >1 TB | >7× (weights ~140 GB) |
| Persistent agent | multi-session | many users | infra-scale | unbounded |

In the 128k case, KV cache exceeds 1 TB against roughly 140 GB of FP16 weights, which makes it more than seven times the size of the model it belongs to. Any HBM-first strategy collapses here. And with agentic systems that accumulate memory across sessions, there's no natural upper bound at all.
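The arithmetic behind figures like these can be sketched in a few lines. The shape below (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and the 2-byte FP16 values are assumptions that the article does not state; the function name `kv_cache_bytes` is hypothetical. The result is very sensitive to these choices, so it is an order-of-magnitude check and not a reproduction of the rounded figures above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, sessions, bytes_per_value=2):
    """Total KV bytes: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens * sessions

# Assumed shape for a 70B-class model with grouped-query attention, FP16 cache.
total = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                       tokens=128 * 1024, sessions=100)
print(f"{total / 1e12:.2f} TB")  # prints "4.29 TB"
```

Under these assumptions each token costs 320 KiB of cache, and 100 sessions at 128k tokens land comfortably above the one-terabyte mark. Halving `bytes_per_value` (an FP8 cache, for example) halves the total, which is one reason published figures vary.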


Why Generic Memory Systems Fail

The instinct is to throw more memory at the problem. And yes — CXL-attached memory pools, DRAM expansion, NVMe spilling — these help. But they treat KV cache as anonymous bytes, and KV cache is not anonymous bytes.

HBM Is Capacity-Constrained
An H100 offers ~80 GB of HBM. A single 70B long-context session can consume half of it. Scaling concurrency on HBM alone is a dead end.
CXL Is Semantically Blind
CXL expands capacity, but doesn't know a KV tensor from a framebuffer. It can't prefetch intelligently, compress with awareness of transformer structure, or prioritize reuse.
Data Movement Dominates
The energy and latency of moving KV across PCIe or CXL increasingly outweigh the cost of the attention computation itself. The bottleneck has shifted.
Reuse Goes Undetected
In multi-user deployments, system prompts and RAG context are often identical across thousands of requests. Generic memory has no mechanism to deduplicate them.
LRU Eviction Is Wrong
Least-recently-used eviction policies don't know which KV blocks are needed next. A decode in progress should never be evicted, but LRU can't tell the difference.
Topology Is Invisible
Moving KV within a GPU is nearly free. Across NVLink is fast. Over PCIe is slow. Over a network fabric is painful. Generic allocators don't track this hierarchy.

Each of these failure modes is tractable in isolation. The problem is that fixing one without understanding the others produces a system that is fast at the wrong thing. What's needed is infrastructure that understands the semantics of KV cache — not just its bytes.

A Reference Architecture

The goal is not to replace HBM. Active decode always needs the fastest possible memory for its hot working set. The goal is to surround HBM with a semantically aware memory hierarchy that keeps it from drowning in cold, reusable, compressible KV state.

KV Infrastructure Stack: Reference Architecture

Tier 1 (Hot): GPU HBM. Active decode KV, current token window. ~3.35 TB/s bandwidth, ~80 GB capacity.
    ↕ DMA, ultra-low-latency prefetch
KV Layer (Appliance): metadata engine, prefix dedup, inline compression, prefetch scheduler, QoS + telemetry.
    ↕ PCIe / CXL / NVLink
Tier 2 (Warm): CXL / DDR pool. Warm session KV, shared prefix pool. ~100–200 GB/s, multi-TB capacity.
    ↕ Spill path, async archival
Tier 3 (Cold): NVMe / object store. Persistent sessions, archived agent memory. ~7 GB/s, PB-scale capacity.

The key insight is the KV appliance layer sitting between HBM and the slower tiers. This isn't a general memory controller — it's a component that understands transformer session structure, prefix boundaries, layer granularity, and decode urgency. It does the work that keeps HBM clean.
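A rough way to see why the tiers need an active layer between them is to estimate how long a session's KV takes to come back from each one. The sketch below uses the bandwidth figures from the stack diagram, with 150 GB/s as an assumed midpoint of the quoted CXL/DDR range. The `Tier` type and `transfer_seconds` helper are hypothetical, and the model ignores latency, link contention, and any PCIe ceiling between tiers, so the results are lower bounds only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    bandwidth_gb_s: float  # approximate sustained bandwidth in GB/s

# Figures from the stack diagram; 150 GB/s is an assumed midpoint of 100-200.
TIERS = {
    "hot": Tier("GPU HBM", 3350.0),
    "warm": Tier("CXL / DDR pool", 150.0),
    "cold": Tier("NVMe / object store", 7.0),
}

def transfer_seconds(size_gb, source, dest):
    """Lower bound on transfer time, limited by the slower of the two tiers."""
    bottleneck = min(TIERS[source].bandwidth_gb_s, TIERS[dest].bandwidth_gb_s)
    return size_gb / bottleneck

# Hypothetical 40 GB session, roughly half of an 80 GB HBM.
from_cold = transfer_seconds(40.0, "cold", "hot")
from_warm = transfer_seconds(40.0, "warm", "hot")
```

Under these assumptions the same 40 GB session needs at least about 5.7 seconds to return from the cold tier and about 0.27 seconds from the warm tier. Neither fits inside a single decode step, which is the case for moving data before the GPU asks for it.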

What the Appliance Layer Does

Prefix deduplication. System prompts, shared RAG context, and template prefixes can be stored once and referenced by pointer across thousands of sessions. A single shared prefix block serving 10,000 requests replaces 10,000 separate HBM copies with one.
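One way to picture the hash-plus-ref-count idea is a small content-addressed store. This is a sketch under stated assumptions, not a description of any existing system: the `PrefixStore` class, its method names, and the choice to key on a model identifier plus token ids are all hypothetical. Including the model id in the key is one simple guard against reusing KV state that belongs to a different model.

```python
import hashlib

class PrefixStore:
    """Content-addressed store: identical prefixes share one KV block."""

    def __init__(self):
        self._blocks = {}    # digest -> KV payload
        self._refcount = {}  # digest -> number of sessions referencing it

    @staticmethod
    def _digest(model_id, token_ids):
        # KV computed by one model is meaningless to another, so key on both.
        h = hashlib.sha256(model_id.encode("utf-8"))
        h.update(b"\x00")  # separator between the model id and the token stream
        for token in token_ids:
            h.update(token.to_bytes(4, "little"))
        return h.hexdigest()

    def acquire(self, model_id, token_ids, compute_kv):
        """Return (digest, kv); compute_kv runs only for the first requester."""
        key = self._digest(model_id, token_ids)
        if key not in self._blocks:
            self._blocks[key] = compute_kv(token_ids)
            self._refcount[key] = 0
        self._refcount[key] += 1
        return key, self._blocks[key]

    def release(self, key):
        """Drop one reference; free the block once no session uses it."""
        self._refcount[key] -= 1
        if self._refcount[key] == 0:
            del self._blocks[key], self._refcount[key]
```

A session calls `acquire` with its prefix tokens and a callback that runs prefill, then calls `release` when it ends. A fuller design would likely hash at block granularity so that partially overlapping prefixes can also share state.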

Semantic-aware compression. KV tensors have structure. Attention heads differ in volatility. Lower layers are more compressible than upper layers. A KV-aware compressor can apply different strategies per-head and per-layer rather than treating the tensor as flat bytes.
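As a small illustration of per-head treatment, the sketch below applies symmetric integer quantization with a separate scale and bit width for each head. The function names and the idea of passing a `bits_per_head` policy are hypothetical. How to choose that policy for a given model is exactly the empirical question, and nothing here validates a particular choice.

```python
def quantize_head(values, bits=8):
    """Symmetric quantization of one head's values using its own scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = peak / qmax
    return [round(v / scale) for v in values], scale

def dequantize_head(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

def compress_layer(heads, bits_per_head):
    """Quantize each head with its own bit width (hypothetical policy input)."""
    return [quantize_head(h, b) for h, b in zip(heads, bits_per_head)]

# Two toy heads with very different value ranges.
heads = [[0.5, -1.0, 0.25], [10.0, -20.0, 5.0]]
packed = compress_layer(heads, bits_per_head=[8, 4])
```

Because each head carries its own scale, the large-range head does not wash out the precision of the small-range one, which is what a flat-bytes compressor cannot arrange. The reconstruction error per value is bounded by half of that head's scale.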

Predictive prefetch. Unlike generic paging, this layer knows which sessions are in active decode — and can prefetch the next KV blocks from slower tiers before the GPU asks for them, hiding latency behind computation.
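A minimal sketch of the scheduling side, assuming the layer can estimate how many decode steps remain before each block is needed. The `PrefetchScheduler` class and the `steps_until_needed` estimate are hypothetical, and a real scheduler would also account for block size and link bandwidth.

```python
import heapq

class PrefetchScheduler:
    """Orders pending KV fetches so the most urgent decode is served first."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker that keeps insertion order stable

    def request(self, session_id, block_id, steps_until_needed):
        # Fewer steps until the block is needed means higher urgency.
        heapq.heappush(
            self._heap, (steps_until_needed, self._seq, session_id, block_id)
        )
        self._seq += 1

    def next_batch(self, max_blocks):
        """Pop up to max_blocks fetches, most urgent first."""
        batch = []
        while self._heap and len(batch) < max_blocks:
            _, _, session_id, block_id = heapq.heappop(self._heap)
            batch.append((session_id, block_id))
        return batch
```

Each DMA cycle would call `next_batch` with however many blocks the link can carry, so sessions about to stall are served ahead of sessions with slack.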

Urgency-aware eviction. Eviction decisions are made with decode priority in mind. A session mid-generation is never evicted. A completed session with reusable prefix is archived, not deleted.
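The difference from plain LRU fits in a few lines. In this sketch the `Session` fields and function names are hypothetical. The policy is the one described above: sessions in active decode are never candidates, and a victim with a reusable prefix is archived instead of deleted.

```python
from dataclasses import dataclass

@dataclass
class Session:
    session_id: str
    last_used: float         # timestamp of the most recent access
    decoding: bool           # True while generation is in progress
    has_shared_prefix: bool  # True if its KV includes a reusable prefix

def choose_victim(sessions):
    """Pick the least recently used session that is not mid-decode."""
    idle = [s for s in sessions if not s.decoding]
    if not idle:
        return None  # nothing is safe to evict; the caller must apply backpressure
    return min(idle, key=lambda s: s.last_used)

def evict(sessions):
    """Return (victim, action): archive reusable state, delete the rest."""
    victim = choose_victim(sessions)
    if victim is None:
        return None, "none"
    return victim, "archive" if victim.has_shared_prefix else "delete"

sessions = [
    Session("a", last_used=1.0, decoding=True, has_shared_prefix=False),
    Session("b", last_used=2.0, decoding=False, has_shared_prefix=True),
    Session("c", last_used=3.0, decoding=False, has_shared_prefix=False),
]
victim, action = evict(sessions)
```

Plain LRU would pick session `a` because it has the oldest timestamp, even though it is mid-generation. Here the victim is `b`, and it is archived because its prefix can be reused.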

The ASIC Opportunity

Software-only implementations of the above are bottlenecked by the host CPU's ability to manage metadata, schedule transfers, and run compression. A KV-aware ASIC changes this.

Key Framing
A KV ASIC is not a GPU competitor. It's a semantic memory controller, the kind of purpose-built offload chip that historically appears when a specific workload grows large enough to justify dedicated silicon. Think: what a NIC did for networking, what an SSD controller did for storage.
KV Infrastructure ASIC — functional block diagram
┌─────────────────────────────────────────────────┐
│           KV Infrastructure ASIC                │
├─────────────────────────────────────────────────┤
│  PCIe / CXL / NVLink Front End                  │  ← fabric interface
│  High-Bandwidth DMA Engines                     │  ← async movement
│  KV Compression / Decompression Units           │  ← structure-aware
│  KV Metadata SRAM (session, layer, head index)  │  ← fast lookup
│  Prefix Dedup Engine                            │  ← hash + ref-count
│  Prefetch Scheduler (decode urgency model)      │  ← prediction engine
│  QoS + Per-Session Isolation + Telemetry        │  ← observability
├─────────────────────────────────────────────────┤
│  Memory Controllers: HBM │ DDR │ CXL │ NVMe     │
└─────────────────────────────────────────────────┘

How This Differs From Existing Work

Several systems are moving in this direction. It's worth being precise about where they sit relative to this proposal.

| System | Primary focus | What's different here |
| --- | --- | --- |
| vLLM PagedAttention | Efficient KV paging within a single runtime | This proposal lifts KV orchestration into infrastructure-wide, cross-process, cross-host scope. |
| LMCache | KV sharing and caching across requests | Extends into dedicated memory fabrics, hardware compression offload, and ASIC-level coordination. |
| Mooncake | Distributed KV transport between inference nodes | Adds metadata-aware placement, semantic eviction, and ASIC-backed movement primitives. |
| NVIDIA Dynamo | Inference request orchestration | Focus here is the memory topology layer beneath orchestration: placement and movement, not routing. |
| TensorRT-LLM | Inference kernel optimization | Orthogonal: this is infrastructure-layer, not kernel-layer. |

The pattern across the field is that each system solves one layer of the problem well. What's argued here is that the layers need to be composed into a coherent stack with a single semantic contract: KV blocks have identity, context, priority, and reuse potential. Generic memory systems cannot express that contract. Dedicated infrastructure can.
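To make that contract less abstract, here is one possible shape for the metadata a block might carry. Every name in this sketch is hypothetical, and it is meant only to show that identity, context, priority, and reuse potential can be expressed as plain fields that a placement or eviction policy can read.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Priority(IntEnum):
    COLD = 0
    WARM = 1
    ACTIVE_DECODE = 2

@dataclass(frozen=True)
class KVBlockDescriptor:
    # Identity: which slice of which session this block holds.
    session_id: str
    layer: int
    head: int
    token_start: int
    token_end: int  # exclusive
    # Context: the model that produced it and the shared prefix, if any.
    model_id: str
    prefix_digest: Optional[str]
    # Priority and reuse potential.
    priority: Priority
    shared_refs: int

    def evictable(self):
        """Blocks serving an active decode must stay resident."""
        return self.priority != Priority.ACTIVE_DECODE

    def reusable(self):
        """A block tied to a shared prefix is worth archiving, not deleting."""
        return self.prefix_digest is not None
```

A raw memory address can answer none of these questions, which is the sense in which generic memory systems cannot express the contract.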

What This Does Not Solve

Credibility requires honesty about limitations. Dedicated KV infrastructure does not solve every problem in this space.

Honest Caveats
Each of these deserves attention before production deployment:
HBM Remains Essential
The hot-tier requirement doesn't go away. Active decode always needs HBM-level bandwidth. This reduces pressure; it doesn't eliminate the need.
Compression Has Quality Limits
Aggressive KV quantization and compression degrade perplexity. The quality-efficiency tradeoff is model- and task-specific and must be characterized empirically.
Metadata Consistency Is Hard
Distributed KV metadata across host processes and machines introduces a consistency problem. Incorrect reuse can corrupt inference. This is a serious correctness risk.
Observability Is Not Optional
Without session-level telemetry, the system becomes a black box. Debugging eviction decisions requires instrumentation that must be designed in from the start.
Fabric Pressure Persists
PCIe and CXL bandwidth limits don't disappear. Intelligent movement reduces pressure; it can't exceed physical interconnect limits.
System Complexity Grows
Adding an infrastructure layer adds failure modes. Each new component — appliance, metadata store, dedup engine — is a new thing that can break or misbehave.

Why This Layer May Matter More Than the Next GPU

The training era taught the industry a clean lesson: more FLOPS, better models. The scaling laws held, the case for investment was obvious, and infrastructure spending concentrated around compute.

The inference era is messier. FLOPS are no longer the limiter for many deployed workloads. What limits throughput is memory residency, data movement cost, prefix reuse efficiency, and scheduling intelligence. These are infrastructure problems — and infrastructure that doesn't understand the workload can't solve them.

KV cache is, in some sense, the working memory of deployed AI. Right now, that memory is treated like anonymous heap. The argument here is that it deserves its own stack: from the silicon that moves it, to the metadata layer that tracks it, to the policies that decide its fate.

The next major AI infrastructure layer may not be another accelerator. It may be the memory orchestration layer that prevents accelerators from drowning in their own context.

Open Questions Worth Sitting With

These aren't rhetorical. They're genuine unresolved design questions that will determine the shape of this infrastructure over the next several years.

  1. Should KV become a first-class primitive in inference runtimes and operating systems — analogous to how virtual memory is managed by the kernel today?
  2. Who owns garbage collection of distributed KV state across multi-host, multi-session serving clusters — and what are the consistency guarantees?
  3. Can prefix reuse become cluster-wide, or does locality always win over deduplication efficiency?
  4. How much KV compression is acceptable before task-specific quality collapses — and can this threshold be modeled automatically?
  5. Should KV be exposed through a semantic API (session, prefix, head, layer) rather than raw memory addresses — and what would that API look like?
  6. As context windows grow toward millions of tokens, does the FLOPS-vs-memory tradeoff eventually invert completely?