A surprising number of inference discussions blur together three very different layers: model semantics, runtime layout, and hardware memory behavior. That is why phrases like "the model attends to earlier tokens" are correct but incomplete. They describe what the model is doing conceptually, but not how the serving stack actually finds, stores, and retrieves the state that makes that attention possible.
The missing piece is that the KV cache is not just "some memory." It is a structured, growing, repeatedly accessed inference-state object that imposes real constraints on layout, translation overhead, locality, and data movement. Once you see it that way, a long chain of things starts making more sense.
Core thesis: KV cache is not merely a tensor allocation. It is a memory-system problem with implications for layout, indexing, translation overhead, locality, page size, and data movement policy.
Start with the transformer: where KV comes from
In a transformer layer the model takes an input representation and produces three projections:
Q (query), K (key), and V (value).
During prefill, every prompt token flows through every layer in parallel.
The K and V tensors from that pass are written into the KV cache so they never need to be recomputed.
During decode, the model generates exactly one token per step.
For each new token, every layer computes a fresh Q, reads back all previously cached K and V entries via attention, and then appends the new token's K and V to the layer's cache. This continues until an EOS token or a length limit.
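To make the per-step bookkeeping concrete, here is a minimal single-layer, single-head decode step in NumPy. The shapes and the random projection matrices are purely illustrative stand-ins for a trained model, not any particular framework's API.

```python
import numpy as np

d = 64                      # head dimension (illustrative)
Wq = np.random.randn(d, d)  # toy projection matrices, stand-ins for trained weights
Wk = np.random.randn(d, d)
Wv = np.random.randn(d, d)

k_cache = []                # cached K rows for this layer, one per past token
v_cache = []                # cached V rows for this layer, one per past token

def decode_step(x):
    """One decode step for one layer: fresh Q, read all cached K/V, append new K/V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                       # the only new KV written this step
    v_cache.append(v)
    K = np.stack(k_cache)                   # (t, d): every cached key so far
    V = np.stack(v_cache)                   # (t, d): every cached value so far
    scores = K @ q / np.sqrt(d)             # attend over the full history
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # weighted sum of cached values

# One step per generated token; a real model repeats this per layer and per head.
out = decode_step(np.random.randn(d))
```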
This matters because there is no single "the KV cache"—there are L independent layer caches, each growing in lockstep. For a 32-layer model running a 4K-token context at fp16 with 32 heads of dimension 128, the KV cache already occupies roughly 4K × 32 × 32 × 128 × 2 × 2 bytes ≈ 2 GB of memory. At 32K tokens that becomes 16 GB—and that is a single request.
- 4K context: ≈ 2 GB KV per session
- 32K context: ≈ 16 GB KV per session
- Concurrent sessions: multiply all of the above
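A back-of-the-envelope helper makes the scaling explicit. This is a sketch assuming fp16 (2 bytes per element) and both K and V stored for every layer and head; the function name and parameters are illustrative.

```python
def kv_cache_bytes(tokens, layers, heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence: K and V, every layer, every head."""
    return 2 * tokens * layers * heads * head_dim * bytes_per_elem

gib = 1024 ** 3
print(kv_cache_bytes(4_096, 32, 32, 128) / gib)    # ~2.0 GiB at a 4K context
print(kv_cache_bytes(32_768, 32, 32, 128) / gib)   # ~16.0 GiB at a 32K context
# Concurrent sessions multiply this again: 8 sessions at 32K is ~128 GiB of KV alone.
```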
What attention indexing actually means
At the abstract model level, "attention" sounds simple: a new token's query looks back at every earlier key and value and computes a weighted sum. But a serving system cannot operate on that abstraction. It has to answer a far more concrete question on every decode step:
for request R, layer L, head-group H, token position T:
→ where are the K and V bytes stored right now?
→ in which buffer, at which offset, on which device?
That lookup problem—mapping logical sequence position to physical storage location—is what people mean by attention indexing. It is the runtime's bookkeeping system, and it is not trivial.
- Logical coordinates: request ID, layer ID, token position, head group, block ID. This is the model-serving view of what we need.
- Physical locations: buffer pointer, block offset, device allocation, host spill slot, or arena region. This is where the bytes live.
In a naive flat-tensor implementation, indexing is pure arithmetic over a contiguous stride layout. But production serving systems use more sophisticated structures because naive layouts create problems at scale: fragmentation, wasted capacity, and inability to share prefixes across requests.
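In that naive layout, indexing really is just offset arithmetic over fixed strides. A minimal sketch, assuming an illustrative [request, layer, position, head] ordering:

```python
def flat_kv_offset(req, layer, pos, head,
                   max_len, n_layers, n_heads, head_dim, elem_size=2):
    """Byte offset of one K (or V) vector in a contiguous, pre-reserved tensor."""
    elems = ((req * n_layers + layer) * max_len + pos) * n_heads + head
    return elems * head_dim * elem_size

# Every request reserves max_len slots up front, used or not -- the source of
# the fragmentation and wasted capacity described above.
off = flat_kv_offset(req=17, layer=22, pos=1234, head=5,
                     max_len=32_768, n_layers=32, n_heads=32, head_dim=128)
```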
A good mental model: the operating system gives you the building. The runtime decides that "request 17, layer 22, token block 5" is in shelf C, rack 9, bin 4. Attention indexing is the shelf/bin lookup logic, not the land deed.
Paged allocation, popularized by vLLM, eliminates the need to pre-reserve contiguous memory proportional to the maximum sequence length. Blocks are allocated lazily and can be shared across requests that share a common prefix—a significant win for chat systems with shared system prompts.
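A sketch of what that bookkeeping can look like with a per-request block table; this is a simplified illustration of the idea, not vLLM's actual data structures.

```python
BLOCK_TOKENS = 16          # tokens per KV block (illustrative)

class PagedKV:
    """Maps (request, logical token position) to (physical block, slot)."""
    def __init__(self, n_physical_blocks=4096):
        self.free_blocks = list(range(n_physical_blocks))  # pool of physical block ids
        self.block_table = {}                              # request id -> list of block ids

    def append_token(self, req, pos):
        table = self.block_table.setdefault(req, [])
        if pos // BLOCK_TOKENS >= len(table):              # lazily allocate a new block
            table.append(self.free_blocks.pop())
        # nothing to do if the current block still has room

    def lookup(self, req, pos):
        block = self.block_table[req][pos // BLOCK_TOKENS]
        return block, pos % BLOCK_TOKENS                   # physical block id, slot in block

    def fork_prefix(self, parent, child, shared_tokens):
        """Share the parent's prefix blocks with a child request (copy-on-write elsewhere)."""
        n_blocks = shared_tokens // BLOCK_TOKENS
        self.block_table[child] = self.block_table[parent][:n_blocks]

kv = PagedKV()
for pos in range(40):                      # request 0 generates 40 tokens
    kv.append_token(req=0, pos=pos)
kv.fork_prefix(parent=0, child=1, shared_tokens=32)   # new chat reusing a shared prompt
print(kv.lookup(0, 37))                    # (physical block id, slot within block)
```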
What GPU VRAM actually is — and what HBM is
A lot of people hear "VRAM" and "HBM" and treat them as synonyms or as competing things. They are neither. VRAM is the role. HBM is one physical implementation of that role.
- On-die SRAM: small, extremely fast, physically on the logic die. Used for the hottest local working set during kernel execution (think shared memory in CUDA). Not where KV typically lives.
- HBM: off-die but on-package, very high bandwidth, much larger than SRAM. This is the main VRAM tier in modern AI accelerators. KV cache, weights, and activations all live here during GPU inference.
So when someone says "this GPU has 80 GB of HBM," they are describing its main high-performance VRAM pool. That memory is not inside the logic die, but it is part of the accelerator's local memory domain in the packaging and runtime sense. The logic die and HBM stacks are wired together through a wide, short interconnect—often a silicon interposer—delivering bandwidth that system RAM cannot match.
Why CPU inference sees the problem differently
In CPU-only agentic AI systems on x86 or ARM, there is no separate HBM pool. The KV cache lives in ordinary system RAM. The CPU reads it through the normal cache hierarchy: L1 → L2 → L3 → DRAM. Every decode step re-reads a large portion of the model weights and the full KV cache through that same hierarchy—completely different physics from a GPU.
This is why conversations about CPU inference drift toward huge pages, translation overhead, NUMA locality, and memory residency rather than raw FLOP counts. For a large model, the bottleneck is getting bytes from DRAM to the execution units, not computing the arithmetic once the data arrives.
Huge pages are mostly about TLB reach, not magic speed
Every virtual memory access the CPU makes must be translated into a physical address. The Translation Lookaside Buffer (TLB) is a small hardware cache of recent translations. When a translation is not in the TLB, the CPU must walk the page table in memory—a multi-step process that can cost dozens of nanoseconds.
The key concept is TLB reach: how much virtual address space the processor can cover with the TLB entries it currently holds. With 4 KB pages, a 2 GB KV arena requires more than half a million translation entries. Even the largest TLBs have only a few thousand entries. With 2 MB huge pages, the same 2 GB fits in just over a thousand entries—likely within TLB capacity.
4 KB pages → 524,288 mappings for 2 GB KV
2 MB pages → 1,024 mappings for 2 GB KV ← 512× fewer
1 GB pages → 2 mappings for 2 GB KV ← 262,144× fewer
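Those counts are simple division; a quick sketch, assuming a fully resident 2 GiB KV arena:

```python
arena = 2 * 1024 ** 3                                # 2 GiB KV arena, fully resident
for page in (4 * 1024, 2 * 1024 ** 2, 1024 ** 3):    # 4 KB, 2 MB, 1 GB pages
    print(f"{page:>13,} B pages -> {arena // page:>9,} translations to cover the arena")
```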
The important correction: huge pages during steady-state decode are mostly about translation efficiency (fewer TLB misses, fewer page-table walks), not about making RAM itself faster or avoiding page faults. The KV data is already resident. You are eliminating the overhead of finding it.
Do all Linux inference systems use large pages?
No—but many high-performance systems do, directly or indirectly. The picture differs across architectures:
- x86-64: 4 KB base pages by default. Huge page support comes on top: 2 MB via CONFIG_HUGETLB_PAGE, and 1 GB where the hardware supports it. Transparent Huge Pages (THP) can auto-promote eligible regions. x86 does not support alternative base page sizes at compile time.
- ARM64: more flexible at the kernel level. The kernel can be compiled with 4 KB, 16 KB, or 64 KB base pages. A 64 KB base page on ARM64 means ordinary allocations already have 16× the TLB reach of 4 KB x86, which is useful for mobile and server workloads without explicit huge pages.
Production inference stacks often combine a hugetlbfs-backed arena for the KV pool (explicit 2 MB pages, pre-faulted) with ordinary 4 KB pages for everything else. This gives the KV region maximum TLB efficiency without complicating the rest of the allocator.
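A minimal sketch of such an arena on Linux, assuming 2 MB huge pages have already been reserved by the administrator (e.g. via vm.nr_hugepages). The MAP_HUGETLB value is Linux-specific and is spelled out here because Python's mmap module does not expose it on every version.

```python
import mmap

MAP_HUGETLB = 0x40000                 # Linux mmap flag value; not portable
HUGE_2MB = 2 * 1024 * 1024

def alloc_kv_arena(n_bytes):
    """Anonymous, huge-page-backed, pre-faulted arena for the KV pool."""
    n_bytes = (n_bytes + HUGE_2MB - 1) // HUGE_2MB * HUGE_2MB   # round up to 2 MB
    arena = mmap.mmap(-1, n_bytes,
                      flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB)
    for off in range(0, n_bytes, HUGE_2MB):
        arena[off:off + 1] = b"\x00"  # touch each huge page once (pre-fault)
    return arena

# Raises OSError (ENOMEM) if no huge pages are reserved, e.g.:
#   echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
```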
What pinned memory actually means
Pinned memory is memory the operating system is told not to move or swap out. It stays resident at a stable physical address, which is essential when a device wants to access it directly through DMA—direct memory access without involving the CPU on every transfer.
The deeper reason pinning exists: DMA engines work with physical addresses. If the OS is free to migrate or swap pages, the physical address backing a virtual mapping can change between the moment a DMA transfer is programmed and the moment it completes. Pinning prevents this. It says: "this physical page will not move until I say so."
Pinned memory is not about making bytes intrinsically faster. It is about making a region stable and directly accessible for hardware transfer paths that cannot tolerate the OS re-paging or relocating memory under them.
Typical places where pinning matters:
- CPU ↔ GPU transfer buffers
- Host-side staging for accelerators
- Networking / RDMA buffers
- Disaggregated-memory transfer paths
- Multi-device DMA pipelines
What pinned memory is not:
- Not on-die SRAM
- Not the same as a huge page
- Not a CPU arithmetic speedup
- Not a cache of any kind
- Not automatically the first win on CPU-only systems
Where KV pinning becomes genuinely useful
On a pure CPU-only agentic system, the CPU reads host RAM directly. Pinning does not change the memory bandwidth available or the cache hierarchy behavior. The real questions for CPU KV are: locality, NUMA alignment, page size, allocator fragmentation, and access regularity.
Pinning becomes more valuable the moment another device needs to DMA-access host-resident KV:
- GPUs with host-spilled KV: hot KV lives in HBM; overflow or colder blocks spill to system RAM. If the GPU pulls spilled KV via DMA, pinned host memory stabilizes the transfer path and avoids re-pinning overhead on each access.
- Accelerators backed by host RAM: many inference cards use host RAM as a large backing tier. If the accelerator DMA-accesses that memory, pinned regions avoid physical-address instability and enable safe async transfers.
- Heterogeneous compute pipelines: when one device computes and another stages or transfers state, pinned buffers provide predictable DMA-visible regions that the runtime can safely schedule against.
- Disaggregated memory and RDMA: zero-copy or RDMA-style flows require pinned host buffers. The remote NIC needs a stable physical address to initiate the transfer without CPU involvement.
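For the spill paths above, a pinned host buffer is what makes the device pull both address-stable and asynchronous. A minimal PyTorch sketch, assuming a CUDA device is present and using arbitrary illustrative shapes:

```python
import torch

# A cold KV block spilled to host RAM, allocated as page-locked (pinned) memory
# so the GPU's DMA engine can read it directly and the copy can run asynchronously.
spilled_kv = torch.empty(16, 32, 128, dtype=torch.float16, pin_memory=True)

stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    # non_blocking=True only actually overlaps with compute when the source is pinned
    hot_kv = spilled_kv.to("cuda", non_blocking=True)
```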
So what should you optimize first?
The right optimization order depends entirely on where your KV cache lives and what accesses it:
CPU-only agentic systems
1. Good KV layout and arena allocation — eliminate fragmentation and poor strides
2. NUMA locality — pin threads to the socket whose memory holds the KV
3. Huge pages / TLB reach — 2 MB pages for the KV arena, pre-faulted
4. Cache-friendly access patterns — block the attention kernel for L2/L3 reuse
5. KV compression or quantization — INT8 KV halves memory bandwidth demand (see the sketch after this list)
6. Pinning — only if residency guarantees or special transfer paths justify it
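As referenced in item 5, here is a minimal sketch of symmetric per-vector INT8 KV quantization; real systems differ in granularity, in how scales are stored, and in whether K and V are treated the same.

```python
import numpy as np

def quantize_kv(kv):
    """Symmetric per-vector INT8 quantization: roughly halves bytes moved per decode step vs fp16."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    return np.clip(np.round(kv / scale), -127, 127).astype(np.int8), scale

def dequantize_kv(q, scale):
    return q.astype(np.float16) * scale

k = np.random.randn(1024, 128).astype(np.float16)   # 1024 cached keys, head_dim 128
q8, s = quantize_kv(k)
print(q8.nbytes, k.nbytes)                           # 131072 vs 262144 bytes (scales extra)
```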
GPU / accelerator systems with host interaction
1. Device-local hot set placement — keep the working KV in HBM
2. Host spill policy — decide eviction order intelligently (LRU, beam priority)
3. Pinned transfer buffers — or pinned host KV regions for DMA-accessed spill
4. Page size and translation strategy — 2 MB pages for host staging regions
5. Runtime indexing and block mapping efficiency — minimize table lookup overhead
The big picture
Once you connect all the pieces, the inference memory story becomes much clearer. Attention indexing is the runtime's mapping problem—a bookkeeping system that converts logical sequence positions into physical storage addresses. VRAM is the memory role; HBM is one physical implementation of that role. Huge pages help KV cache mostly by increasing TLB reach and reducing translation overhead—not by making RAM itself faster. Pinned memory matters most when host-resident KV or staging buffers need stable, device-visible access paths for DMA.
The more you think about KV this way, the less it looks like a passive cache and the more it looks like what it really is: a first-class memory system for autoregressive inference. Each new capability—longer contexts, more concurrent sessions, larger models—hits this memory system first. The teams that understand it as a systems problem, not just a tensor sizing question, are the ones that will navigate those constraints most cleanly.
That is why a serious conversation about modern inference eventually stops being about "how big is the model?" and becomes a more grounded systems conversation:
- Where does the state live—device-local or host-resident?
- How is it indexed—flat tensor or paged block table?
- How many translation entries does the working set consume?
- What is hot versus spillable? What is the eviction policy?
- When does stability of a buffer matter more than flexibility?
- How do data movement costs compound across steps and sessions?