MANISH AI
AI Memory Systems · Deep Dive

KV Cache Is a Memory System

Attention indexing, VRAM, HBM, huge pages, TLB pressure, and pinned memory — explained from first principles for anyone building or reasoning about modern inference stacks.

Topics: KV cache · attention indexing · VRAM & HBM · huge pages & TLB · pinned memory · CPU & GPU inference

A surprising number of inference discussions blur together three very different layers: model semantics, runtime layout, and hardware memory behavior. That is why phrases like "the model attends to earlier tokens" are correct but incomplete. They describe what the model is doing conceptually, but not how the serving stack actually finds, stores, and retrieves the state that makes that attention possible.

The missing piece is that the KV cache is not just "some memory." It is a structured, growing, repeatedly accessed inference-state object that imposes real constraints on layout, translation overhead, locality, and data movement. Once you see it that way, a long chain of things starts making more sense.

Core thesis: KV cache is not merely a tensor allocation. It is a memory-system problem with implications for layout, indexing, translation overhead, locality, page size, and data movement policy.

Start with the transformer: where KV comes from

In a transformer layer the model takes an input representation and produces three projections: Q (query), K (key), and V (value). During prefill, every prompt token flows through every layer in parallel. The K and V tensors from that pass are written into the KV cache so they never need to be recomputed.

During decode, the model generates exactly one token per step. For each new token, every layer computes a fresh Q, reads back all previously cached K and V entries via attention, and then appends the new token's K and V to the layer's cache. This continues until an EOS token or a length limit.
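The prefill/decode asymmetry above can be sketched in a few lines. This is a toy model only: tensors and shapes are elided, and entries are placeholder tuples rather than real K/V projections.

```python
NUM_LAYERS = 3

# One independent, growing cache per layer — there is no single "the KV cache".
kv_cache = [[] for _ in range(NUM_LAYERS)]

def prefill(prompt_tokens):
    # All prompt tokens pass through every layer; their K,V are written once.
    for layer in range(NUM_LAYERS):
        for tok in prompt_tokens:
            kv_cache[layer].append(("K", tok))
            kv_cache[layer].append(("V", tok))

def decode_step(new_token):
    # A fresh Q attends over all cached entries, then the new K,V are appended.
    for layer in range(NUM_LAYERS):
        _context = kv_cache[layer]  # attention reads every prior K,V here
        kv_cache[layer].append(("K", new_token))
        kv_cache[layer].append(("V", new_token))

prefill([1, 2, 3, 4, 5])       # 5 prompt tokens
for tok in [6, 7, 8]:          # 3 decode steps
    decode_step(tok)
print(len(kv_cache[0]) // 2)   # → 8 tokens cached in every layer
```

Note that decode never recomputes K or V for earlier tokens — that is the entire point of the cache.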

KV cache growth — per-layer anatomy
[Diagram: prompt tokens flow through layers 0–2 during prefill; each decode step appends a K,V entry to every layer's cache. KV size = T × L × H × d_head × 2 × dtype_bytes, where T = tokens, L = layers, H = heads. Every decode step appends K,V entries to all L layer caches simultaneously.]

This matters because there is no single "the KV cache"—there are L independent layer caches, each growing in lockstep. For a 32-layer model running a 4K-token context at fp16 with 32 heads of dimension 128, the KV cache already occupies roughly 4K × 32 × 32 × 128 × 2 × 2 bytes ≈ 2 GB of memory. At 32K tokens that becomes 16 GB—and that is a single request.

  • 4K tokens × 32 layers × fp16 → ≈ 2 GB KV per session
  • 32K tokens (long context) → ≈ 16 GB KV per session
  • ×N concurrent sessions → multiply all of the above
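The per-session figures above fall straight out of the formula in the diagram. A quick calculator (the dimensions are this example model's, not universal):

```python
def kv_cache_bytes(tokens, layers, heads, head_dim, dtype_bytes=2):
    """KV size = T × L × H × d_head × 2 (one K and one V) × dtype_bytes."""
    return tokens * layers * heads * head_dim * 2 * dtype_bytes

GIB = 1024**3
print(kv_cache_bytes(4096, 32, 32, 128) / GIB)    # → 2.0  (4K context)
print(kv_cache_bytes(32768, 32, 32, 128) / GIB)   # → 16.0 (32K context)
```

Multiply either number by the concurrent session count and it is clear why long-context serving is capacity-planned around KV, not weights.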

What attention indexing actually means

At the abstract model level, "attention" sounds simple: a new token's query looks back at every earlier key and value and computes a weighted sum. But a serving system cannot operate on that abstraction. It has to answer a far more concrete question on every decode step:

for request R, layer L, head-group H, token position T:
  → where are the K and V bytes stored right now?
  → in which buffer, at which offset, on which device?

That lookup problem—mapping logical sequence position to physical storage location—is what people mean by attention indexing. It is the runtime's bookkeeping system, and it is not trivial.

Logical address

Request ID, layer ID, token position, head group, block ID. This is the model-serving view of what we need.

Physical location

Buffer pointer, block offset, device allocation, host spill slot, or arena region. This is where the bytes live.

In a naive flat-tensor implementation, indexing is pure arithmetic over a contiguous stride layout. But production serving systems use more sophisticated structures because naive layouts create problems at scale: fragmentation, wasted capacity, and inability to share prefixes across requests.

Paged KV cache — block table lookup
[Diagram: logical sequence blocks T[0..15] → blk 0, T[16..31] → blk 1, T[32..47] → blk 2, T[48..51] → blk 3 are mapped through a block table (blk 0 → slot 7, blk 1 → slot 2, blk 2 → slot 14, blk 3 → slot 1) into a physical KV pool where free slots and other requests' blocks (a shared prefix in slot 5, another request's block in slot 9) are interleaved. Logical sequence blocks need not be physically contiguous — the block table decouples them.]

A good mental model: the operating system gives you the building. The runtime decides that "request 17, layer 22, token block 5" is in shelf C, rack 9, bin 4. Attention indexing is the shelf/bin lookup logic, not the land deed.

Paged allocation, popularized by vLLM, eliminates the need to pre-reserve contiguous memory proportional to the maximum sequence length. Blocks are allocated lazily and can be shared across requests that share a common prefix—a significant win for chat systems with shared system prompts.
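The lookup itself is tiny once the block table exists. A minimal sketch mirroring the diagram's mapping (the 16-token block size matches the diagram; the dict-based table is illustrative, not a real allocator):

```python
BLOCK_SIZE = 16  # tokens per KV block, as in the diagram

# Block table for one request: logical block index -> physical pool slot.
block_table = {0: 7, 1: 2, 2: 14, 3: 1}

def locate(token_pos):
    """Map a logical token position to (physical slot, offset within block)."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    return block_table[logical_block], offset

print(locate(0))    # → (7, 0)   token 0 lives in pool slot 7
print(locate(35))   # → (14, 3)  token 35: logical block 2 → slot 14, offset 3
```

Prefix sharing falls out of the same structure: two requests whose prompts share a prefix can simply point their early block-table entries at the same physical slots.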

What GPU VRAM actually is — and what HBM is

A lot of people hear "VRAM" and "HBM" and treat them as synonyms or as competing things. They are neither. VRAM is the role. HBM is one physical implementation of that role.

GPU accelerator — memory hierarchy
[Diagram: nested memory hierarchy — on-die SRAM (10–256 MB per chip, ~10+ TB/s, hottest local working set), HBM in the VRAM role (16–192 GB, 1–4 TB/s, off-die but on-package), and host system RAM (32–512 GB DDR5, ~150 GB/s, CPU-local, KV spill / host buffers), connected over PCIe/NVLink at ~100 GB/s.]
On-die SRAM

Small, extremely fast, physically on the logic die. Used for the hottest local working set during kernel execution—think shared memory in CUDA. Not where KV typically lives.

HBM (high-bandwidth memory)

Off-die but on-package, very high bandwidth, much larger than SRAM. This is the main VRAM tier in modern AI accelerators. KV cache, weights, and activations all live here during GPU inference.

So when someone says "this GPU has 80 GB of HBM," they are describing its main high-performance VRAM pool. That memory is not inside the logic die, but it is part of the accelerator's local memory domain in the packaging and runtime sense. The logic die and HBM stacks are wired together through a wide, short interconnect—often a silicon interposer—delivering bandwidth that system RAM cannot match.

Why CPU inference sees the problem differently

In CPU-only agentic AI systems on x86 or ARM, there is no separate HBM pool. The KV cache lives in ordinary system RAM. The CPU reads it through the normal cache hierarchy: L1 → L2 → L3 → DRAM. Every decode step re-reads a large portion of the model weights and the full KV cache through that same hierarchy—completely different physics from a GPU.

CPU decode — memory access path per token step
[Diagram: per-token decode path — CPU core (SIMD/AVX) → L1 (~4 ns, 64 KB) → L2 (~10 ns, 2 MB) → shared L3 (~30 ns, 64 MB) → system DRAM (~100 ns, DDR5 at ~96 GB/s). KV cache plus weights typically exceed L3, so each decode step causes DRAM reads — this is why CPU decode is bandwidth-bound.]

This is why conversations about CPU inference drift toward huge pages, translation overhead, NUMA locality, and memory residency rather than raw FLOP counts. For a large model, the bottleneck is getting bytes from DRAM to the execution units, not computing the arithmetic once the data arrives.
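A back-of-envelope bound makes this concrete. Assume, purely for illustration, a 7B-parameter fp16 model (7B × 2 bytes ≈ 14 GB of weights) plus the 2 GB KV cache from earlier, streamed each step over the ~96 GB/s DDR5 figure in the diagram:

```python
def decode_tokens_per_sec(bytes_read_per_step, mem_bw_bytes_per_sec):
    """Upper bound on decode rate when every step must stream
    weights + KV cache from DRAM (the bandwidth-bound regime)."""
    return mem_bw_bytes_per_sec / bytes_read_per_step

weights_bytes = 14e9   # assumed: 7B params × 2 bytes (fp16)
kv_bytes = 2e9         # the 4K-context KV example from earlier
print(decode_tokens_per_sec(weights_bytes + kv_bytes, 96e9))  # → 6.0 tokens/s
```

Real systems do better than this ceiling through quantization, cache hits, and batching — but the shape of the limit is bandwidth, not FLOPs.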

Huge pages are mostly about TLB reach, not magic speed

Every virtual memory access the CPU makes must be translated into a physical address. The Translation Lookaside Buffer (TLB) is a small hardware cache of recent translations. When a translation is not in the TLB, the CPU must walk the page table in memory—a multi-step process that can cost dozens of nanoseconds.

TLB reach — 4 KB vs 2 MB vs 1 GB pages
[Diagram: TLB entries needed to cover a 2 GB KV arena — 4 KB pages: 524,288 mappings (thousands of TLB misses per step, far beyond TLB capacity); 2 MB pages: 1,024 mappings (few misses, fits in the TLB); 1 GB pages: 2 mappings (trivially in the TLB). A typical L2 TLB has 1,024–4,096 entries.]

The key concept is TLB reach: how much virtual address space the processor can cover with the TLB entries it currently holds. With 4 KB pages, a 2 GB KV arena requires more than half a million translation entries. Even the largest TLBs have only a few thousand entries. With 2 MB huge pages, the same 2 GB fits in just over a thousand entries—likely within TLB capacity.

4 KB pages  →  524,288 mappings  for 2 GB KV
2 MB pages  →      1,024 mappings  for 2 GB KV   ← 512× fewer
1 GB pages  →          2 mappings  for 2 GB KV   ← 262,144× fewer
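The table's figures are nothing more than the arena size divided by the page size:

```python
def mappings_needed(arena_bytes, page_bytes):
    """Page-table entries required to cover an arena at a given page size."""
    return arena_bytes // page_bytes

ARENA = 2 * 1024**3                          # 2 GiB KV arena
print(mappings_needed(ARENA, 4 * 1024))      # → 524288 (4 KB pages)
print(mappings_needed(ARENA, 2 * 1024**2))   # → 1024   (2 MB pages)
print(mappings_needed(ARENA, 1024**3))       # → 2      (1 GB pages)
```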

The important correction: huge pages during steady-state decode are mostly about translation efficiency (fewer TLB misses, fewer page-table walks), not about making RAM itself faster or avoiding page faults. The KV data is already resident. You are eliminating the overhead of finding it.

Do all Linux inference systems use large pages?

No—but many high-performance systems do, directly or indirectly. The picture differs across architectures:

x86 Linux

4 KB base pages by default. Huge page support comes on top: 2 MB via CONFIG_HUGETLB_PAGE, and 1 GB where the hardware supports it. Transparent Huge Pages (THP) can auto-promote eligible regions. x86 does not support alternative base page sizes at compile time.

ARM64 Linux

More flexible at the kernel level. Can be compiled with 4 KB, 16 KB, or 64 KB base pages. A 64 KB base page on ARM64 means ordinary allocations already have 16× the TLB reach of 4 KB x86—useful for mobile and server workloads without explicit huge pages.

Production inference stacks often combine: a hugetlbfs-backed arena for the KV pool (explicit 2 MB pages, pre-faulted), with ordinary 4 KB pages for everything else. This gives the KV region maximum TLB efficiency without complicating the rest of the allocator.
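One small consequence of such a split: the KV arena must be sized in whole huge pages. A minimal sketch of that sizing step (the 2 MB constant and the rounding policy are illustrative; the actual hugetlbfs mmap flags are platform-specific and omitted here):

```python
HUGE_PAGE = 2 * 1024**2  # 2 MiB, the common x86 huge-page size

def arena_size(requested_bytes, page=HUGE_PAGE):
    """Round a KV pool request up to a whole number of huge pages,
    since a hugetlbfs-backed mapping must be page-granular."""
    pages = -(-requested_bytes // page)  # ceiling division
    return pages * page

print(arena_size(3 * 10**9) // HUGE_PAGE)  # → 1431 huge pages for a ~3 GB pool
```

Pre-faulting the arena at startup (touching every page once) then pays the page-fault cost before the first request arrives, not during decode.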

What pinned memory actually means

Pinned memory is memory the operating system is told not to move or swap out. It stays resident at a stable physical address, which is essential when a device wants to access it directly through DMA—direct memory access without involving the CPU on every transfer.

The deeper reason pinning exists: DMA engines work with physical addresses. If the OS is free to migrate or swap pages, the physical address backing a virtual mapping can change between the moment a DMA transfer is programmed and the moment it completes. Pinning prevents this. It says: "this physical page will not move until I say so."

Pinned memory is not about making bytes intrinsically faster. It is about making a region stable and directly accessible for hardware transfer paths that cannot tolerate the OS re-paging or relocating memory under them.
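A toy model makes the failure mode concrete. Everything here is illustrative — a dict standing in for a page table, strings standing in for KV bytes — but the invariant it demonstrates is the real one:

```python
# DMA engines read by *physical* address; the OS may migrate unpinned
# pages, changing the virtual->physical mapping underneath the transfer.
phys_mem = {0x3C000: "KV block", 0x7E000: "KV block"}
page_table = {0xA000: 0x3C000, 0xB000: 0x7E000}
pinned = {0xA000}  # VA 0xA000 is pinned; VA 0xB000 is not

def os_migrate(vaddr, new_pa):
    if vaddr in pinned:
        return False                      # pinned pages never move
    old_pa = page_table[vaddr]
    phys_mem[new_pa] = phys_mem.pop(old_pa)  # move the bytes
    page_table[vaddr] = new_pa               # update the mapping
    return True

def dma_read(pa):
    # A DMA engine knows only the physical address it was programmed with.
    return phys_mem.get(pa, "stale/garbage")

pa = page_table[0xB000]              # program DMA against the unpinned page
os_migrate(0xB000, 0x9000)           # OS moves it before the transfer runs
print(dma_read(pa))                  # → stale/garbage
print(dma_read(page_table[0xA000]))  # pinned page → KV block
```

Pinning removes the race entirely: the mapping the DMA engine was programmed with stays valid for the lifetime of the pin.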

DMA transfer — pinned vs unpinned memory
[Diagram: two scenarios — pinned: VA 0xA000 maps to PA 0x3C000, the OS cannot move the locked page, and the GPU/DMA read of PA 0x3C000 completes correctly; unpinned: the page backing VA 0xB000 migrates away from PA 0x7E000, the GPU/DMA read hits the stale physical address, and the transfer returns wrong or stale data.]
Where pinned memory helps
  • CPU ↔ GPU transfer buffers
  • Host-side staging for accelerators
  • Networking / RDMA buffers
  • Disaggregated-memory transfer paths
  • Multi-device DMA pipelines
What pinned memory is not
  • Not on-die SRAM
  • Not the same as a huge page
  • Not a CPU arithmetic speedup
  • Not a cache of any kind
  • Not automatically the first win on CPU-only systems

Where KV pinning becomes genuinely useful

On a pure CPU-only agentic system, the CPU reads host RAM directly. Pinning does not change the memory bandwidth available or the cache hierarchy behavior. The real questions for CPU KV are: locality, NUMA alignment, page size, allocator fragmentation, and access regularity.

Pinning becomes more valuable the moment another device needs to DMA-access host-resident KV:

GPU with host-side KV spill

Hot KV lives in HBM; overflow or colder blocks spill to system RAM. If the GPU pulls spilled KV via DMA, pinned host memory stabilizes the transfer path and avoids re-pinning overhead on each access.

NPU / custom accelerators

Many inference cards use host RAM as a large backing tier. If the accelerator DMA-accesses that memory, pinned regions avoid physical address instability and enable safe async transfers.

Multi-device pipelines

When one device computes and another stages or transfers state, pinned buffers provide predictable DMA-visible regions that the runtime can safely schedule against.

RDMA / disaggregated memory

Zero-copy or RDMA-style flows require pinned host buffers. The remote NIC needs a stable physical address to initiate the transfer without CPU involvement.

So what should you optimize first?

The right optimization order depends entirely on where your KV cache lives and what accesses it:

CPU-only agentic systems

Start with locality: NUMA alignment, page size and TLB reach, allocator fragmentation, and access regularity. Pinning is not automatically the first win here — the CPU reads host RAM directly either way.

GPU / accelerator systems with host interaction

Start with KV layout and paged allocation in HBM. Then, wherever host-resident KV or staging buffers are DMA-accessed by a device, pin those regions so transfer paths stay stable.

The big picture

Once you connect all the pieces, the inference memory story becomes much clearer. Attention indexing is the runtime's mapping problem—a bookkeeping system that converts logical sequence positions into physical storage addresses. VRAM is the memory role; HBM is one physical implementation of that role. Huge pages help KV cache mostly by increasing TLB reach and reducing translation overhead—not by making RAM itself faster. Pinned memory matters most when host-resident KV or staging buffers need stable, device-visible access paths for DMA.

Concept map — how everything connects
[Diagram: concept map — KV cache (structured inference state) at the center, connected to attention indexing (logical → physical lookup), VRAM/HBM (role vs. physical implementation), huge pages/TLB (translation efficiency), and pinned memory (stable DMA access path).]

The more you think about KV this way, the less it looks like a passive cache and the more it looks like what it really is: a first-class memory system for autoregressive inference. Each new capability—longer contexts, more concurrent sessions, larger models—hits this memory system first. The teams that understand it as a systems problem, not just a tensor sizing question, are the ones that will navigate those constraints most cleanly.

That is why a serious conversation about modern inference eventually stops being about "how big is the model?" and becomes a more grounded systems conversation: