Why This Problem Gets Big So Fast
Long-context inference looks innocent right up until you write down the memory math. For a large grouped-query transformer, a million-token KV cache in raw FP16 can already push into the few-hundred-gigabyte range per active sequence, and denser multi-head layouts can drift toward TB-class footprints. Even when low-bit KV compression cuts that dramatically, the hard part does not disappear. The hard part becomes how to store, route, prefetch, and consume that state without drowning in HBM traffic.
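For a sense of scale, here is the back-of-envelope arithmetic. The depth, head counts, and head dimension below are illustrative assumptions for a large grouped-query model, not the spec of any particular one.

```python
# Rough per-sequence KV-cache footprint (illustrative model shape, not a real spec)
num_layers = 80
head_dim = 128
seq_len = 1_000_000

def kv_bytes(kv_heads, bytes_per_elem):
    # K and V, for every layer, for every token
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem

print(kv_bytes(kv_heads=8,  bytes_per_elem=2)   / 1e9)   # ~328 GB  grouped-query, FP16
print(kv_bytes(kv_heads=64, bytes_per_elem=2)   / 1e12)  # ~2.6 TB  dense multi-head, FP16
print(kv_bytes(kv_heads=8,  bytes_per_elem=0.5) / 1e9)   # ~82 GB   grouped-query, ~INT4
```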
This is why contiguous per-request KV buffers stop feeling elegant in production. Real serving stacks have requests of different lengths, requests that share prefixes, requests that terminate early, and requests that arrive in bursts. The KV cache is no longer just a tensor. It is a live, multi-tenant memory object under continuous growth and constant rereading.
Why PagedAttention Exists At All
The core insight behind PagedAttention is that a logical sequence does not need to be physically contiguous. It can be represented as a stream of fixed-size blocks, with a small mapping structure translating logical token ranges to physical KV pages. That makes append-heavy decode cheaper, cuts fragmentation, and opens the door to prefix sharing.
Once you represent KV this way, the runtime begins to resemble a memory manager. The sequences feel contiguous to the model, but physically they are assembled from reusable blocks. That is the move that makes shared prefixes and dynamic growth manageable.
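A minimal sketch of that mapping, with illustrative names and a 64-token block size rather than vLLM's actual internals, looks something like this:

```python
# Minimal sketch of paged logical-to-physical KV mapping.
# Block size, names, and the allocator are illustrative assumptions.
BLOCK_TOKENS = 64

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))      # simple free list of physical blocks
    def alloc(self):
        return self.free.pop()
    def release(self, block_id):
        self.free.append(block_id)               # freed blocks are reusable immediately

class BlockTable:
    """Per-sequence mapping: logical block index -> physical block id."""
    def __init__(self):
        self.blocks, self.num_tokens = [], 0

    def append_token(self, allocator):
        if self.num_tokens % BLOCK_TOKENS == 0:  # grab a new page only on overflow
            self.blocks.append(allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # The model sees a contiguous sequence; physically it is (block, offset).
        return self.blocks[token_idx // BLOCK_TOKENS], token_idx % BLOCK_TOKENS
```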
What the Ideal Stack Looks Like
The cleanest design splits responsibility three ways. The runtime owns allocation, mapping, sharing, and reclaim. The kernel owns tiled attention execution and local staging. The hardware owns high-bandwidth fetch, low-bit unpacking, page-translation assist, and enough SRAM to keep the expanded working set off HBM.
| Layer | What it should own |
|---|---|
| Runtime / software | Fixed-size KV blocks, page tables, prefix sharing, age/hotness metadata, reclaim and spill policy |
| Kernels | FlashAttention-style tiles, local decompression, QKᵀ accumulation, weighted-V accumulation, overlap of fetch and compute |
| Hardware | HBM bandwidth, larger per-SM SRAM, low-bit unpack engines, page-translation assist, large L2, CXL-aware cold-page movement |
What the Runtime Should Actually Do
A serious runtime should behave less like a tensor allocator and more like a specialized KV operating system. It should allocate fixed-size blocks, maintain a compact logical-to-physical map, optimize for cheap append during decode, and keep enough metadata to understand page precision, heat, age, shareability, and eviction status. A sketch of that bookkeeping follows the list below.
- It should optimize for append-heavy decode rather than arbitrary in-place mutation.
- It should treat shared prompt prefixes as native shared physical pages rather than copied buffers.
- It should know whether a page is FP16, FP8, INT8, or INT4 and whether that page is hot enough to keep near the GPU.
- It should organize free lists so completed sessions return capacity immediately without requiring heavy compaction.
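Concretely, the per-page record might look something like this. The field names, precision tags, and the toy reclaim rule are assumptions for illustration, not any particular runtime's API.

```python
# Sketch of per-page metadata a KV-aware runtime might keep (illustrative only).
from dataclasses import dataclass

@dataclass
class PageMeta:
    precision: str       # "fp16", "fp8", "int8", "int4"
    last_used_step: int  # decode step of last access (heat / age)
    ref_count: int       # >1 when a prompt prefix is shared across sequences
    resident: bool       # True if the page currently lives in HBM

def reclaim_or_demote(page: PageMeta, current_step: int, cold_after: int = 4096):
    """Toy policy: unreferenced pages are freed, stale ones get demoted."""
    if page.ref_count == 0:
        return "free"      # goes straight back on the free list
    if current_step - page.last_used_step > cold_after:
        return "demote"    # e.g. re-encode to int4 or spill to a colder tier
    return "keep"
```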
This is also the layer where policy gets interesting. Once page temperature and reuse become visible, the scheduler can start forming batches partly by memory locality instead of purely by token count.
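As a toy illustration of that shift, a scheduler with visibility into hot pages could order candidate requests by page overlap before packing against a token budget. The request fields and the scoring rule below are assumptions for the sketch.

```python
# Toy locality-aware batch formation: prefer requests whose mapped pages
# overlap the currently hot set, then pack by token budget as usual.
def batch_by_locality(requests, hot_pages, max_tokens):
    # Each request is assumed to expose .blocks (physical block ids) and .num_tokens.
    scored = sorted(requests, key=lambda r: -len(set(r.blocks) & hot_pages))
    batch, budget = [], max_tokens
    for r in scored:
        if r.num_tokens <= budget:
            batch.append(r)
            budget -= r.num_tokens
    return batch
```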
What the Kernel Should Actually Do
The kernel path should feel like FlashAttention extended into a paged and compressed world. A query tile becomes active, the runtime exposes the relevant page mapping, compressed KV blocks are prefetched from HBM, the active tile is unpacked into SRAM and registers, attention math runs, and the temporary tile is discarded.
By low-bit KV compression here, I mean the broad family of KV-cache schemes that store state in FP8, INT8, groupwise INT4, or similar compressed forms. The exact codec can vary. The invariant is what matters: keep the KV compressed in HBM, and only inflate the part of it that fits into an on-chip tile.
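Put together, the decode path for one query against a paged, compressed cache looks roughly like the following. This is a numerical sketch in NumPy of the online-softmax accumulation, not a real SRAM-resident kernel, and the single per-page scale is a simplification.

```python
# Schematic of paged, tiled decode attention for a single head: stream compressed
# KV page by page, inflate each tile on the fly, keep only the running softmax state.
import numpy as np

def dequantize(q_block, scale):
    return q_block.astype(np.float32) * scale          # e.g. int8 -> fp32 with a per-page scale

def paged_decode_attention(q, block_table, kv_store):
    d = q.shape[-1]
    m, l, acc = -np.inf, 0.0, np.zeros(d)               # running max, normalizer, weighted-V sum
    for block_id in block_table:                        # one tile per physical page
        k_q, v_q, scale = kv_store[block_id]            # compressed K, V and their scale
        k, v = dequantize(k_q, scale), dequantize(v_q, scale)
        s = k @ q / np.sqrt(d)                          # QK^T for this tile
        m_new = max(m, s.max())
        p = np.exp(s - m_new)
        l = l * np.exp(m - m_new) + p.sum()             # rescale old state, fold in new tile
        acc = acc * np.exp(m - m_new) + p @ v
        m = m_new
        # the dequantized tile is dropped here; only (m, l, acc) persist
    return acc / l
```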
Why SRAM Is the Center of Gravity
HBM gets all the attention because it is large and bandwidth-rich, but the most important part of the story is what happens after the bytes leave HBM. If compressed KV is going to pay off, the expanded working set must live briefly in SRAM or registers and then disappear. If it gets materialized back into a giant HBM tensor, the whole point is lost.
```python
# Example dequantized tile
tile_tokens = 128
kv_heads = 8
head_dim = 128
bytes_per = 2  # FP16 temporary

# K + V
tile_bytes = tile_tokens * kv_heads * head_dim * bytes_per * 2
# = 128 * 8 * 128 * 2 * 2 = 524,288 bytes ≈ 512 KB
```
That half-megabyte estimate is for one useful K/V tile before counting query fragments, softmax state, scale metadata, overlap buffers, or safety margin. That is why a future-focused inference accelerator wants something like 512 KB to 1 MB per SM as a comfortable floor, with 1 MB to 2 MB starting to feel genuinely generous for compressed attention.
Concrete Hardware Targets
| Resource | Desirable target | Why it matters |
|---|---|---|
| Per-SM SRAM / shared memory | 512 KB to 1 MB desirable, 2 MB excellent | Room for useful K/V tiles plus overlap and control state |
| L2 cache | 64 MB to 256 MB class | Hot page metadata and shared prefixes become worth keeping close |
| HBM bandwidth | HBM3e-class and above, 3 TB/s minimum, 5 TB/s+ ideal | Even compressed decode remains memory-sensitive |
| KV block geometry | 32 or 64 tokens often attractive | Good balance between metadata cost and scheduling flexibility |
| Compressed KV format | FP8 / INT8 baseline, INT4 for colder pages | Lets runtime vary precision by age, heat, and reuse |
| Metadata overhead | Below 2% of total KV capacity | The page table must not become a second memory problem |
If the title promises hardware and software, this is where the hardware side needs to land. The ideal device wants not just more HBM, but larger SRAM per SM, native low-bit unpack and scale units, better support for block-stream gather, and some lightweight translation assist so page walks do not become the moral equivalent of TLB misses in a hot decode loop.
What Can Go Wrong
Elegant systems stories get stronger when the pain points are explicit. Paging everything sounds beautiful until the tradeoffs arrive.
TLB-like lookup overhead
The moment logical token ranges must be translated into physical blocks, you create the equivalent of a page walk. If translation metadata is cache-cold or awkwardly structured, the kernel can stall on bookkeeping before it reaches useful math.
Page size versus tile size mismatch
Smaller pages are great for sharing and fragmentation, but they increase translation overhead. Larger pages stream better, but they can mismatch the kernel’s actual tile size and create underused tails.
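A quick back-of-envelope makes the tension visible. The entry size and context length below are assumptions for illustration.

```python
# Page-size tradeoff: smaller pages mean more translation entries,
# larger pages mean more unused tail slots. Assumed numbers, for illustration.
context_tokens = 1_000_000
entry_bytes = 8                                   # assumed bytes per page-table entry

for block_tokens in (16, 32, 64, 128, 256):
    entries = -(-context_tokens // block_tokens)  # ceil division
    table_kb = entries * entry_bytes / 1024       # per-sequence translation metadata
    tail_waste = block_tokens / 2                 # expected unused slots in the last page
    print(f"{block_tokens:>4} tok/page  {table_kb:8.1f} KB table  ~{tail_waste:.0f} wasted slots")
```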
Compression accuracy versus decode latency
More aggressive KV compression reduces HBM traffic, but raises unpacking, scale-application, and correction costs. If dequantization costs more than the bytes saved, the win evaporates.
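One way to think about the crossover is a simple break-even comparison between bytes saved and unpack time added, assuming the unpack is not hidden behind the fetch. Every figure below is an assumed placeholder, not a measured number.

```python
# Toy break-even: compression helps only while the HBM time it saves
# exceeds the unpack time it adds. Throughputs are assumed placeholders.
hbm_bytes_per_s = 4e12        # assumed HBM bandwidth
unpack_elems_per_s = 2e13     # assumed on-chip dequant + scale throughput

def tile_us(n_elems, bytes_per_elem, unpack):
    fetch = n_elems * bytes_per_elem / hbm_bytes_per_s
    dequant = n_elems / unpack_elems_per_s if unpack else 0.0
    return (fetch + dequant) * 1e6

n = 128 * 8 * 128                      # elements in one K or V tile
print(tile_us(n, 2.0, unpack=False))   # FP16 straight from HBM
print(tile_us(n, 0.5, unpack=True))    # INT4 plus on-chip inflation
# If the assumed unpack rate drops far enough, the second number overtakes the first.
```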
Prefix sharing versus isolation
Shared physical pages improve efficiency, but multi-tenant serving then has to think harder about metadata leakage, timing visibility, and safe reuse boundaries.
Fragmentation changes shape
Paging removes the need for giant contiguous buffers, but it does not remove waste altogether. Instead you get internal slack, metadata overhead, and scheduling friction when batches mix very different page formats and ages.
Cold Tiers Are Real, and They Hurt
The memory hierarchy endgame likely extends beyond HBM. Registers hold active fragments. SRAM holds the live decompressed tile. L2 caches metadata and hot prefixes. HBM holds the compressed active working set. Host DRAM or CXL-attached memory becomes a colder KV tier, and SSD-backed archival state may eventually hold the deepest, least urgent context.
But the cold tiers only make sense if the runtime is proactive. Pulling a cold KV page back from DRAM, CXL memory, or SSD into HBM can be far more expensive than a normal decode-step budget wants to tolerate.
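A rough calculation shows why. All of the numbers below are assumptions for illustration, not measurements.

```python
# Rough cost of restoring a cold stretch of KV from a slower tier.
cold_tokens = 100_000                 # portion of the context that went cold
bytes_per_token = 164_000             # ~INT8 KV for the large GQA shape used earlier
link_bytes_per_s = 64e9               # assumed effective host / CXL link bandwidth

restore_ms = cold_tokens * bytes_per_token / link_bytes_per_s * 1e3
print(f"restoring {cold_tokens:,} cold tokens ≈ {restore_ms:.0f} ms")  # roughly a quarter second

# Against a per-token decode budget measured in tens of milliseconds, a reactive
# fetch like this stalls the sequence for many steps; prefetch has to run ahead.
```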
What the Ideal System Would Feel Like
If this stack were fully realized, long-context inference would feel less fragile. The runtime would allocate and share KV pages naturally. The scheduler would batch partly by locality and block reuse, not just token count. The kernels would stream context tile by tile without ever demanding giant contiguous buffers. The hardware would make page fetch, low-bit unpacking, and on-chip staging cheap enough that the software architecture actually holds.
That is the real endgame. PagedAttention gives the system a global memory model. FlashAttention gives it an execution model. Low-bit KV schemes give it a bandwidth model. The right hardware turns those into one coherent stack so the KV cache stops behaving like an awkward inference tax and starts behaving like a real memory subsystem.