Why This Problem Gets Big So Fast
Long-context inference looks innocent right up until you write down the memory math. For a large grouped-query transformer, a million-token KV cache in raw FP16 can already push into the few-hundred-gigabyte range per active sequence, and denser multi-head layouts can drift toward TB-class footprints. Even when low-bit KV compression cuts that dramatically, the hard part does not disappear. The hard part becomes how to store, route, prefetch, and consume that state without drowning in HBM traffic.
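For a sense of scale, here is the back-of-envelope arithmetic. The depth, head counts, and head dimension below are illustrative assumptions for a large grouped-query model, not the spec of any particular one.

```python
# Rough per-sequence KV-cache footprint (illustrative model shape, not a real spec)
num_layers = 80
head_dim = 128
seq_len = 1_000_000

def kv_bytes(kv_heads, bytes_per_elem):
    # K and V, for every layer, for every token
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem

print(kv_bytes(kv_heads=8,  bytes_per_elem=2)   / 1e9)   # ~328 GB  grouped-query, FP16
print(kv_bytes(kv_heads=64, bytes_per_elem=2)   / 1e12)  # ~2.6 TB  dense multi-head, FP16
print(kv_bytes(kv_heads=8,  bytes_per_elem=0.5) / 1e9)   # ~82 GB   grouped-query, ~INT4
```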
This is why contiguous per-request KV buffers stop feeling elegant in production. Real serving stacks have requests of different lengths, requests that share prefixes, requests that terminate early, and requests that arrive in bursts. The KV cache is no longer just a tensor. It is a live, multi-tenant memory object under continuous growth and constant rereading.
Why PagedAttention Exists At All
The core insight behind PagedAttention is that a logical sequence does not need to be physically contiguous. It can be represented as a stream of fixed-size blocks, with a small mapping structure translating logical token ranges to physical KV pages. That makes append-heavy decode cheaper, cuts fragmentation, and opens the door to prefix sharing.
Once you represent KV this way, the runtime begins to resemble a memory manager. The sequences feel contiguous to the model, but physically they are assembled from reusable blocks. That is the move that makes shared prefixes and dynamic growth manageable.
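A minimal sketch of that mapping, with illustrative names and a 64-token block size rather than vLLM's actual internals, looks something like this:

```python
# Minimal sketch of paged logical-to-physical KV mapping.
# Block size, names, and the allocator are illustrative assumptions.
BLOCK_TOKENS = 64

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))      # simple free list of physical blocks
    def alloc(self):
        return self.free.pop()
    def release(self, block_id):
        self.free.append(block_id)               # freed blocks are reusable immediately

class BlockTable:
    """Per-sequence mapping: logical block index -> physical block id."""
    def __init__(self):
        self.blocks, self.num_tokens = [], 0

    def append_token(self, allocator):
        if self.num_tokens % BLOCK_TOKENS == 0:  # grab a new page only on overflow
            self.blocks.append(allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # The model sees a contiguous sequence; physically it is (block, offset).
        return self.blocks[token_idx // BLOCK_TOKENS], token_idx % BLOCK_TOKENS
```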
What the Ideal Stack Looks Like
The cleanest design splits responsibility three ways. The runtime owns allocation, mapping, sharing, and reclaim. The kernel owns tiled attention execution and local staging. The hardware owns high-bandwidth fetch, low-bit unpacking, page-translation assist, and enough SRAM to keep the expanded working set off HBM.
| Layer | What it should own |
|---|---|
| Runtime / software | Fixed-size KV blocks, page tables, prefix sharing, age/hotness metadata, reclaim and spill policy |
| Kernels | FlashAttention-style tiles, local decompression, QKᵀ accumulation, weighted-V accumulation, overlap of fetch and compute |
| Hardware | HBM bandwidth, larger per-SM SRAM, low-bit unpack engines, page-translation assist, large L2, CXL-aware cold-page movement |
What the Runtime Should Actually Do
A serious runtime should behave less like a tensor allocator and more like a specialized KV operating system. It should allocate fixed-size blocks, maintain a compact logical-to-physical map, optimize for cheap append during decode, and keep enough metadata to understand page precision, heat, age, shareability, and eviction status. A sketch of that bookkeeping follows the list below.
- It should optimize for append-heavy decode rather than arbitrary in-place mutation.
- It should treat shared prompt prefixes as native shared physical pages rather than copied buffers.
- It should know whether a page is FP16, FP8, INT8, or INT4 and whether that page is hot enough to keep near the GPU.
- It should organize free lists so completed sessions return capacity immediately without requiring heavy compaction.
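Concretely, the per-page record might look something like this. The field names, precision tags, and the toy reclaim rule are assumptions for illustration, not any particular runtime's API.

```python
# Sketch of per-page metadata a KV-aware runtime might keep (illustrative only).
from dataclasses import dataclass

@dataclass
class PageMeta:
    precision: str       # "fp16", "fp8", "int8", "int4"
    last_used_step: int  # decode step of last access (heat / age)
    ref_count: int       # >1 when a prompt prefix is shared across sequences
    resident: bool       # True if the page currently lives in HBM

def reclaim_or_demote(page: PageMeta, current_step: int, cold_after: int = 4096):
    """Toy policy: unreferenced pages are freed, stale ones get demoted."""
    if page.ref_count == 0:
        return "free"      # goes straight back on the free list
    if current_step - page.last_used_step > cold_after:
        return "demote"    # e.g. re-encode to int4 or spill to a colder tier
    return "keep"
```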
This is also the layer where policy gets interesting. Once page temperature and reuse become visible, the scheduler can start forming batches partly by memory locality instead of purely by token count.
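As a toy illustration of that shift, a scheduler with visibility into hot pages could order candidate requests by page overlap before packing against a token budget. The request fields and the scoring rule below are assumptions for the sketch.

```python
# Toy locality-aware batch formation: prefer requests whose mapped pages
# overlap the currently hot set, then pack by token budget as usual.
def batch_by_locality(requests, hot_pages, max_tokens):
    # Each request is assumed to expose .blocks (physical block ids) and .num_tokens.
    scored = sorted(requests, key=lambda r: -len(set(r.blocks) & hot_pages))
    batch, budget = [], max_tokens
    for r in scored:
        if r.num_tokens <= budget:
            batch.append(r)
            budget -= r.num_tokens
    return batch
```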
What the Kernel Should Actually Do
The kernel path should feel like FlashAttention extended into a paged and compressed world. A query tile becomes active, the runtime exposes the relevant page mapping, compressed KV blocks are prefetched from HBM, the active tile is unpacked into SRAM and registers, attention math runs, and the temporary tile is discarded.
By low-bit KV compression here, I mean the broad family of KV-cache schemes that store state in FP8, INT8, groupwise INT4, or similar compressed forms. The exact codec can vary. The invariant is what matters: keep the KV compressed in HBM, and only inflate the part of it that fits into an on-chip tile.
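Put together, the decode path for one query against a paged, compressed cache looks roughly like the following. This is a numerical sketch in NumPy of the online-softmax accumulation, not a real SRAM-resident kernel, and the single per-page scale is a simplification.

```python
# Schematic of paged, tiled decode attention for a single head: stream compressed
# KV page by page, inflate each tile on the fly, keep only the running softmax state.
import numpy as np

def dequantize(q_block, scale):
    return q_block.astype(np.float32) * scale          # e.g. int8 -> fp32 with a per-page scale

def paged_decode_attention(q, block_table, kv_store):
    d = q.shape[-1]
    m, l, acc = -np.inf, 0.0, np.zeros(d)               # running max, normalizer, weighted-V sum
    for block_id in block_table:                        # one tile per physical page
        k_q, v_q, scale = kv_store[block_id]            # compressed K, V and their scale
        k, v = dequantize(k_q, scale), dequantize(v_q, scale)
        s = k @ q / np.sqrt(d)                          # QK^T for this tile
        m_new = max(m, s.max())
        p = np.exp(s - m_new)
        l = l * np.exp(m - m_new) + p.sum()             # rescale old state, fold in new tile
        acc = acc * np.exp(m - m_new) + p @ v
        m = m_new
        # the dequantized tile is dropped here; only (m, l, acc) persist
    return acc / l
```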
Why SRAM Is the Center of Gravity
HBM gets all the attention because it is large and bandwidth-rich, but the most important part of the story is what happens after the bytes leave HBM. If compressed KV is going to pay off, the expanded working set must live briefly in SRAM or registers and then disappear. If it gets materialized back into a giant HBM tensor, the whole point is lost.
```python
# Example dequantized tile
tile_tokens = 128
kv_heads = 8
head_dim = 128
bytes_per = 2  # FP16 temporary

# K + V
tile_bytes = tile_tokens * kv_heads * head_dim * bytes_per * 2
# = 128 * 8 * 128 * 2 * 2 = 524,288 bytes ≈ 512 KB
```
That half-megabyte estimate is for one useful K/V tile before counting query fragments, softmax state, scale metadata, overlap buffers, or safety margin. That is why a future-focused inference accelerator wants something like 512 KB to 1 MB per SM as a comfortable floor, with 1 MB to 2 MB starting to feel genuinely generous for compressed attention.
Concrete Hardware Targets
| Resource | Desirable target | Why it matters |
|---|---|---|
| Per-SM SRAM / shared memory | 512 KB to 1 MB desirable, 2 MB excellent | Room for useful K/V tiles plus overlap and control state |
| L2 cache | 64 MB to 256 MB class | Hot page metadata and shared prefixes become worth keeping close |
| HBM bandwidth | HBM3e-class and above, 3 TB/s minimum, 5 TB/s+ ideal | Even compressed decode remains memory-sensitive |
| KV block geometry | 32 or 64 tokens often attractive | Good balance between metadata cost and scheduling flexibility |
| Compressed KV format | FP8 / INT8 baseline, INT4 for colder pages | Lets runtime vary precision by age, heat, and reuse |
| Metadata overhead | Below 2% of total KV capacity | The page table must not become a second memory problem |
If the title promises hardware and software, this is where the hardware side needs to land. The ideal device wants not just more HBM, but larger SRAM per SM, native low-bit unpack and scale units, better support for block-stream gather, and some lightweight translation assist so page walks do not become the moral equivalent of TLB misses in a hot decode loop.
What Can Go Wrong
Elegant systems stories get stronger when the pain points are explicit. Paging everything sounds beautiful until the tradeoffs arrive.
TLB-like lookup overhead
The moment logical token ranges must be translated into physical blocks, you create the equivalent of a page walk. If translation metadata is cache-cold or awkwardly structured, the kernel can stall on bookkeeping before it reaches useful math.
Page size versus tile size mismatch
Smaller pages are great for sharing and fragmentation, but they increase translation overhead. Larger pages stream better, but they can mismatch the kernel’s actual tile size and create underused tails.
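A quick back-of-envelope makes the tension visible. The entry size and context length below are assumptions for illustration.

```python
# Page-size tradeoff: smaller pages mean more translation entries,
# larger pages mean more unused tail slots. Assumed numbers, for illustration.
context_tokens = 1_000_000
entry_bytes = 8                                   # assumed bytes per page-table entry

for block_tokens in (16, 32, 64, 128, 256):
    entries = -(-context_tokens // block_tokens)  # ceil division
    table_kb = entries * entry_bytes / 1024       # per-sequence translation metadata
    tail_waste = block_tokens / 2                 # expected unused slots in the last page
    print(f"{block_tokens:>4} tok/page  {table_kb:8.1f} KB table  ~{tail_waste:.0f} wasted slots")
```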
Compression accuracy versus decode latency
More aggressive KV compression reduces HBM traffic, but raises unpacking, scale-application, and correction costs. If dequantization costs more than the bytes saved, the win evaporates.
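One way to think about the crossover is a simple break-even comparison between bytes saved and unpack time added, assuming the unpack is not hidden behind the fetch. Every figure below is an assumed placeholder, not a measured number.

```python
# Toy break-even: compression helps only while the HBM time it saves
# exceeds the unpack time it adds. Throughputs are assumed placeholders.
hbm_bytes_per_s = 4e12        # assumed HBM bandwidth
unpack_elems_per_s = 2e13     # assumed on-chip dequant + scale throughput

def tile_us(n_elems, bytes_per_elem, unpack):
    fetch = n_elems * bytes_per_elem / hbm_bytes_per_s
    dequant = n_elems / unpack_elems_per_s if unpack else 0.0
    return (fetch + dequant) * 1e6

n = 128 * 8 * 128                      # elements in one K or V tile
print(tile_us(n, 2.0, unpack=False))   # FP16 straight from HBM
print(tile_us(n, 0.5, unpack=True))    # INT4 plus on-chip inflation
# If the assumed unpack rate drops far enough, the second number overtakes the first.
```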
Prefix sharing versus isolation
Shared physical pages improve efficiency, but multi-tenant serving then has to think harder about metadata leakage, timing visibility, and safe reuse boundaries.
Fragmentation changes shape
Paging removes the need for giant contiguous buffers, but it does not remove waste altogether. Instead you get internal slack, metadata overhead, and scheduling friction when batches mix very different page formats and ages.
Cold Tiers Are Real, and They Hurt
The memory hierarchy endgame likely extends beyond HBM. Registers hold active fragments. SRAM holds the live decompressed tile. L2 caches metadata and hot prefixes. HBM holds the compressed active working set. Host DRAM or CXL-attached memory becomes a colder KV tier, and SSD-backed archival state may eventually hold the deepest, least urgent context.
But the cold tiers only make sense if the runtime is proactive. Pulling a cold KV page back from DRAM, CXL memory, or SSD into HBM can be far more expensive than a normal decode-step budget wants to tolerate.
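A rough calculation shows why. All of the numbers below are assumptions for illustration, not measurements.

```python
# Rough cost of restoring a cold stretch of KV from a slower tier.
cold_tokens = 100_000                 # portion of the context that went cold
bytes_per_token = 164_000             # ~INT8 KV for the large GQA shape used earlier
link_bytes_per_s = 64e9               # assumed effective host / CXL link bandwidth

restore_ms = cold_tokens * bytes_per_token / link_bytes_per_s * 1e3
print(f"restoring {cold_tokens:,} cold tokens ≈ {restore_ms:.0f} ms")  # roughly a quarter second

# Against a per-token decode budget measured in tens of milliseconds, a reactive
# fetch like this stalls the sequence for many steps; prefetch has to run ahead.
```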
What the Ideal System Would Feel Like
If this stack were fully realized, long-context inference would feel less fragile. The runtime would allocate and share KV pages naturally. The scheduler would batch partly by locality and block reuse, not just token count. The kernels would stream context tile by tile without ever demanding giant contiguous buffers. The hardware would make page fetch, low-bit unpacking, and on-chip staging cheap enough that the software architecture actually holds.
That is the real endgame. PagedAttention gives the system a global memory model. FlashAttention gives it an execution model. Low-bit KV schemes give it a bandwidth model. The right hardware turns those into one coherent stack so the KV cache stops behaving like an awkward inference tax and starts behaving like a real memory subsystem.