
The Ideal PagedAttention Stack

PagedAttention is not just a neat runtime trick. It is the beginning of a real memory architecture for transformer inference: global KV paging, local FlashAttention-style tiling, low-bit KV storage in HBM, and decompression only inside SRAM-sized working sets. Once context windows stretch toward the million-token regime, this stops being an implementation detail and starts becoming the product.

Published 2026-05-13 · PagedAttention · FlashAttention · HBM · SRAM · CXL

The best mental model is simple: PagedAttention should become the virtual-memory subsystem for KV cache, and FlashAttention should become the tiled execution model that consumes those pages without wasting bandwidth.

Why This Problem Gets Big So Fast

Long-context inference looks innocent right up until you write down the memory math. For a large grouped-query transformer, a million-token KV cache in raw FP16 can already push into the few-hundred-gigabyte range per active sequence, and denser multi-head layouts can drift toward TB-class footprints. Even when low-bit KV compression cuts that dramatically, the hard part does not disappear. The hard part becomes how to store, route, prefetch, and consume that state without drowning in HBM traffic.
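
Here is the back-of-envelope version of that math. The layer count, head counts, and head dimension below are illustrative assumptions, not any specific model:

# Back-of-envelope KV footprint for a hypothetical grouped-query model
layers     = 80
kv_heads   = 8          # grouped-query KV heads (a dense MHA layout might have 64)
head_dim   = 128
bytes_fp16 = 2
tokens     = 1_000_000

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # K and V
total_gb = bytes_per_token * tokens / 1e9
print(f"{bytes_per_token // 1024} KiB per token, ~{total_gb:.0f} GB per sequence")
# 320 KiB per token, ~328 GB for one million tokens in raw FP16;
# a dense 64-head layout multiplies that by 8, well into TB territory.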

This is why contiguous per-request KV buffers stop feeling elegant in production. Real serving stacks have requests of different lengths, requests that share prefixes, requests that terminate early, and requests that arrive in bursts. The KV cache is no longer just a tensor. It is a live, multi-tenant memory object under continuous growth and repeated rereads.

Why PagedAttention Exists At All

The core insight behind PagedAttention is that a logical sequence does not need to be physically contiguous. It can be represented as a stream of fixed-size blocks, with a small mapping structure translating logical token ranges to physical KV pages. That makes append-heavy decode cheaper, cuts fragmentation, and opens the door to prefix sharing.

Logical sequence to physical pages: Request A spans logical blocks L0-31 through L96-127 and Request B spans L0-31 through L64-95; the page table maps L0-31 → P17 and L32-63 → P41 as shared prefix pages, and L64-95 → P88 and L96-127 → P91 as request-local pages.

Once you represent KV this way, the runtime begins to resemble a memory manager. The sequences feel contiguous to the model, but physically they are assembled from reusable blocks. That is the move that makes shared prefixes and dynamic growth manageable.
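
A minimal sketch of that mapping layer, assuming 32-token blocks and reference-counted physical pages; the class and method names are illustrative, not any particular engine's API:

from dataclasses import dataclass, field

BLOCK_TOKENS = 32  # assumed fixed block size

@dataclass
class PhysicalPage:
    page_id: int
    ref_count: int = 0          # >1 means the page backs a shared prefix

@dataclass
class SequencePageTable:
    # entry i maps logical tokens [i*BLOCK_TOKENS, (i+1)*BLOCK_TOKENS) to a page
    pages: list = field(default_factory=list)

    def append_block(self, page: PhysicalPage) -> None:
        page.ref_count += 1
        self.pages.append(page)

    def lookup(self, token_index: int) -> PhysicalPage:
        return self.pages[token_index // BLOCK_TOKENS]

# Two requests sharing a 64-token prefix: the first two pages are the same objects.
p17, p41, p88, p91 = (PhysicalPage(i) for i in (17, 41, 88, 91))
req_a, req_b = SequencePageTable(), SequencePageTable()
for page in (p17, p41, p88, p91):
    req_a.append_block(page)
for page in (p17, p41):
    req_b.append_block(page)

assert req_a.lookup(40) is req_b.lookup(40)   # token 40 hits shared page P41

Freeing a request then just decrements reference counts, and a physical page returns to the free pool only when nothing maps to it anymore.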

What the Ideal Stack Looks Like

The cleanest design splits responsibility three ways. The runtime owns allocation, mapping, sharing, and reclaim. The kernel owns tiled attention execution and local staging. The hardware owns high-bandwidth fetch, low-bit unpacking, page-translation assist, and enough SRAM to keep the expanded working set off HBM.

Layer | What it should own
Runtime / software | Fixed-size KV blocks, page tables, prefix sharing, age/hotness metadata, reclaim and spill policy
Kernels | FlashAttention-style tiles, local decompression, QKᵀ accumulation, weighted-V accumulation, overlap of fetch and compute
Hardware | HBM bandwidth, larger per-SM SRAM, low-bit unpack engines, page-translation assist, large L2, CXL-aware cold-page movement

What the Runtime Should Actually Do

A serious runtime should behave less like a tensor allocator and more like a specialized KV operating system. It should allocate fixed-size blocks, maintain a compact logical-to-physical map, optimize for cheap append during decode, and keep enough metadata to understand page precision, heat, age, shareability, and eviction status.

This is also the layer where policy gets interesting. Once page temperature and reuse become visible, the scheduler can start forming batches partly by memory locality instead of purely by token count.
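
One way to picture that metadata, plus the kind of locality score a scheduler could batch on. The fields and heuristics below are assumptions made for illustration, not a description of any existing scheduler:

from dataclasses import dataclass

@dataclass
class KVPageMeta:
    page_id: int
    precision: str       # "fp8", "int8", "int4", ...
    resident_tier: str   # "hbm", "dram", "cxl", "ssd"
    last_access: float   # timestamp of the most recent read
    access_count: int    # crude heat counter
    ref_count: int       # how many sequences share this page

def eviction_priority(meta: KVPageMeta, now: float) -> float:
    """Higher means a better candidate for demotion to a colder tier (assumed heuristic)."""
    age = now - meta.last_access
    heat = meta.access_count / (age + 1.0)
    sharing = 1.0 + meta.ref_count       # shared prefixes are expensive to refetch
    return age / (heat * sharing + 1e-9)

def locality_score(resident_pages: set, request_pages: set) -> float:
    """Fraction of a request's pages already resident; one input to batch formation."""
    return len(resident_pages & request_pages) / max(len(request_pages), 1)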

What the Kernel Should Actually Do

The kernel path should feel like FlashAttention extended into a paged and compressed world. A query tile becomes active, the runtime exposes the relevant page mapping, compressed KV blocks are prefetched from HBM, the active tile is unpacked into SRAM and registers, attention math runs, and the temporary tile is discarded.

Ideal decode tile pipeline: query tile (decode step N) → page lookup (logical → physical) → compressed KV page stream from HBM → unpack in SRAM into a tiny local window → fused QKᵀ + V compute → drop tile.

By low-bit KV compression here, I mean the broad family of KV-cache schemes that store state in FP8, INT8, groupwise INT4, or similar compressed forms. The exact codec can vary. The invariant is what matters: keep the KV compressed in HBM, and only inflate the part of it that fits into an on-chip tile.
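
Written as straight-line Python rather than a real kernel, the per-tile loop looks roughly like this. The page table layout, the symmetric per-page INT8 codec, and the online-softmax bookkeeping are all simplifications chosen for the sketch:

import numpy as np

def paged_decode_attention(q, page_table, pages, scales):
    """One decode step over a paged, INT8-compressed KV cache (schematic, not a kernel).

    q:          (kv_heads, head_dim) query for the current token (GQA-collapsed for brevity)
    page_table: list of physical page ids, one per logical block
    pages:      dict page_id -> (K_int8, V_int8), each (block_tokens, kv_heads, head_dim)
    scales:     dict page_id -> per-page dequantization scale (assumed symmetric INT8)
    """
    head_dim = q.shape[-1]
    # Online-softmax running state, kept in "registers" across tiles.
    m = np.full(q.shape[0], -np.inf)            # running max per head
    l = np.zeros(q.shape[0])                    # running normalizer per head
    acc = np.zeros_like(q, dtype=np.float32)    # weighted-V accumulator

    for pid in page_table:                      # page lookup: logical block -> physical page
        k_q, v_q = pages[pid]                   # "prefetch" the compressed page from HBM
        s = scales[pid]
        k = k_q.astype(np.float32) * s          # unpack into the SRAM-sized tile
        v = v_q.astype(np.float32) * s
        logits = np.einsum("hd,thd->ht", q, k) / np.sqrt(head_dim)   # QK^T for this tile
        m_new = np.maximum(m, logits.max(axis=1))
        p = np.exp(logits - m_new[:, None])
        correction = np.exp(m - m_new)
        l = l * correction + p.sum(axis=1)
        acc = acc * correction[:, None] + np.einsum("ht,thd->hd", p, v)
        m = m_new
        # the expanded tile (k, v, logits, p) is dropped here; nothing goes back to HBM
    return acc / l[:, None]

The structural point is the last comment: each tile's expanded K and V exist only for the duration of the loop body, and nothing dequantized is ever written back to HBM.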

Why SRAM Is the Center of Gravity

HBM gets all the attention because it is large and bandwidth-rich, but the most important part of the story is what happens after the bytes leave HBM. If compressed KV is going to pay off, the expanded working set must live briefly in SRAM or registers and then disappear. If it gets materialized back into a giant HBM tensor, the whole point is lost.

Example dequantized tile
tile_tokens = 128
kv_heads    = 8
head_dim    = 128
bytes_per   = 2   # FP16 temporary

# K + V bytes for one decompressed tile
kv_tile_bytes = tile_tokens * kv_heads * head_dim * bytes_per * 2
# = 524,288 bytes ≈ 512 KB

That half-megabyte estimate is for one useful K/V tile before counting query fragments, softmax state, scale metadata, overlap buffers, or safety margin. That is why a future-focused inference accelerator wants something like 512 KB to 1 MB per SM as a comfortable floor, with 1 MB to 2 MB starting to feel genuinely generous for compressed attention.
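
Extending that half-megabyte figure with rough allowances for the rest of the on-chip state; every number after the first line is an assumption for illustration, not a measurement:

kv_tile_fp16   = 524_288              # decompressed K+V tile from the calculation above
staging_int4   = kv_tile_fp16 // 4    # assumed: next tile's compressed pages, prefetched
query_and_out  = 64 * 1024            # assumed: query fragments and output accumulators
softmax_state  = 16 * 1024            # assumed: running max / sum for online softmax
scales_handles = 8 * 1024             # assumed: per-group scales, page handles, control

total = kv_tile_fp16 + staging_int4 + query_and_out + softmax_state + scales_handles
print(f"≈ {total / 1024:.0f} KB per SM")   # ≈ 728 KB with these assumptions

Under those assumptions the budget lands comfortably inside the 512 KB to 1 MB floor, and it is easy to see why 1 MB to 2 MB reads as genuine headroom once tiles get bigger or buffering gets deeper.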

Concrete Hardware Targets

Resource | Desirable target | Why it matters
Per-SM SRAM / shared memory | 512 KB to 1 MB desirable, 2 MB excellent | Room for useful K/V tiles plus overlap and control state
L2 cache | 64 MB to 256 MB class | Hot page metadata and shared prefixes become worth keeping close
HBM bandwidth | HBM3e-class and above, 3 TB/s minimum, 5 TB/s+ ideal | Even compressed decode remains memory-sensitive
KV block geometry | 32 or 64 tokens often attractive | Good balance between metadata cost and scheduling flexibility
Compressed KV format | FP8 / INT8 baseline, INT4 for colder pages | Lets runtime vary precision by age, heat, and reuse
Metadata overhead | Below 2% of total KV capacity | The page table must not become a second memory problem

If the stack is going to span hardware as well as software, this is where the hardware side needs to land. The ideal device wants not just more HBM, but larger SRAM per SM, native low-bit unpack and scale units, better support for block-stream gather, and some lightweight translation assist so page walks do not become the moral equivalent of TLB misses in a hot decode loop.

What Can Go Wrong

Elegant systems stories get stronger when the pain points are explicit. Paging everything sounds beautiful until the tradeoffs arrive.

TLB-like lookup overhead

The moment logical token ranges must be translated into physical blocks, you create the equivalent of a page walk. If translation metadata is cache-cold or awkwardly structured, the kernel can stall on bookkeeping before it reaches useful math.
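
One common mitigation is to flatten each sequence's mapping into a dense array of physical block indices, so translation is a single indexed load rather than a pointer chase. A sketch, assuming 32-token blocks:

import numpy as np

BLOCK_TOKENS = 32

# Dense per-sequence block table: entry i holds the physical page id for
# logical tokens [i * BLOCK_TOKENS, (i + 1) * BLOCK_TOKENS). It is small and
# contiguous, so it can stay cache-resident across the whole decode step.
block_table = np.array([17, 41, 88, 91], dtype=np.int32)

def physical_page(token_index: int) -> int:
    # One divide and one load; no tree walk, no hashing in the hot loop.
    return int(block_table[token_index // BLOCK_TOKENS])

assert physical_page(40) == 41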

Page size versus tile size mismatch

Smaller pages are great for sharing and fragmentation, but they increase translation overhead. Larger pages stream better, but they can mismatch the kernel’s actual tile size and create underused tails.

Compression accuracy versus decode latency

More aggressive KV compression reduces HBM traffic, but raises unpacking, scale-application, and correction costs. If dequantization costs more than the bytes saved, the win evaporates.
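
A toy break-even model makes the tradeoff concrete; the bandwidth and throughput figures below are illustrative assumptions, not measurements of any device:

# Toy break-even: is INT4 KV worth it versus FP16, for one 512 KB tile?
hbm_bw_bytes_s  = 3e12       # assumed 3 TB/s HBM bandwidth
dequant_elems_s = 2e12       # assumed unpack-and-scale throughput, elements/second

tile_bytes_fp16 = 524_288
tile_bytes_int4 = tile_bytes_fp16 // 4
tile_elements   = tile_bytes_fp16 // 2           # FP16 = 2 bytes per element

bytes_saved_time = (tile_bytes_fp16 - tile_bytes_int4) / hbm_bw_bytes_s
dequant_time     = tile_elements / dequant_elems_s

print(f"fetch time saved ≈ {bytes_saved_time * 1e6:.2f} µs, "
      f"dequant cost ≈ {dequant_time * 1e6:.2f} µs")
# With these assumptions both come out around 0.13 µs: slow unpack hardware
# erases the bandwidth win, while fast unpack hardware keeps it.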

Prefix sharing versus isolation

Shared physical pages improve efficiency, but multi-tenant serving then has to think harder about metadata leakage, timing visibility, and safe reuse boundaries.

Fragmentation changes shape

Paging removes the need for giant contiguous buffers, but it does not remove waste altogether. Instead you get internal slack, metadata overhead, and scheduling friction when batches mix very different page formats and ages.

Cold Tiers Are Real, and They Hurt

The memory hierarchy endgame likely extends beyond HBM. Registers hold active fragments. SRAM holds the live decompressed tile. L2 caches metadata and hot prefixes. HBM holds the compressed active working set. Host DRAM or CXL-attached memory becomes a colder KV tier, and SSD-backed archival state may eventually hold the deepest, least urgent context.

But the cold tiers only make sense if the runtime is proactive. Pulling a cold KV page back from DRAM, CXL memory, or SSD into HBM can be far more expensive than a normal decode-step budget wants to tolerate.

A good scheduler should not wait until a request is already at the front of the GPU queue to fetch a cold block. It should predict reuse and start warming likely-needed KV pages upward ahead of time, so cold-tier latency is hidden behind queueing and other useful work.
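
A sketch of what that proactive warming could look like at the scheduler level; the tier latencies, the queue-wait estimate, and the shape of the request and page records are all assumptions made for illustration:

from collections import deque

# Assumed one-way promotion latencies, coarse orders of magnitude only.
PROMOTE_COST_MS = {"dram": 1.0, "cxl": 3.0, "ssd": 100.0}

def warm_ahead(run_queue: deque, resident_pages: set, lookahead: int = 8):
    """Start promoting cold KV pages for requests that will run soon (schematic)."""
    prefetch_plan = []
    for position, request in enumerate(list(run_queue)[:lookahead]):
        for page in request["pages"]:
            tier = page["tier"]
            if page["id"] in resident_pages or tier == "hbm":
                continue
            # Only warm pages whose promotion fits inside the time the request
            # is still expected to spend waiting in the queue (assumed estimate).
            expected_wait_ms = position * request.get("step_ms", 20.0)
            if PROMOTE_COST_MS[tier] <= expected_wait_ms:
                prefetch_plan.append((page["id"], tier))
    return prefetch_plan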

A practical hierarchy for paged KV: registers hold active fragments; SRAM / shared memory holds the live decompressed tile; L2 holds hot metadata and reused prefix pages; HBM holds the compressed active KV working set; host DRAM, CXL, and SSD hold colder or archival pages, warmed upward by policy.

What the Ideal System Would Feel Like

If this stack were fully realized, long-context inference would feel less fragile. The runtime would allocate and share KV pages naturally. The scheduler would batch partly by locality and block reuse, not just token count. The kernels would stream context tile by tile without ever demanding giant contiguous buffers. The hardware would make page fetch, low-bit unpacking, and on-chip staging cheap enough that the software architecture actually holds.

That is the real endgame. PagedAttention gives the system a global memory model. FlashAttention gives it an execution model. Low-bit KV schemes give it a bandwidth model. The right hardware turns those into one coherent stack so the KV cache stops behaving like an awkward inference tax and starts behaving like a real memory subsystem.