The missing fourth action
Traditional cache management offers three actions for a resident page: keep it in fast tier, demote it to a slower tier, or evict it entirely (discarding the data). Every eviction policy — LRU, frequency-based, cost-aware, regret-aware — operates on this three-action space. The implicit assumption is that evicted or demoted data must be re-loaded from wherever it was sent.
But KV-cache pages are not arbitrary data. They are derived from the original token embeddings through a deterministic sequence of attention computations. Any KV page can be reconstructed from the corresponding input tokens by replaying the attention layers. This means there is a fourth action: recompute on demand — discard the page, and when it is next needed, regenerate it from the source tokens rather than fetching it from a slower memory tier.
The cheapest byte to move is the one you never move. For some KV pages, the cheapest eviction policy is to never store them in the first place — and recompute them on demand.
When recompute wins
The recompute-vs-transfer decision depends on a ratio: the compute cost of regenerating a KV page versus the bandwidth cost of fetching it from wherever it was stored. This ratio is not fixed — it varies with model architecture, sequence position, hardware generation, and current system load.
The cost model
For a single KV page covering T tokens across L layers with H attention heads and head dimension D:
- Recompute cost: ~2 × T × L × H × D × D FLOPs (attention QKV projection + softmax + output projection)
- Transfer cost: T × L × H × D × 2 × bytes_per_element / bandwidth (PCIe or NVLink transfer time)

On H100: recomputing a 128-token, single-layer page costs ≈ 50μs; transferring the same page from host RAM over PCIe 5.0 costs ≈ 200μs. Recompute wins by 4×.
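The two formulas can be turned into a small calculator. This is a sketch, not a measured model: the peak-FLOPs figure, the idle fraction, and the PCIe bandwidth are illustrative H100-class placeholders, and real kernels carry launch and link-latency overheads this ignores.

```python
def recompute_us(tokens, layers, heads, head_dim,
                 peak_flops=1.0e15, idle_fraction=0.3):
    """Time to regenerate a KV page from its source tokens, using the
    ~2*T*L*H*D*D FLOP estimate above. Only the idle slice of peak
    compute is assumed usable (placeholder constants, not measurements)."""
    flops = 2 * tokens * layers * heads * head_dim * head_dim
    return flops / (peak_flops * idle_fraction) * 1e6  # seconds -> microseconds

def transfer_us(tokens, layers, heads, head_dim,
                bytes_per_element=2, bandwidth_gb_s=64):
    """Time to fetch the same page over a PCIe-5.0-class link.
    The factor of 2 counts both the K and V tensors."""
    nbytes = tokens * layers * heads * head_dim * 2 * bytes_per_element
    return nbytes / (bandwidth_gb_s * 1e9) * 1e6
```

Under these constants, a 128-token, single-layer page with 32 heads of dimension 128 recomputes in well under its PCIe transfer time; the exact ratio shifts with every constant above, which is the point of the section that follows.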
The critical variable is GPU compute utilization. During autoregressive decode, the GPU is severely memory-bound — the arithmetic units are idle 60–80% of the time, waiting for HBM reads to complete. Those idle FLOPs are free. Using them for KV recomputation converts waste heat into useful work, without contending for the memory bandwidth that is already saturated.
Where the frontier lies
Recompute is cheaper than transfer when:
- Few layers need recomputation. Regenerating KV for 1–4 layers is fast. Regenerating all 32 layers (full-depth recompute) is almost always more expensive than a host-RAM fetch.
- The page is small. Pages covering 64–256 tokens have low recompute cost. Pages covering 4K+ tokens begin to cross over: both costs grow with token count, but a large recompute burst can no longer hide in the idle gaps between memory-bound decode steps, while a transfer proceeds at full link bandwidth regardless of size.
- Compute is idle. During decode steps, GPU compute is underutilized. During prefill, compute is fully occupied — recompute would compete with the main workload.
- The alternative is slow. Fetching from host RAM (PCIe 5.0: ~64 GB/s) is 50× slower than HBM access. Fetching from NVMe is 500× slower. The slower the alternative, the more attractive recompute becomes.
- Reuse is uncertain. If a page has low reuse probability, the cost of storing it (demotion bandwidth + storage + future promotion bandwidth) exceeds the cost of discarding it and recomputing only if it is actually needed.
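The reuse-uncertainty point can be made quantitative with an expected-cost comparison. A minimal sketch, assuming the microsecond inputs come from a cost model like the one above and that reuse probability is known (in practice it must be estimated):

```python
def prefer_discard(p_reuse, recompute_cost_us, demote_cost_us, promote_cost_us):
    """Compare the expected costs of the two fates for a page under pressure.
    Storing pays demotion bandwidth now and promotion bandwidth on reuse;
    discarding pays the recompute cost only if the page is actually needed."""
    expected_store = demote_cost_us + p_reuse * promote_cost_us
    expected_discard = p_reuse * recompute_cost_us
    return expected_discard < expected_store
```

With a 10% reuse probability, even a recompute several times more expensive than a single transfer beats storing, because the demotion cost is paid unconditionally while the recompute cost is paid only on the unlikely reuse path.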
FlashAttention already proved the concept
The recompute-vs-transfer tradeoff is not new in attention computation — FlashAttention demonstrated it at a different scale. Instead of materializing the full N×N attention matrix in HBM and reading it back, FlashAttention recomputes attention scores from Q and K in SRAM, trading ~50% more FLOPs for dramatically less HBM bandwidth. The insight was that HBM bandwidth was scarcer than compute, so burning extra FLOPs to avoid a large materialization was a net win.
The KV-cache recompute-vs-transfer frontier extends this same principle from within a single attention computation to across the memory hierarchy. Instead of asking "should I materialize the attention matrix?", the question becomes "should I even store this KV page, or should I regenerate it from tokens when needed?"
Implications for eviction policy design
If recompute is sometimes cheaper than transfer, then eviction policies need a richer action space. Instead of deciding "which page to evict," the policy decides "which action to take for each page under pressure."
| Page Property | Best Action | Reasoning |
|---|---|---|
| High reuse, small, few layers | Keep | Access cost is low and reuse is likely — not worth evicting |
| Low reuse, small, few layers | Recompute | Cheap to regenerate; storage isn't justified by low reuse |
| High reuse, large, many layers | Demote | Too expensive to recompute; worth the transfer cost given high reuse |
| Low reuse, large, many layers | Evict | Too expensive to recompute or store; accept the quality loss |
| Any page during decode phase | Recompute preferred | Compute is idle during decode; bandwidth is the bottleneck |
| Any page during prefill phase | Demote preferred | Compute is saturated during prefill; bandwidth is available |
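The table reads as a decision procedure. A minimal sketch of that procedure, with boolean page properties and the phase as a tie-breaker for expensive pages — a real policy would compare cost-model outputs rather than booleans, and the thresholds behind "small" and "shallow" are left unspecified here:

```python
def choose_action(high_reuse: bool, small_page: bool, shallow: bool,
                  phase: str = "decode") -> str:
    """Pick an action for one page under memory pressure, following the
    table: size and depth decide whether recompute is feasible, reuse
    decides whether storage pays off, and the serving phase breaks ties."""
    if small_page and shallow:
        # Cheap to regenerate: keep only if reuse justifies residency.
        return "keep" if high_reuse else "recompute"
    if high_reuse:
        # Expensive to regenerate, but during decode the idle compute
        # still tilts the balance toward recompute over a transfer.
        return "recompute" if phase == "decode" else "demote"
    return "evict"  # expensive to recompute *and* unlikely to be reused
```

The phase parameter is what makes this richer than a classical eviction policy: the same page gets a different fate at different points in the serving loop.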
The compression dimension
There is actually a fifth action that sits between "keep at full precision" and "recompute from scratch": compress in place. Quantizing a KV page from fp16 to int4 reduces its footprint by 4× while retaining the data in fast tier — no bandwidth consumed, no recomputation needed, but with a quality cost from quantization error.
This creates a rich action lattice: keep_fp16 → compress_to_int4 → demote_to_host → recompute_on_demand → evict. Each action trades a different resource (capacity, quality, compute, bandwidth) and the optimal choice depends on the page's reuse probability, the current system load, and the quality tolerance of the request.
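One way to navigate the lattice is to give each action a resource-cost vector and weight it by current scarcity. The numbers below are purely illustrative placeholders on an arbitrary 0–1 scale, not measured costs; the structure, not the values, is the point:

```python
# Per-action resource costs, arbitrary 0-1 scale (illustrative only).
ACTIONS = {
    "keep_fp16":        {"capacity": 1.00, "quality": 0.0, "compute": 0.00, "bandwidth": 0.0},
    "compress_to_int4": {"capacity": 0.25, "quality": 0.1, "compute": 0.05, "bandwidth": 0.0},
    "demote_to_host":   {"capacity": 0.00, "quality": 0.0, "compute": 0.00, "bandwidth": 1.0},
    "recompute":        {"capacity": 0.00, "quality": 0.0, "compute": 1.00, "bandwidth": 0.0},
    "evict":            {"capacity": 0.00, "quality": 1.0, "compute": 0.00, "bandwidth": 0.0},
}

def min_cost_action(weights):
    """Pick the action with the lowest weighted resource cost. The weights
    encode current scarcity: raise `compute` during prefill, raise
    `bandwidth` during decode, raise `quality` for strict requests."""
    def total(action):
        return sum(weights[res] * cost for res, cost in ACTIONS[action].items())
    return min(ACTIONS, key=total)
```

Shifting one weight flips the decision: decode-like weights (cheap compute, scarce capacity) select recompute, while prefill-like weights (scarce compute, available bandwidth) select demotion — the same page-level tradeoff the table above expresses in rows.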
What remains hard
- Reuse prediction. The recompute decision depends on whether the page will be re-accessed. Predicting reuse from the access trace alone is the same hard problem that makes eviction policy design difficult in general. Without reuse prediction, the policy must use heuristics (recent access frequency, position in the sequence, attention-score history).
- Compute scheduling. Recompute must be scheduled alongside the main inference workload. On GPUs, this means either interleaving recompute kernels with decode kernels (adding scheduling complexity) or using separate CUDA streams (risking resource contention). The idle-compute argument only holds if the recompute work genuinely fits in the gaps between memory-bound operations.
- Latency variance. Recompute latency depends on current compute load, which varies per decode step. Transfer latency is more predictable (bounded by PCIe bandwidth). SLO-sensitive systems may prefer the predictability of transfer even when recompute is cheaper on average.
- Multi-layer depth. Recomputing KV for a single layer is cheap. But if the required page's KV depends on earlier layers' KV (which it does, in a transformer), then recomputing layer L requires the KV from layers 1 through L-1 to be available. This creates a dependency chain that can make deep recomputation impractical without checkpointing intermediate KV at strategic layer boundaries.
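The checkpointing idea can be stated concretely: with KV retained at a few layer boundaries, rebuilding a page at layer ℓ only requires replaying from the nearest checkpoint at or below ℓ. A sketch (the checkpoint set and indexing convention are hypothetical; layer 0 stands for the always-available token embeddings):

```python
def replay_depth(layer: int, checkpoints: set) -> int:
    """Number of layers that must be re-run to rebuild KV at `layer`,
    given retained KV checkpoints at the layers in `checkpoints`.
    Layer 0 (the token embeddings) is always available."""
    nearest = max((c for c in checkpoints if c <= layer), default=0)
    return layer - nearest
```

Checkpointing every k layers bounds the replay depth at k at the cost of keeping L/k intermediate pages resident — the classic gradient-checkpointing tradeoff, transplanted to inference.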
The recompute-vs-transfer frontier is not a single number — it is a surface that moves with every decode step. The policy that navigates this surface well will define the next generation of memory-efficient long-context inference.
Where this leads
The recompute-vs-transfer frontier suggests that the future of KV-cache management is not "better eviction policies" but "richer action spaces." A policy that can choose between keep, compress, demote, recompute, and evict — and that can reason about the cost of each action under current system conditions — operates on a fundamentally more expressive design surface than one that can only choose "which page to kick out."
This connects to several open research directions: compiler-emitted recompute hints (the compiler knows which layers are cheap to recompute), hardware-assisted recompute scheduling (dedicated recompute engines that run alongside the main datapath), and combined compression-recompute policies that choose the minimum-cost action per page per step.
The broader principle is that compute and bandwidth are fungible under the right conditions, and the memory system should be designed to exploit that fungibility rather than treating them as independent resources.