The missing fourth action
Traditional cache management offers three actions for a resident page: keep it in fast tier, demote it to a slower tier, or evict it entirely (discarding the data). Every eviction policy — LRU, frequency-based, cost-aware, regret-aware — operates on this three-action space. The implicit assumption is that evicted or demoted data must be re-loaded from wherever it was sent.
But KV-cache pages are not arbitrary data. They are derived from the original token embeddings through a deterministic sequence of attention computations. Any KV page can be reconstructed from the corresponding input tokens by replaying the attention layers. This means there is a fourth action: recompute on demand — discard the page, and when it is next needed, regenerate it from the source tokens rather than fetching it from a slower memory tier.
The cheapest byte to move is the one you never move. For some KV pages, the cheapest eviction policy is to never store them in the first place — and recompute them on demand.
When recompute wins
The recompute-vs-transfer decision depends on a ratio: the compute cost of regenerating a KV page versus the bandwidth cost of fetching it from wherever it was stored. This ratio is not fixed — it varies with model architecture, sequence position, hardware generation, and current system load.
The cost model
For a single KV page covering T tokens across L layers with H attention heads and head dimension D:
- Recompute cost: ~2 × T × L × H × D × D FLOPs (attention QKV projection + softmax + output projection)
- Transfer cost: T × L × H × D × 2 × bytes_per_element / bandwidth (PCIe or NVLink transfer time)

On H100: recomputing a 128-token, single-layer page costs ≈ 50μs; transferring the same page from host RAM over PCIe 5.0 costs ≈ 200μs. Recompute wins by 4×.
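The two formulas can be turned into a small calculator. This is a sketch, not a measured model: the peak-FLOPs figure, the idle fraction, and the PCIe bandwidth are illustrative H100-class placeholders, and real kernels carry launch and link-latency overheads this ignores.

```python
def recompute_us(tokens, layers, heads, head_dim,
                 peak_flops=1.0e15, idle_fraction=0.3):
    """Time to regenerate a KV page from its source tokens, using the
    ~2*T*L*H*D*D FLOP estimate above. Only the idle slice of peak
    compute is assumed usable (placeholder constants, not measurements)."""
    flops = 2 * tokens * layers * heads * head_dim * head_dim
    return flops / (peak_flops * idle_fraction) * 1e6  # seconds -> microseconds

def transfer_us(tokens, layers, heads, head_dim,
                bytes_per_element=2, bandwidth_gb_s=64):
    """Time to fetch the same page over a PCIe-5.0-class link.
    The factor of 2 counts both the K and V tensors."""
    nbytes = tokens * layers * heads * head_dim * 2 * bytes_per_element
    return nbytes / (bandwidth_gb_s * 1e9) * 1e6
```

Under these constants, a 128-token, single-layer page with 32 heads of dimension 128 recomputes in well under its PCIe transfer time; the exact ratio shifts with every constant above, which is the point of the section that follows.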
The critical variable is GPU compute utilization. During autoregressive decode, the GPU is severely memory-bound — the arithmetic units are idle 60–80% of the time, waiting for HBM reads to complete. Those idle FLOPs are free. Using them for KV recomputation converts waste heat into useful work, without contending for the memory bandwidth that is already saturated.
Where the frontier lies
Recompute is cheaper than transfer when:
- Few layers need recomputation. Regenerating KV for 1–4 layers is fast. Regenerating all 32 layers (full-depth recompute) is almost always more expensive than a host-RAM fetch.
- The page is small. Pages covering 64–256 tokens have low recompute cost. Pages covering 4K+ tokens begin to cross over: both costs grow with token count, but a large recompute burst can no longer hide in the idle gaps between memory-bound decode steps, while a transfer proceeds at full link bandwidth regardless of size.
- Compute is idle. During decode steps, GPU compute is underutilized. During prefill, compute is fully occupied — recompute would compete with the main workload.
- The alternative is slow. Fetching from host RAM (PCIe 5.0: ~64 GB/s) is 50× slower than HBM access. Fetching from NVMe is 500× slower. The slower the alternative, the more attractive recompute becomes.
- Reuse is uncertain. If a page has low reuse probability, the cost of storing it (demotion bandwidth + storage + future promotion bandwidth) exceeds the cost of discarding it and recomputing only if it is actually needed.
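The reuse-uncertainty point can be made quantitative with an expected-cost comparison. A minimal sketch, assuming the microsecond inputs come from a cost model like the one above and that reuse probability is known (in practice it must be estimated):

```python
def prefer_discard(p_reuse, recompute_cost_us, demote_cost_us, promote_cost_us):
    """Compare the expected costs of the two fates for a page under pressure.
    Storing pays demotion bandwidth now and promotion bandwidth on reuse;
    discarding pays the recompute cost only if the page is actually needed."""
    expected_store = demote_cost_us + p_reuse * promote_cost_us
    expected_discard = p_reuse * recompute_cost_us
    return expected_discard < expected_store
```

With a 10% reuse probability, even a recompute several times more expensive than a single transfer beats storing, because the demotion cost is paid unconditionally while the recompute cost is paid only on the unlikely reuse path.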
FlashAttention already proved the concept
The recompute-vs-transfer tradeoff is not new in attention computation — FlashAttention demonstrated it at a different scale. Instead of materializing the full N×N attention matrix in HBM and reading it back, FlashAttention recomputes attention scores from Q and K in SRAM, trading ~50% more FLOPs for dramatically less HBM bandwidth. The insight was that HBM bandwidth was scarcer than compute, so burning extra FLOPs to avoid a large materialization was a net win.
The KV-cache recompute-vs-transfer frontier extends this same principle from within a single attention computation to across the memory hierarchy. Instead of asking "should I materialize the attention matrix?", the question becomes "should I even store this KV page, or should I regenerate it from tokens when needed?"
Implications for eviction policy design
If recompute is sometimes cheaper than transfer, then eviction policies need a richer action space. Instead of deciding "which page to evict," the policy decides "which action to take for each page under pressure."
| Page Property | Best Action | Reasoning |
|---|---|---|
| High reuse, small, few layers | Keep | Access cost is low and reuse is likely — not worth evicting |
| Low reuse, small, few layers | Recompute | Cheap to regenerate; storage isn't justified by low reuse |
| High reuse, large, many layers | Demote | Too expensive to recompute; worth the transfer cost given high reuse |
| Low reuse, large, many layers | Evict | Too expensive to recompute or store; accept the quality loss |
| Any page during decode phase | Recompute preferred | Compute is idle during decode; bandwidth is the bottleneck |
| Any page during prefill phase | Demote preferred | Compute is saturated during prefill; bandwidth is available |
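The table reads as a decision procedure. A minimal sketch of that procedure, with boolean page properties and the phase as a tie-breaker for expensive pages — a real policy would compare cost-model outputs rather than booleans, and the thresholds behind "small" and "shallow" are left unspecified here:

```python
def choose_action(high_reuse: bool, small_page: bool, shallow: bool,
                  phase: str = "decode") -> str:
    """Pick an action for one page under memory pressure, following the
    table: size and depth decide whether recompute is feasible, reuse
    decides whether storage pays off, and the serving phase breaks ties."""
    if small_page and shallow:
        # Cheap to regenerate: keep only if reuse justifies residency.
        return "keep" if high_reuse else "recompute"
    if high_reuse:
        # Expensive to regenerate, but during decode the idle compute
        # still tilts the balance toward recompute over a transfer.
        return "recompute" if phase == "decode" else "demote"
    return "evict"  # expensive to recompute *and* unlikely to be reused
```

The phase parameter is what makes this richer than a classical eviction policy: the same page gets a different fate at different points in the serving loop.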
The compression dimension
There is actually a fifth action that sits between "keep at full precision" and "recompute from scratch": compress in place. Quantizing a KV page from fp16 to int4 reduces its footprint by 4× while retaining the data in fast tier — no bandwidth consumed, no recomputation needed, but with a quality cost from quantization error.
This creates a rich action lattice: keep_fp16 → compress_to_int4 → demote_to_host → recompute_on_demand → evict. Each action trades a different resource (capacity, quality, compute, bandwidth) and the optimal choice depends on the page's reuse probability, the current system load, and the quality tolerance of the request.
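One way to navigate the lattice is to give each action a resource-cost vector and weight it by current scarcity. The numbers below are purely illustrative placeholders on an arbitrary 0–1 scale, not measured costs; the structure, not the values, is the point:

```python
# Per-action resource costs, arbitrary 0-1 scale (illustrative only).
ACTIONS = {
    "keep_fp16":        {"capacity": 1.00, "quality": 0.0, "compute": 0.00, "bandwidth": 0.0},
    "compress_to_int4": {"capacity": 0.25, "quality": 0.1, "compute": 0.05, "bandwidth": 0.0},
    "demote_to_host":   {"capacity": 0.00, "quality": 0.0, "compute": 0.00, "bandwidth": 1.0},
    "recompute":        {"capacity": 0.00, "quality": 0.0, "compute": 1.00, "bandwidth": 0.0},
    "evict":            {"capacity": 0.00, "quality": 1.0, "compute": 0.00, "bandwidth": 0.0},
}

def min_cost_action(weights):
    """Pick the action with the lowest weighted resource cost. The weights
    encode current scarcity: raise `compute` during prefill, raise
    `bandwidth` during decode, raise `quality` for strict requests."""
    def total(action):
        return sum(weights[res] * cost for res, cost in ACTIONS[action].items())
    return min(ACTIONS, key=total)
```

Shifting one weight flips the decision: decode-like weights (cheap compute, scarce capacity) select recompute, while prefill-like weights (scarce compute, available bandwidth) select demotion — the same page-level tradeoff the table above expresses in rows.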
What remains hard
- Reuse prediction. The recompute decision depends on whether the page will be re-accessed. Predicting reuse from the access trace alone is the same hard problem that makes eviction policy design difficult in general. Without reuse prediction, the policy must use heuristics (recent access frequency, position in the sequence, attention-score history).
- Compute scheduling. Recompute must be scheduled alongside the main inference workload. On GPUs, this means either interleaving recompute kernels with decode kernels (adding scheduling complexity) or using separate CUDA streams (risking resource contention). The idle-compute argument only holds if the recompute work genuinely fits in the gaps between memory-bound operations.
- Latency variance. Recompute latency depends on current compute load, which varies per decode step. Transfer latency is more predictable (bounded by PCIe bandwidth). SLO-sensitive systems may prefer the predictability of transfer even when recompute is cheaper on average.
- Multi-layer depth. Recomputing KV for a single layer is cheap. But if the required page's KV depends on earlier layers' KV (which it does, in a transformer), then recomputing layer L requires the KV from layers 1 through L-1 to be available. This creates a dependency chain that can make deep recomputation impractical without checkpointing intermediate KV at strategic layer boundaries.
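The checkpointing idea can be stated concretely: with KV retained at a few layer boundaries, rebuilding a page at layer ℓ only requires replaying from the nearest checkpoint at or below ℓ. A sketch (the checkpoint set and indexing convention are hypothetical; layer 0 stands for the always-available token embeddings):

```python
def replay_depth(layer: int, checkpoints: set) -> int:
    """Number of layers that must be re-run to rebuild KV at `layer`,
    given retained KV checkpoints at the layers in `checkpoints`.
    Layer 0 (the token embeddings) is always available."""
    nearest = max((c for c in checkpoints if c <= layer), default=0)
    return layer - nearest
```

Checkpointing every k layers bounds the replay depth at k at the cost of keeping L/k intermediate pages resident — the classic gradient-checkpointing tradeoff, transplanted to inference.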
The recompute-vs-transfer frontier is not a single number — it is a surface that moves with every decode step. The policy that navigates this surface well will define the next generation of memory-efficient long-context inference.
Where this leads
The recompute-vs-transfer frontier suggests that the future of KV-cache management is not "better eviction policies" but "richer action spaces." A policy that can choose between keep, compress, demote, recompute, and evict — and that can reason about the cost of each action under current system conditions — operates on a fundamentally more expressive design surface than one that can only choose "which page to kick out."
This connects to several open research directions: compiler-emitted recompute hints (the compiler knows which layers are cheap to recompute), hardware-assisted recompute scheduling (dedicated recompute engines that run alongside the main datapath), and combined compression-recompute policies that choose the minimum-cost action per page per step.
The broader principle is that compute and bandwidth are fungible under the right conditions, and the memory system should be designed to exploit that fungibility rather than treating them as independent resources.