Memory Systems · Inference Architecture · Compute-Memory Tradeoffs

The Recompute-vs-Transfer Frontier

Published · 9 min read

When redoing work is cheaper than moving bytes — and why eviction policies should know the difference between pages worth storing and pages worth recomputing.

By Manish KL · Technical Essay
Abstract

The memory hierarchy literature assumes that every evicted piece of data will eventually be re-loaded from a slower tier. But for KV-cache pages in long-context inference, there is a fourth option beyond keep, demote, and evict: recompute. Rerunning a few attention layers from the original token embeddings can sometimes cost less than the bandwidth penalty of fetching demoted pages from host RAM — especially when the GPU has spare compute cycles during memory-bound decode. This essay examines the recompute-vs-transfer frontier: the decision surface where compute cost, bandwidth availability, reuse probability, and tier latency determine whether a page is worth storing at all.

- **4** — actions in a residency-aware policy: keep, demote, evict, recompute
- **~50μs** — recompute cost for one KV page (128 tokens, 1 layer) on H100
- **~200μs** — host RAM round-trip for the same page via PCIe 5.0
- **>60%** — GPU compute idle during memory-bound decode steps

The missing fourth action

Traditional cache management offers three actions for a resident page: keep it in the fast tier, demote it to a slower tier, or evict it entirely, discarding the data. Every eviction policy, whether LRU, frequency-based, cost-aware, or regret-aware, operates on this three-action space. The implicit assumption is that demoted data must be fetched back from wherever it was sent, and that evicted data is gone for good.

But KV-cache pages are not arbitrary data. They are derived from the original token embeddings through a deterministic sequence of attention computations. Any KV page can be reconstructed from the corresponding input tokens by replaying the attention layers. This means there is a fourth action: recompute on demand — discard the page, and when it is next needed, regenerate it from the source tokens rather than fetching it from a slower memory tier.

The cheapest byte to move is the one you never move. For some KV pages, the cheapest eviction policy is to never store them in the first place — and recompute them on demand.
[Figure: the four-action residency decision. KEEP: page stays in GPU HBM, ~1μs access, occupies fast-tier capacity. DEMOTE: page moves to host RAM, PCIe write plus later read, ~200μs bandwidth-limited re-access. EVICT: page discarded entirely, re-access impossible, quality degradation. RECOMPUTE: page discarded and rebuilt on demand, ~50μs compute-bound re-access, uses idle compute and no bandwidth. Frontier rule: if recompute_cost(page) < transfer_cost(page) and GPU compute is available, discard and recompute on demand.]
Figure 1. The four-action residency decision. Traditional policies operate on keep/demote/evict. Recompute adds a fourth path that trades idle GPU compute for bandwidth — useful when compute is abundant and bandwidth is scarce.

When recompute wins

The recompute-vs-transfer decision depends on a ratio: the compute cost of regenerating a KV page versus the bandwidth cost of fetching it from wherever it was stored. This ratio is not fixed — it varies with model architecture, sequence position, hardware generation, and current system load.

The cost model

For a single KV page covering T tokens across L layers with H attention heads and head dimension D:

Recompute cost: ~2 × T × L × H × D × D FLOPs (attention QKV projection + softmax + output projection)
Transfer cost: T × L × H × D × 2 × bytes_per_element / bandwidth (PCIe or NVLink transfer time)

On H100: recompute a 128-token, single-layer page ≈ 50μs. Transfer the same page from host RAM via PCIe 5.0 ≈ 200μs. Recompute wins by 4×.
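The cost model above can be sketched numerically. The constants below (H100 fp16 peak throughput, a 30% attainable-FLOPs factor, ~64 GB/s effective PCIe 5.0 bandwidth, and the head count and head dimension) are illustrative assumptions, not measured values; real recompute costs also include kernel-launch and scheduling overheads that this sketch omits, which is why it lands well under the ~50μs figure quoted above while preserving the direction of the comparison.

```python
def recompute_cost_us(T, L, H, D, flops_per_s=9.89e14, efficiency=0.3):
    """Time to regenerate a KV page by replaying attention (FLOP model above)."""
    flops = 2 * T * L * H * D * D        # QKV projection + softmax + output
    return flops / (flops_per_s * efficiency) * 1e6

def transfer_cost_us(T, L, H, D, bytes_per_elem=2, bandwidth_bytes_per_s=64e9):
    """Time to fetch the same page from host RAM over PCIe (bandwidth model)."""
    nbytes = T * L * H * D * 2 * bytes_per_elem   # factor 2: K and V tensors
    return nbytes / bandwidth_bytes_per_s * 1e6

# A 128-token, single-layer page with 32 heads of dim 128 (assumed dims):
r = recompute_cost_us(128, 1, 32, 128)
t = transfer_cost_us(128, 1, 32, 128)
# under this simplified model, recompute comes out cheaper than the PCIe fetch
```

Plugging in different page sizes and layer counts moves the two curves relative to each other, which is exactly the frontier explored below.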

The critical variable is GPU compute utilization. During autoregressive decode, the GPU is severely memory-bound: the arithmetic units sit idle 60–80% of the time, waiting for HBM reads to complete. Those idle FLOPs are effectively free. Spending them on KV recomputation turns idle silicon into useful work, without contending for the memory bandwidth that is already saturated.
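A back-of-envelope arithmetic-intensity check makes the memory-bound claim concrete. The model dimension and the H100 peak rates below are illustrative assumptions; real decode batches several requests, which raises intensity, but it typically stays far below the machine balance.

```python
def decode_arithmetic_intensity(d_model=4096, bytes_per_elem=2):
    """FLOPs per byte for a single-token matvec against one weight matrix."""
    flops = 2 * d_model * d_model                 # one multiply-add per weight
    bytes_read = d_model * d_model * bytes_per_elem
    return flops / bytes_read                     # = 1.0 FLOP/byte at fp16

# H100 (assumed peaks): ~989 TFLOPS fp16 vs ~3.35 TB/s HBM bandwidth
machine_balance = 989e12 / 3.35e12                # ~295 FLOP/byte to stay busy
compute_bound_fraction = decode_arithmetic_intensity() / machine_balance
# single-request decode keeps only a tiny fraction of the ALUs busy;
# the rest of the arithmetic capacity is available for recomputation
```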

Where the frontier lies

Recompute is cheaper than transfer when:

- the page covers few tokens (64–256);
- only 1–4 layers need recomputation;
- GPU compute utilization is below ~40% (typical of the decode phase);
- the alternative is a slow transfer path (PCIe or NVMe).

[Figure: cost (μs) vs. page size (tokens), with recompute and transfer curves and their crossover marked.]
Figure 2. Left: recompute cost grows linearly with page size while transfer cost is bandwidth-limited (flatter curve). Below the crossover, recompute is cheaper. Right: conditions that shift the crossover in recompute's favor.

FlashAttention already proved the concept

The recompute-vs-transfer tradeoff is not new in attention computation — FlashAttention demonstrated it at a different scale. Instead of materializing the full N×N attention matrix in HBM and reading it back, FlashAttention recomputes attention scores from Q and K in SRAM, trading ~50% more FLOPs for dramatically less HBM bandwidth. The insight was that HBM bandwidth was scarcer than compute, so burning extra FLOPs to avoid a large materialization was a net win.

The KV-cache recompute-vs-transfer frontier extends this same principle from within a single attention computation to across the memory hierarchy. Instead of asking "should I materialize the attention matrix?", the question becomes "should I even store this KV page, or should I regenerate it from tokens when needed?"

The FlashAttention parallel: FlashAttention trades FLOPs for HBM bandwidth within a single kernel. Recompute-vs-transfer trades FLOPs for PCIe/NVLink bandwidth across the memory hierarchy. Both exploit the same asymmetry: bandwidth is scarcer than compute in modern GPU systems.

Implications for eviction policy design

If recompute is sometimes cheaper than transfer, then eviction policies need a richer action space. Instead of deciding "which page to evict," the policy decides "which action to take for each page under pressure."

| Page property | Best action | Reasoning |
| --- | --- | --- |
| High reuse, small, few layers | Keep | Access cost is low and reuse is likely — not worth evicting |
| Low reuse, small, few layers | Recompute | Cheap to regenerate; storage isn't justified by low reuse |
| High reuse, large, many layers | Demote | Too expensive to recompute; worth the transfer cost given high reuse |
| Low reuse, large, many layers | Evict | Too expensive to recompute or store; accept the quality loss |
| Any page during decode phase | Recompute preferred | Compute is idle during decode; bandwidth is the bottleneck |
| Any page during prefill phase | Demote preferred | Compute is saturated during prefill; bandwidth is available |
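The table reads naturally as a small decision function. The sketch below is illustrative, not a reference implementation: the boolean `recompute_cheap` stands in for the "small, few layers" vs. "large, many layers" distinction, and the 0.5 reuse threshold is an assumed cutoff.

```python
def best_action(reuse_prob, recompute_cheap, phase):
    """Map the page-property table to an action.

    recompute_cheap: True for small pages spanning few layers.
    phase: "decode" (compute idle) or "prefill" (compute saturated).
    """
    if phase == "decode" and recompute_cheap:
        return "recompute"                  # idle FLOPs, saturated bandwidth
    if phase == "prefill" and not recompute_cheap:
        return "demote"                     # compute busy, PCIe available
    if reuse_prob >= 0.5:                   # assumed high-reuse threshold
        return "keep" if recompute_cheap else "demote"
    return "recompute" if recompute_cheap else "evict"
```

Each branch corresponds to a row of the table; the phase checks come first because, as the table notes, system phase can override the per-page properties.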

The compression dimension

There is actually a fifth action that sits between "keep at full precision" and "recompute from scratch": compress in place. Quantizing a KV page from fp16 to int4 reduces its footprint by 4× while retaining the data in fast tier — no bandwidth consumed, no recomputation needed, but with a quality cost from quantization error.

This creates a rich action lattice: keep_fp16 → compress_to_int4 → demote_to_host → recompute_on_demand → evict. Each action trades a different resource (capacity, quality, compute, bandwidth) and the optimal choice depends on the page's reuse probability, the current system load, and the quality tolerance of the request.
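One way to operationalize the lattice is to price every action in a common unit, expected microseconds of cost, and take the minimum. All constants below (the capacity price, the quality penalty, the idle-compute discount) are invented for illustration; a real policy would calibrate them against measured system state.

```python
from dataclasses import dataclass

@dataclass
class Page:
    size_bytes: int
    reuse_prob: float       # probability the page is accessed again
    recompute_us: float     # cost to rebuild from source tokens
    transfer_us: float      # cost of a host-RAM round trip
    quant_error: float      # quality penalty of int4, arbitrary units

def min_cost_action(p, capacity_price_us_per_byte=1e-5,
                    quality_price_us=100.0, compute_idle=True):
    """Score each lattice action in comparable expected-μs and pick the min."""
    costs = {
        "keep_fp16":     p.size_bytes * capacity_price_us_per_byte,
        "compress_int4": p.size_bytes / 4 * capacity_price_us_per_byte
                         + p.quant_error * quality_price_us,
        "demote_host":   p.reuse_prob * p.transfer_us,
        "recompute":     p.reuse_prob * p.recompute_us
                         * (0.2 if compute_idle else 1.0),  # idle-FLOP discount
        "evict":         p.reuse_prob * quality_price_us,   # permanent loss
    }
    return min(costs, key=costs.get)
```

With these toy prices, a small page with low reuse and cheap recomputation lands on `recompute`, while a hot page that is expensive to rebuild stays resident: the lattice collapses to a single argmin per page per step.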

What remains hard

The recompute-vs-transfer frontier is not a single number: it is a surface that shifts with every decode step as batch composition, phase, and bandwidth contention change. The policy that navigates this surface well will define the next generation of memory-efficient long-context inference.

Where this leads

The recompute-vs-transfer frontier suggests that the future of KV-cache management is not "better eviction policies" but "richer action spaces." A policy that can choose between keep, compress, demote, recompute, and evict — and that can reason about the cost of each action under current system conditions — operates on a fundamentally more expressive design surface than one that can only choose "which page to kick out."

This connects to several open research directions: compiler-emitted recompute hints (the compiler knows which layers are cheap to recompute), hardware-assisted recompute scheduling (dedicated recompute engines that run alongside the main datapath), and combined compression-recompute policies that choose the minimum-cost action per page per step.

The broader principle is that compute and bandwidth are fungible under the right conditions, and the memory system should be designed to exploit that fungibility rather than treating them as independent resources.