Why eviction policy matters for long-context inference
Modern long-context inference systems — serving 128K–1M token sequences — face a fundamental resource problem. KV caches grow linearly with sequence length and must be served from the memory tier closest to the accelerator. When they exceed fast-tier capacity, something must move. The question is what, and the answer to "what" determines both latency and bandwidth cost.
Most serving stacks use LRU or simple frequency-based eviction. These work well when the working set fits comfortably, but collapse under pressure. Adversarial access patterns — retrieval bursts that revisit recently-evicted pages, multi-turn conversations that alternate hot-and-cold segments — expose the weakness of stateless eviction: the system has no memory of its own mistakes.
Recent work increasingly explores eviction strategies that go beyond recency and frequency. Attention-score-guided retention, learned reuse predictors, and future-aware cache management all attempt to reason about which KV entries will matter later. Regret-aware eviction sits in this design space as a deliberately lightweight heuristic: rather than predicting future access with a learned model, it uses the observable signal of fast re-access after eviction as a cheap proxy for future miss cost. It is not state-of-the-art predictive caching — it is a minimal mechanism for learning from eviction mistakes without requiring attention-pattern analysis or model introspection.
The goal is not to "solve KV cache inference" in one repo. The goal is to provide a trace-driven, reproducible policy evaluation framework that systems engineers can extend, question, and calibrate.
The regret-aware eviction mechanism
The core idea is simple: if a page is re-accessed shortly after being evicted, the original eviction was probably wrong. The policy remembers this mistake and uses it to protect that page from future eviction.
The intuition
Every eviction is a bet. The policy bets that the evicted page won't be needed soon. When that bet fails — the page is re-accessed shortly after removal, forcing a costly re-promotion from a slower tier — the system pays a transfer and latency penalty that could have been avoided. Regret-aware eviction treats these failures as evidence. A fast re-access after eviction approximates the downstream cost of that eviction decision: the shorter the gap between eviction and re-access, the higher the "regret" — and the more the policy should resist evicting that page again in the future.
This is not a prediction of future access. It is closer to a feedback signal: the policy observes its own mistakes and adjusts. In reinforcement-learning terms, regret acts as a lightweight reward signal that biases the eviction scoring function toward protecting pages with demonstrated re-access demand. The mechanism is deliberately reactive rather than predictive — no model weights, no attention-score inspection, no external signals beyond the trace itself.
How it works
- Eviction Tracking: when a page is evicted from any tier, the current logical step is recorded in an `eviction_step` dictionary.
- Regret Detection: on access, if the page was recently evicted within a configurable `regret_horizon` (default: 24 steps), a regret penalty is calculated. Shorter gaps produce higher penalties — a page re-accessed 2 steps after eviction gets much more regret than one re-accessed 20 steps later.
- Decay: existing regret scores decay slightly (default: 0.98×) on each subsequent access, allowing stale regret to fade rather than accumulate forever.
- Scoring: the eviction candidate is selected by minimizing a composite score: `freq_weight × frequency + recency_weight × normalized_recency + regret_weight × regret`. High regret anchors a page, making it expensive to evict.
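The bookkeeping above can be sketched in a few dozen lines. This is an illustrative sketch with assumed method names (`on_evict`, `on_access`, `eviction_candidate`) and a linear regret penalty — not the repo's actual `RegretAwarePolicy` implementation:

```python
class RegretAwareSketch:
    """Minimal sketch of regret-aware eviction scoring (illustrative only)."""

    def __init__(self, regret_weight=6.0, regret_horizon=24,
                 freq_weight=1.0, recency_weight=1.0, decay=0.98):
        self.regret_weight = regret_weight
        self.regret_horizon = regret_horizon
        self.freq_weight = freq_weight
        self.recency_weight = recency_weight
        self.decay = decay
        self.eviction_step = {}  # page -> step at which it was last evicted
        self.freq = {}           # page -> access count
        self.last_access = {}    # page -> last access step
        self.regret = {}         # page -> accumulated regret score

    def on_evict(self, page, step):
        # Every eviction is a bet; remember when it was placed.
        self.eviction_step[page] = step

    def on_access(self, page, step):
        self.freq[page] = self.freq.get(page, 0) + 1
        self.last_access[page] = step
        # Decay: stale regret fades rather than accumulating forever.
        for p in self.regret:
            self.regret[p] *= self.decay
        # Regret detection: a fast re-access after eviction earns a penalty,
        # larger for shorter eviction-to-reaccess gaps.
        evicted_at = self.eviction_step.pop(page, None)
        if evicted_at is not None:
            gap = step - evicted_at
            if gap <= self.regret_horizon:
                penalty = (self.regret_horizon - gap) / self.regret_horizon
                self.regret[page] = self.regret.get(page, 0.0) + penalty

    def eviction_candidate(self, resident_pages, step):
        """Evict the resident page with the lowest composite score."""
        def score(p):
            normalized_recency = self.last_access.get(p, 0) / max(step, 1)
            return (self.freq_weight * self.freq.get(p, 0)
                    + self.recency_weight * normalized_recency
                    + self.regret_weight * self.regret.get(p, 0.0))
        return min(resident_pages, key=score)
```

With the default weight of 6.0, a single fast re-access is enough to push a page well above an otherwise comparable neighbor, which is exactly the "anchoring" effect described above.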
Ablation study: isolating the regret signal
The central experimental question is: when does the regret signal actually help, and when does it add nothing? To answer this cleanly, we run a dedicated ablation study that sweeps over regret parameters while holding everything else fixed.
Methodology
- 3 independent seeds per ablation point — all results report mean ± std.
- Fresh policy instance per run — no state leakage between benchmark configurations.
- Prefetch disabled — isolates pure eviction behavior from speculative traffic noise.
- Tight capacity budgets — medium (1/3 base) and constrained (1/6 base) fast-tier capacity to force real eviction pressure.
- Longer traces — 512–640 steps with larger working sets to give policies enough history to diverge.
| Parameter | Values Swept | Purpose |
|---|---|---|
| regret_weight | 1.0, 6.0, 12.0 | How strongly regret anchors a page |
| regret_horizon | 8, 24, 64 steps | How far back the policy remembers evictions |
| Scenario | medium, constrained | Fast-tier capacity pressure |
| Workload | chat_continuation, periodic_reuse, adversarial_burst | Access pattern structure |
| Baselines | LRU, cost_aware | Stateless and size-aware comparisons |
Results: where regret helps
Adversarial bursts under medium capacity
This workload alternates between early, middle, and late page sets — evicting pages and then immediately revisiting them. It is specifically designed to punish stateless eviction.
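The rotating-group structure can be approximated with a few lines. This is a hypothetical sketch of such a trace, not the repo's actual workload generator:

```python
def adversarial_burst(num_pages=30, cycles=8, burst_len=12):
    """Cycle bursts through early/middle/late thirds of the page set, so
    each group returns shortly after a stateless policy would evict it.
    Illustrative sketch; parameters and shape are assumptions."""
    third = num_pages // 3
    groups = [range(0, third),
              range(third, 2 * third),
              range(2 * third, num_pages)]
    trace = []
    for c in range(cycles):
        group = groups[c % 3]  # rotate early -> middle -> late -> early ...
        for i in range(burst_len):
            trace.append(group[i % len(group)])
    return trace
```

Under a fast tier that holds roughly one group, each rotation forces eviction of the group that is about to come back — the pattern regret-aware eviction is designed to catch.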
Under medium capacity, LRU averaged 2.48 ± 0.03 ms latency with a 0% fast-tier hit rate. The best regret-aware configuration (w=12.0, h=24) achieved 2.31 ± 0.03 ms — a 6.6% latency reduction — with 9.1% fewer bytes moved. The cost_aware baseline managed 2.37 ms, so regret-aware outperformed both.
Adversarial bursts under constrained capacity
Under severely constrained capacity (1/6 base), latencies climb sharply. LRU hit 6.15 ± 0.10 ms. The best regret configuration (w=12.0, h=64) cut bytes moved by 7.1% and latency by 6.0%. Longer horizons helped here because the adversarial trace structure specifically revisits recently-evicted pages — longer memory means more regret accumulation on the exact pages that need protection.
Horizon sensitivity
Under adversarial pressure, longer horizons (64 steps) consistently outperformed short horizons (8 steps) at the same weight. This makes intuitive sense: the adversarial workload revisits pages at gaps of 12–20 steps. A horizon of 8 simply cannot see those returns, so no regret is accumulated and the policy degrades to frequency+recency.
Results: where regret fails
Documenting failure cases is as important as documenting wins. The ablation reveals three distinct failure modes, each illuminating a different structural limitation of the regret mechanism.
Failure mode 1: deterministic sweeps (periodic reuse)
On periodic_reuse under both capacity levels, every regret-aware variant performed identically to LRU — 0% fast-tier hit rate, 640 misses, identical bytes moved. The periodic structure rotates through pages in a fixed cycle that completely overwhelms the fast tier. No finite regret horizon can adapt to a deterministic sweep that always exceeds capacity. Regret accumulates on every page equally, so the signal becomes uniform noise rather than a useful ranking.
Only cost_aware (which penalizes large pages, allowing more small pages to fit) salvaged any hits — achieving 12.5% hit rate under constrained capacity. This is significant: the only policy that helped was the one operating on a different signal dimension entirely (page size), not a smarter version of temporal reasoning.
Failure mode 2: overprotection under extreme pressure
Regret anchoring can backfire. Under constrained capacity on chat_continuation, the w=1.0 variants achieved only a 53% hit rate versus LRU's 66%. What happened? With a low fast-tier budget, regret anchoring caused the policy to overprotect a handful of previously-important pages that were no longer being accessed, preventing their eviction and blocking newer, genuinely hot pages from entering the fast tier. Stronger weights (w=12.0) partially corrected this, but the lesson is clear: regret without decay calibration can ossify the cache.
Failure mode 3: no eviction pressure (chat continuation, medium)
On purely recency-heavy traces where the working set fits comfortably, LRU achieved 0.23 ms latency versus regret-aware's 0.25 ms. The regret mechanism added slight overhead without benefit — there are simply no regret-inducing eviction-then-re-access patterns when pages rarely get evicted. This is the most benign failure: the policy defaults to recency-like behavior with a small constant-time overhead per access. Under constrained capacity where the working set no longer fits, regret-aware variants with high weights (w=12.0) did recover and approach LRU's hit rate.
What the baselines teach
Every eviction heuristic answers a different question about a page's value. The comparison is not about declaring a winner — it is about understanding what signal each policy uses and what it is structurally blind to.
| Policy | Primary Signal | What It Captures | Structural Blindspot |
|---|---|---|---|
| lru | Last access time | Temporal locality; optimal when the working set fits in fast tier | No memory of past mistakes; blind to page size; thrashes when capacity is tight |
| heavy_hitter | Access frequency | Long-term popularity; protects consistently hot pages | Slow to adapt to phase changes; frequency ≠ future reuse |
| cost_aware | Frequency + recency + page size | Multi-signal composite; penalizes large pages, keeping more small ones resident | Does not learn from eviction outcomes; fixed weight balance |
| regret_aware | Frequency + recency + eviction feedback | Adapts to its own mistakes in real time; anchors pages with demonstrated re-access demand | Reactive only — cannot help on deterministic sweeps; adds state overhead; neutral when no evictions occur |
The key distinction is that LRU, heavy_hitter, and cost_aware operate on access statistics alone — they never observe the consequences of their eviction decisions. Regret-aware eviction closes that feedback loop: it watches what happens after an eviction and adjusts accordingly. This is the mechanism's entire contribution — and also its limitation, since the feedback is only useful when evictions are both frequent and consequential.
Quantization changes the game
One of the most important findings from the default sweep — not the ablation — is that quantization often dominates policy differences. Under the same small tier budget, switching from fp16 to int4 on a RAG burst workload raised overall hit rate from 0.459 to 0.771 and reduced bytes moved by 93.9%. No policy tuning produces that kind of gain.
This means that in practice, the first lever to pull is often footprint reduction, not eviction sophistication. Regret-aware eviction matters most when quantization has already been applied and residency pressure remains.
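The dominance of footprint reduction follows directly from the arithmetic. A back-of-envelope calculation (the model shape below is an illustrative assumption, and scale/zero-point overhead for int4 is ignored):

```python
def kv_cache_bytes(tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2.0):
    """KV footprint: 2 tensors (K and V) per layer, per token.
    Hypothetical model dimensions, not taken from the repo."""
    return 2 * tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

fp16 = kv_cache_bytes(128_000, bytes_per_elem=2.0)  # ~16.8 GB
int4 = kv_cache_bytes(128_000, bytes_per_elem=0.5)  # ~4.2 GB
```

A 4× smaller footprint means roughly 4× more pages fit in the same fast-tier budget — a capacity change no eviction heuristic can match, which is why quantization moves hit rates far more than policy choice does.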
Engineering decisions worth noting
Fresh instances per run
Every benchmark run instantiates a fresh policy object. This was a critical correctness fix — an earlier version reused mutable policy instances across runs, contaminating later results with accumulated state from prior workloads. The factory pattern now matches the main benchmark pipeline.
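The fresh-instance discipline amounts to constructing from a factory rather than reusing an object. A minimal sketch with assumed names, not the repo's actual API:

```python
def make_policy_factory(policy_cls, **kwargs):
    """Return a zero-argument factory; each call builds a brand-new policy,
    so no mutable state (freq counters, regret scores) leaks across runs."""
    return lambda: policy_cls(**kwargs)

# Hypothetical usage: one fresh instance per (seed, workload) run.
# for seed in seeds:
#     policy = factory()
#     run_benchmark(policy, workload(seed))
```

The subtle bug this prevents: a policy that did well on an earlier workload carries inflated frequency and regret scores into the next one, silently biasing its ranking.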
Multi-seed variance
All ablation results report mean ± standard deviation across 3 independent seeds. The workload generators use deterministic seeding, so results are fully reproducible. Even modest seed variance occasionally shifts which configuration "wins" — single-seed ablation results should not be trusted for ordering claims.
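The aggregation itself is standard. A sketch of how the mean ± std figures in the tables can be computed (function name is illustrative):

```python
import statistics

def aggregate(values_by_seed):
    """Mean and sample standard deviation across seeds, as reported
    in the ablation tables (e.g. '2.48 ± 0.03 ms')."""
    mean = statistics.mean(values_by_seed)
    std = statistics.stdev(values_by_seed) if len(values_by_seed) > 1 else 0.0
    return mean, std
```

With only 3 seeds the std estimate is itself noisy, which is another reason ordering claims between close configurations should be treated with caution.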
Committed artifacts
All simulator outputs are committed to the repository. This includes raw JSON, CSV summaries, aggregated summaries with variance, run metadata (exact parameters used), and plots with error bars. Anyone can regenerate with a single command.
Known limitations
- Synthetic traces only. All results are simulator findings on generated access patterns. They are not runtime benchmarks and should not be treated as such.
- Synchronous engine. The simulator models latency additively (src_read + dst_write + transfer_ms) with no IO/compute overlap or concurrent bandwidth modeling. Real systems overlap memory movement with computation.
- Unbounded policy state. `RegretAwarePolicy` currently stores frequency, access times, and regret scores for every page ever seen. On very long traces, this will leak memory. A periodic state-culling hook would be needed for production use.
- No oracle baseline. A Bélády (optimal offline) comparison would bound how much of the achievable gap over LRU the regret-aware policy actually captures. This is future work.
- No real runtime integration. The harness is designed for eventual vLLM trace ingestion, but that is v0.2 roadmap material, not a current capability.
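A state-culling hook of the kind mentioned above could look like the following. The attribute names (`last_access`, `freq`, `regret`, `eviction_step`) are assumptions about the policy's bookkeeping, not the repo's actual fields:

```python
def cull_policy_state(policy, current_step, keep_horizon=4096):
    """Hypothetical periodic hook: drop bookkeeping for pages that have
    not been accessed within keep_horizon steps, bounding policy state."""
    stale = [page for page, last in policy.last_access.items()
             if current_step - last > keep_horizon]
    for page in stale:
        policy.last_access.pop(page, None)
        policy.freq.pop(page, None)
        policy.regret.pop(page, None)
        policy.eviction_step.pop(page, None)
    return len(stale)  # number of pages culled
```

The trade-off is that culling also forgets regret, so `keep_horizon` should comfortably exceed `regret_horizon` or the hook will erase exactly the signal the policy depends on.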
Where this leads
The ablation results are honest about what regret-aware eviction can and cannot do on synthetic traces. The more interesting questions are about what comes next — and they branch in several directions.
From synthetic to real traces. The most immediate credibility gap is the absence of runtime trace ingestion. Synthetic generators produce structurally clean access patterns; real serving traces carry correlation structures, phase transitions, and tail behaviors that no generator captures faithfully. The harness is designed for eventual replay of production traces (e.g., from vLLM's PagedAttention scheduler), and that is the clearest next step toward validating whether regret signals survive contact with real workloads.
Oracle upper bounds. Without a Bélády (optimal offline) baseline, it is impossible to know whether the observed 6.6% latency improvement captures 10% or 80% of the achievable gap over LRU. Adding an offline oracle would transform the results from "regret-aware is better than LRU" into "regret-aware captures X% of the available headroom" — a far more informative statement for policy design.
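On a recorded trace the oracle is straightforward to sketch, since the future is known offline. A minimal Bélády victim selector (illustrative, O(n) per lookup; a production version would precompute next-use indices):

```python
def belady_evict(resident, trace, pos):
    """Evict the resident page whose next use lies farthest in the future
    (or never occurs). Offline only: requires the full future trace."""
    def next_use(page):
        for i in range(pos, len(trace)):
            if trace[i] == page:
                return i
        return float("inf")  # never used again: the perfect victim
    return max(resident, key=next_use)
```

Replaying the same traces through this oracle would yield the hit-rate ceiling against which the 6.6% regret-aware improvement could be expressed as a fraction of available headroom.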
Hybrid and compression-aware policies. The finding that quantization often dominates policy differences suggests that the next generation of eviction logic should be compression-aware: deciding not just whether to evict a page, but whether to demote, compress, or recompute it. A policy that can choose between "keep at fp16 in tier 0," "compress to int4 and keep," or "evict and pay the re-promotion cost" would operate on a richer action space than pure eviction. Regret signals could naturally extend to that setting — tracking not just eviction regret, but compression regret.
The limits of trace-driven simulation. Finally, it is worth acknowledging that even with perfect traces and calibrated tier models, simulation cannot fully substitute for runtime integration. Transfer costs depend on bandwidth contention, compute overlap, and scheduling context that a synchronous simulator does not model. The harness is useful for comparing policies under controlled conditions and generating hypotheses about where regret signals help — but confirming those hypotheses requires measurement in a real serving stack.
The interesting question is not whether regret-aware eviction is the best KV-cache policy. It is whether the broader idea — that eviction policies should observe the consequences of their own decisions — leads to meaningfully better memory systems. This repo is one small, reproducible experiment in that direction.
Repository
The full harness, all policies, ablation scripts, committed artifacts, and tests are available on GitHub:
Reproduce the ablation:
```shell
python scripts/run_regret_ablation.py --output-dir artifacts/regret_ablation
```

Run tests:

```shell
python -m pytest
```