AI Inference · Memory Systems · Experimental Results

KV Hierarchy Lab: Regret-Aware Eviction

Published · 15 min read

A trace-driven research harness for studying KV-cache residency, eviction, quantization, and prefetch tradeoffs — featuring a regret-aware eviction policy with multi-seed ablation results on synthetic traces.

By Manish KL · Technical Essay + Experimental Results
Abstract

Long-context LLM inference is limited not just by compute, but by KV-cache residency and movement. When sequences exceed available fast-tier memory, the system must decide which KV pages stay resident, which get demoted, and which evictions are repeatedly costly. kv-hierarchy-lab provides a trace-driven, benchmark-first research harness for inspecting these tradeoffs. Its signature contribution is a regret-aware eviction policy that treats fast re-access after eviction as evidence that the eviction decision was wrong, and uses that signal to bias future decisions. A multi-seed ablation study isolates the regret mechanism's contribution across adversarial, recency-dominated, and periodic workloads under varying capacity pressure.

- 6.6% latency reduction vs LRU on adversarial bursts (medium)
- 198 ablation runs across 3 seeds × 2 scenarios × 3 workloads × 11 policies
- 6 eviction policies compared in the default sweep
- 4 quantization schemes modeled (fp16, fp8, int4, int2)

Why eviction policy matters for long-context inference

Modern long-context inference systems — serving 128K–1M token sequences — face a fundamental resource problem. KV caches grow linearly with sequence length and must be served from the memory tier closest to the accelerator. When they exceed fast-tier capacity, something must move. The question is what moves, and that choice determines both latency and bandwidth cost.

Most serving stacks use LRU or simple frequency-based eviction. These work well when the working set fits comfortably, but collapse under pressure. Adversarial access patterns — retrieval bursts that revisit recently-evicted pages, multi-turn conversations that alternate hot-and-cold segments — expose the weakness of stateless eviction: the system has no memory of its own mistakes.

Recent work increasingly explores eviction strategies that go beyond recency and frequency. Attention-score-guided retention, learned reuse predictors, and future-aware cache management all attempt to reason about which KV entries will matter later. Regret-aware eviction sits in this design space as a deliberately lightweight heuristic: rather than predicting future access with a learned model, it uses the observable signal of fast re-access after eviction as a cheap proxy for future miss cost. It is not state-of-the-art predictive caching — it is a minimal mechanism for learning from eviction mistakes without requiring attention-pattern analysis or model introspection.

The goal is not to "solve KV cache inference" in one repo. The goal is to provide a trace-driven, reproducible policy evaluation framework that systems engineers can extend, question, and calibrate.
[Figure 1: architecture diagram. Workloads (chat_continuation, rag_burst, adversarial_burst, periodic_reuse, + 3 more) drive a trace-driven simulation engine (trace → tier placement → metrics) over four tiers (Tier 0: GPU-fast, Tier 1: GPU-overflow, Tier 2: Host RAM, Tier 3: NVMe-like), with a pluggable policy hook (lru, cost_aware, heavy_hitter, predictive, regret_aware, windowed_recency), quantization options (fp16 · fp8 · int4 · int2), and committed artifacts (results.json, summary.csv, aggregated_summary.csv, run_metadata.json, plots/). The benchmark runner crosses scenario × workload × policy, with a fresh engine per run and aggregated metrics reported as mean ± std.]
Figure 1. kv-hierarchy-lab architecture. Traces flow through a configurable tier hierarchy with pluggable policies. Every run gets a fresh engine and policy instance to prevent state leakage.

The regret-aware eviction mechanism

The core idea is simple: if a page is re-accessed shortly after being evicted, the original eviction was probably wrong. The policy remembers this mistake and uses it to protect that page from future eviction.

The intuition

Every eviction is a bet. The policy bets that the evicted page won't be needed soon. When that bet fails — the page is re-accessed shortly after removal, forcing a costly re-promotion from a slower tier — the system pays a transfer and latency penalty that could have been avoided. Regret-aware eviction treats these failures as evidence. A fast re-access after eviction approximates the downstream cost of that eviction decision: the shorter the gap between eviction and re-access, the higher the "regret" — and the more the policy should resist evicting that page again in the future.

This is not a prediction of future access. It is closer to a feedback signal: the policy observes its own mistakes and adjusts. In reinforcement-learning terms, regret acts as a lightweight reward signal that biases the eviction scoring function toward protecting pages with demonstrated re-access demand. The mechanism is deliberately reactive rather than predictive — no model weights, no attention-score inspection, no external signals beyond the trace itself.

How it works

[Figure 2: decision-flow diagram. Example timeline: page A is accessed at step 10, evicted at step 20, and re-accessed at step 24; the gap of 4 steps is under the horizon of 24, so a regret penalty is applied. Page B is evicted at step 30 and re-accessed at step 62; the gap of 32 steps exceeds the horizon, so no penalty. Scoring rule: score(page) = freq_weight × frequency + recency_weight × (last_step / current_step) + regret_weight × regret. The eviction candidate is the minimum-score resident page, so high regret raises a page's score and protects it from eviction.]
Figure 2. Regret detection mechanism. A fast re-access after eviction (gap < horizon) triggers a penalty that anchors the page. Slow returns beyond the horizon incur no penalty.
Key design choice: Regret is a signal, not a guarantee. A high regret score biases the eviction decision, but does not veto it. Under extreme capacity pressure, even high-regret pages can still be evicted if they become the least-scored candidate. This keeps the policy grounded in practical memory constraints.
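The mechanism can be sketched in a few dozen lines. This is a hedged illustration, not the repo's implementation: the class and method names, and the linear penalty shape `(horizon - gap) / horizon`, are assumptions of mine; only the three-term scoring rule (frequency + recency + regret) comes from the policy description above.

```python
from collections import defaultdict

class RegretAwareEvictor:
    """Illustrative sketch of regret-aware eviction scoring."""

    def __init__(self, freq_weight=1.0, recency_weight=1.0,
                 regret_weight=6.0, regret_horizon=24):
        self.fw, self.rw, self.gw = freq_weight, recency_weight, regret_weight
        self.horizon = regret_horizon
        self.freq = defaultdict(int)      # lifetime access count per page
        self.last_step = {}               # last access step per page
        self.evicted_at = {}              # step at which each page was evicted
        self.regret = defaultdict(float)  # accumulated regret per page

    def on_access(self, page, step):
        self.freq[page] += 1
        self.last_step[page] = step
        # A fast re-access after eviction is evidence the eviction was wrong.
        if page in self.evicted_at:
            gap = step - self.evicted_at.pop(page)
            if gap < self.horizon:
                # Assumed penalty shape: shorter gap => larger penalty.
                self.regret[page] += (self.horizon - gap) / self.horizon

    def score(self, page, step):
        # The scoring rule: frequency + recency + regret terms.
        recency = self.last_step.get(page, 0) / max(step, 1)
        return (self.fw * self.freq[page]
                + self.rw * recency
                + self.gw * self.regret[page])

    def pick_victim(self, resident, step):
        # Lowest score loses; high regret protects a page but never vetoes.
        victim = min(resident, key=lambda p: self.score(p, step))
        self.evicted_at[victim] = step
        return victim
```

A page that bounces back within the horizon accumulates regret, raising its score and making it a less attractive victim next time, while remaining evictable under extreme pressure (regret biases, it does not veto).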

Ablation study: isolating the regret signal

The central experimental question is: when does the regret signal actually help, and when does it add nothing? To answer this cleanly, we run a dedicated ablation study that sweeps over regret parameters while holding everything else fixed.

Methodology

| Parameter | Values Swept | Purpose |
| --- | --- | --- |
| regret_weight | 1.0, 6.0, 12.0 | How strongly regret anchors a page |
| regret_horizon | 8, 24, 64 steps | How far back the policy remembers evictions |
| Scenario | medium, constrained | Fast-tier capacity pressure |
| Workload | chat_continuation, periodic_reuse, adversarial_burst | Access pattern structure |
| Baselines | LRU, cost_aware | Stateless and size-aware comparisons |
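One way to see where the headline run count of 198 comes from is to enumerate the grid directly. A hypothetical sketch (the function and constant names are mine, not the repo's API):

```python
# Illustrative enumeration of the ablation grid described above.
from itertools import product

REGRET_WEIGHTS = [1.0, 6.0, 12.0]
REGRET_HORIZONS = [8, 24, 64]
SCENARIOS = ["medium", "constrained"]
WORKLOADS = ["chat_continuation", "periodic_reuse", "adversarial_burst"]
BASELINES = ["lru", "cost_aware"]
SEEDS = [0, 1, 2]

def build_grid():
    # 3 weights x 3 horizons = 9 regret variants, plus 2 baselines
    # = 11 policies; 11 x 2 scenarios x 3 workloads x 3 seeds = 198 runs.
    policies = [("regret_aware", w, h)
                for w, h in product(REGRET_WEIGHTS, REGRET_HORIZONS)]
    policies += [(name, None, None) for name in BASELINES]
    return [(seed, scenario, workload, policy)
            for seed, scenario, workload, policy
            in product(SEEDS, SCENARIOS, WORKLOADS, policies)]

print(len(build_grid()))  # 198
```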

Results: where regret helps

Adversarial bursts under medium capacity

This workload alternates between early, middle, and late page sets — evicting pages and then immediately revisiting them. It is specifically designed to punish stateless eviction.

Under medium capacity, LRU averaged 2.48 ± 0.03 ms latency with a 0% fast-tier hit rate. The best regret-aware configuration (w=12.0, h=24) achieved 2.31 ± 0.03 ms — a 6.6% latency reduction — with 9.1% fewer bytes moved. The cost_aware baseline managed 2.37 ms, so regret-aware outperformed both.

Adversarial bursts under constrained capacity

Under severely constrained capacity (1/6 base), latencies climb sharply. LRU hit 6.15 ± 0.10 ms. The best regret configuration (w=12.0, h=64) cut bytes moved by 7.1% and latency by 6.0%. Longer horizons helped here because the adversarial trace structure specifically revisits recently-evicted pages — longer memory means more regret accumulation on the exact pages that need protection.

Horizon sensitivity

Under adversarial pressure, longer horizons (64 steps) consistently outperformed short horizons (8 steps) at the same weight. This makes intuitive sense: the adversarial workload revisits pages at gaps of 12–20 steps. A horizon of 8 simply cannot see those returns, so no regret is accumulated and the policy degrades to frequency+recency.
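A tiny numeric check makes the horizon argument concrete, assuming a linear penalty that is zero at or beyond the horizon (an illustrative shape, not necessarily the repo's exact formula):

```python
# Adversarial revisit gaps of 12-20 steps fall inside a 24-step horizon
# but outside an 8-step one, so h=8 accumulates no regret at all.
def regret_penalty(gap, horizon):
    # Penalty only when the re-access gap is shorter than the horizon.
    return (horizon - gap) / horizon if gap < horizon else 0.0

gaps = range(12, 21)  # typical adversarial revisit gaps
assert all(regret_penalty(g, 8) == 0.0 for g in gaps)   # h=8 is blind
assert all(regret_penalty(g, 24) > 0.0 for g in gaps)   # h=24 sees every return
```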

Results: where regret fails

Documenting failure cases is as important as documenting wins. The ablation reveals three distinct failure modes, each illuminating a different structural limitation of the regret mechanism.

Failure mode 1: deterministic sweeps (periodic reuse)

On periodic_reuse under both capacity levels, every regret-aware variant performed identically to LRU — 0% fast-tier hit rate, 640 misses, identical bytes moved. The periodic structure rotates through pages in a fixed cycle that completely overwhelms the fast tier. No finite regret horizon can adapt to a deterministic sweep that always exceeds capacity. Regret accumulates on every page equally, so the signal becomes uniform noise rather than a useful ranking.

Only cost_aware (which penalizes large pages, allowing more small pages to fit) salvaged any hits — achieving 12.5% hit rate under constrained capacity. This is significant: the only policy that helped was the one operating on a different signal dimension entirely (page size), not a smarter version of temporal reasoning.

Failure mode 2: overprotection under extreme pressure

Regret anchoring can backfire when miscalibrated. Under constrained capacity on chat_continuation, the w=1.0 variants achieved only 53% hit rate versus LRU's 66%. What happened? With a low fast-tier budget, regret anchoring overprotected a handful of previously-important pages that were no longer being accessed, preventing their eviction and blocking newer, genuinely hot pages from entering the fast tier. Stronger weights (w=12.0) partially corrected this by also boosting the frequency component, but the lesson is clear: regret without decay calibration can ossify the cache.

Honest finding: Regret is a reactive signal — it cannot anticipate deterministic sweeps that exceed capacity, and it can overprotect stale pages when miscalibrated. These are not bugs; they are structural limitations of any feedback-only mechanism operating without predictive capability.

Failure mode 3: no eviction pressure (chat continuation, medium)

On purely recency-heavy traces where the working set fits comfortably, LRU achieved 0.23 ms latency versus regret-aware's 0.25 ms. The regret mechanism added slight overhead without benefit — there are simply no regret-inducing eviction-then-re-access patterns when pages rarely get evicted. This is the most benign failure: the policy defaults to recency-like behavior with a small constant-time overhead per access. Under constrained capacity where the working set no longer fits, regret-aware variants with high weights (w=12.0) did recover and approach LRU's hit rate.

[Figure 3: ablation summary panels. When does regret help?]

- ✓ Regret helps: adversarial burst. LRU 2.48 ms → regret 2.31 ms (−6.6% latency, −9.1% bytes moved). The trace revisits recently-evicted pages, so regret accumulates on exactly the right pages. Best: w=12.0, h=24 (medium).
- ≈ Neutral: chat continuation (medium). LRU 0.23 ms ≈ regret 0.25 ms. The working set fits, so there are no eviction-then-re-access patterns to learn from; the policy degrades safely to recency.
- ✗ Regret fails: periodic reuse. All policies: 0% hit rate, 640 misses; regret and LRU are identical. The deterministic sweep exceeds capacity regardless of order, so no finite horizon helps; only cost_aware salvages hits.
Figure 3. Ablation summary across three workload categories. Regret helps most when the trace specifically revisits recent eviction victims. It adds nothing on periodic sweeps, and degrades gracefully on recency-dominated traces.

What the baselines teach

Every eviction heuristic answers a different question about a page's value. The comparison is not about declaring a winner — it is about understanding what signal each policy uses and what it is structurally blind to.

| Policy | Primary Signal | What It Captures | Structural Blindspot |
| --- | --- | --- | --- |
| lru | Last access time | Temporal locality; optimal when the working set fits in the fast tier | No memory of past mistakes; blind to page size; thrashes when capacity is tight |
| heavy_hitter | Access frequency | Long-term popularity; protects consistently hot pages | Slow to adapt to phase changes; frequency ≠ future reuse |
| cost_aware | Frequency + recency + page size | Multi-signal composite; penalizes large pages, keeping more small ones resident | Does not learn from eviction outcomes; fixed weight balance |
| regret_aware | Frequency + recency + eviction feedback | Adapts to its own mistakes in real time; anchors pages with demonstrated re-access demand | Reactive only: cannot help on deterministic sweeps; adds state overhead; neutral when no evictions occur |

The key distinction is that LRU, heavy_hitter, and cost_aware operate on access statistics alone — they never observe the consequences of their eviction decisions. Regret-aware eviction closes that feedback loop: it watches what happens after an eviction and adjusts accordingly. This is the mechanism's entire contribution — and also its limitation, since the feedback is only useful when evictions are both frequent and consequential.

Quantization changes the game

One of the most important findings from the default sweep — not the ablation — is that quantization often dominates policy differences. Under the same small tier budget, switching from fp16 to int4 on a RAG burst workload raised overall hit rate from 0.459 to 0.771 and reduced bytes moved by 93.9%. No policy tuning produces that kind of gain.

This means that in practice, the first lever to pull is often footprint reduction, not eviction sophistication. Regret-aware eviction matters most when quantization has already been applied and residency pressure remains.
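The footprint arithmetic behind that finding is simple, assuming the element widths of the four modeled schemes; the tier budget and page size below are illustrative numbers, not values from the sweep:

```python
# int4 stores 4x more KV pages than fp16 in the same fast-tier budget,
# which is why footprint reduction dominates policy tuning under pressure.
BYTES_PER_ELEMENT = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5, "int2": 0.25}

def pages_resident(budget_bytes, elems_per_page, scheme):
    return int(budget_bytes // (elems_per_page * BYTES_PER_ELEMENT[scheme]))

budget = 64 * 1024 * 1024   # 64 MiB fast tier (illustrative)
page = 256 * 1024           # 256K elements per KV page (illustrative)
fp16_pages = pages_resident(budget, page, "fp16")
int4_pages = pages_resident(budget, page, "int4")
assert int4_pages == 4 * fp16_pages  # 4x residency from the same budget
```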

Engineering decisions worth noting

Fresh instances per run

Every benchmark run instantiates a fresh policy object. This was a critical correctness fix — an earlier version reused mutable policy instances across runs, contaminating later results with accumulated state from prior workloads. The factory pattern now matches the main benchmark pipeline.
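The fix can be sketched as passing a factory rather than an instance; `run_benchmark` and `CountingPolicy` are hypothetical names used only to demonstrate the state-leakage hazard:

```python
# The runner takes a zero-argument factory, not a policy object, so every
# run starts from clean state instead of inheriting a prior run's counters.
def run_benchmark(trace, policy_factory):
    policy = policy_factory()  # fresh, unshared state for this run
    for step, page in enumerate(trace):
        policy.on_access(page, step)
    return policy

class CountingPolicy:
    def __init__(self):
        self.accesses = 0
    def on_access(self, page, step):
        self.accesses += 1

# Reusing one mutable instance would double-count; a factory does not.
first = run_benchmark(["a", "b", "a"], CountingPolicy)
second = run_benchmark(["c"], CountingPolicy)
assert (first.accesses, second.accesses) == (3, 1)
```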

Multi-seed variance

All ablation results report mean ± standard deviation across 3 independent seeds. The workload generators use deterministic seeding, so results are fully reproducible. Even modest seed variance occasionally shifts which configuration "wins" — single-seed ablation results should not be trusted for ordering claims.
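The mean ± std aggregation itself is standard; a minimal sketch with illustrative (not actual) per-seed latencies:

```python
# Per-configuration aggregation across seeds; statistics.stdev is the
# sample standard deviation (n-1 denominator), appropriate for 3 seeds.
from statistics import mean, stdev

def aggregate(latencies_ms):
    return mean(latencies_ms), stdev(latencies_ms)

m, s = aggregate([2.45, 2.48, 2.51])  # three seeds, illustrative values
```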

Committed artifacts

All simulator outputs are committed to the repository. This includes raw JSON, CSV summaries, aggregated summaries with variance, run metadata (exact parameters used), and plots with error bars. Anyone can regenerate with a single command.

Known limitations and where this leads

The ablation results are honest about what regret-aware eviction can and cannot do on synthetic traces. The more interesting questions are about what comes next — and they branch in several directions.

From synthetic to real traces. The most immediate credibility gap is the absence of runtime trace ingestion. Synthetic generators produce structurally clean access patterns; real serving traces carry correlation structures, phase transitions, and tail behaviors that no generator captures faithfully. The harness is designed for eventual replay of production traces (e.g., from vLLM's PagedAttention scheduler), and that is the clearest next step toward validating whether regret signals survive contact with real workloads.

Oracle upper bounds. Without a Bélády (optimal offline) baseline, it is impossible to know whether the observed 6.6% latency improvement captures 10% or 80% of the achievable gap over LRU. Adding an offline oracle would transform the results from "regret-aware is better than LRU" into "regret-aware captures X% of the available headroom" — a far more informative statement for policy design.
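For reference, a Bélády oracle is straightforward to add to a trace-driven harness, since the full trace is known offline. A sketch under simplifying assumptions of my own (uniform page sizes, a single fast tier of fixed capacity):

```python
# Belady's optimal offline policy: on a miss at full capacity, evict the
# resident page whose next use lies farthest in the future. Counting its
# hits on a trace gives the upper bound against which LRU and regret-aware
# improvements can be expressed as a fraction of available headroom.
def belady_hits(trace, capacity):
    # Precompute, for each position, the next index at which that page recurs.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    resident, hits = set(), 0
    upcoming = {}  # page -> next access index, as of its latest access
    for i, page in enumerate(trace):
        if page in resident:
            hits += 1
        elif len(resident) >= capacity:
            # Evict the resident page reused farthest in the future.
            victim = max(resident, key=lambda p: upcoming.get(p, float("inf")))
            resident.discard(victim)
        resident.add(page)
        upcoming[page] = next_use[i]
    return hits
```

On the toy trace `['a', 'b', 'a', 'c', 'b', 'a']` with capacity 2, this oracle achieves 2 hits where LRU manages only 1, illustrating the gap the harness could quantify.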

Hybrid and compression-aware policies. The finding that quantization often dominates policy differences suggests that the next generation of eviction logic should be compression-aware: deciding not just whether to evict a page, but whether to demote, compress, or recompute it. A policy that can choose between "keep at fp16 in tier 0," "compress to int4 and keep," or "evict and pay the re-promotion cost" would operate on a richer action space than pure eviction. Regret signals could naturally extend to that setting — tracking not just eviction regret, but compression regret.

The limits of trace-driven simulation. Finally, it is worth acknowledging that even with perfect traces and calibrated tier models, simulation cannot fully substitute for runtime integration. Transfer costs depend on bandwidth contention, compute overlap, and scheduling context that a synchronous simulator does not model. The harness is useful for comparing policies under controlled conditions and generating hypotheses about where regret signals help — but confirming those hypotheses requires measurement in a real serving stack.

The interesting question is not whether regret-aware eviction is the best KV-cache policy. It is whether the broader idea — that eviction policies should observe the consequences of their own decisions — leads to meaningfully better memory systems. This repo is one small, reproducible experiment in that direction.

Repository

The full harness, all policies, ablation scripts, committed artifacts, and tests are available on GitHub:

Repository: github.com/manishklach/kv-hierarchy-lab
Reproduce the ablation: python scripts/run_regret_ablation.py --output-dir artifacts/regret_ablation
Run tests: python -m pytest