MANISH AI

Deep Dive · June 2026

KV Cache Isn't Anonymous Memory

What if your GPU memory manager knew which tensor was about to miss a decode deadline? That's the question kv_deadline_scheduler is built to answer.

Manish Klach · 2,876 lines of Python · MIT License
5 eviction policies · 5 workload profiles · p99 latency tracked · 0 vLLM patches needed

The problem with treating HBM like a page cache

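When a long-context LLM is serving dozens of concurrent requests, its GPU high-bandwidth memory (HBM) fills up fast. A single 128k-token request to a Llama-3-8B-class model occupies roughly 16 GB of KV cache in FP16 (32 layers, 8 grouped-query KV heads, head dimension 128), and about 64 GB without grouped-query attention, so a handful of such requests can exhaust an 80 GB A100. Under pressure, something has to be evicted.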
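For a back-of-envelope check on that footprint, the sketch below multiplies out the KV shape. It assumes the published Llama-3-8B attention configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and FP16 storage; the function names are illustrative and not part of kv_deadline_scheduler.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token needs across all layers."""
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 dtype_bytes: int = 2) -> float:
    """Total KV cache for one request, in GiB."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
    return tokens * per_token / 2**30


# Assumed shape: 32 layers, head_dim 128, FP16, 128k (131,072) tokens.
with_gqa = kv_cache_gib(131_072, layers=32, kv_heads=8, head_dim=128)
without_gqa = kv_cache_gib(131_072, layers=32, kv_heads=32, head_dim=128)
print(f"GQA (8 KV heads):  {with_gqa:.1f} GiB")
print(f"MHA (32 KV heads): {without_gqa:.1f} GiB")
```

Grouped-query attention accounts for the 4× gap between the two results. Lower-precision KV storage would shrink either figure in proportion to `dtype_bytes`.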

Classical eviction policies (LRU, clock, FIFO) treat HBM blocks the way a kernel page allocator treats DRAM pages: as interchangeable, anonymous chunks of bytes. The least recently used block gets the boot. But a KV-cache block is not anonymous. It belongs to a request, it encodes a specific sequence position, it will be needed at a specific decode step, and evicting it mid-decode causes a latency penalty that can be 10–100× the cost of an ordinary cache miss.

"Generic memory tiering asks: Is this page hot?
KV Deadline Scheduler asks: Which KV block belongs to decode-critical request-state, how close is it to missing its deadline, and what is the cost of evicting it?"
— from the project README

This is the central insight of kv_deadline_scheduler: you can do dramatically better than LRU by propagating a small amount of semantic intent — deadline, priority, phase, slack — downward from the serving layer to the memory manager.

| Tier | Bandwidth | Note |
| --- | --- | --- |
| HBM | 3.35 TB/s | H100 SXM · 80 GB |
| DRAM | ~200 GB/s | Spill target |
| CXL | ~50 GB/s | Future tier |
| NVMe | ~7 GB/s | Last resort |

The bandwidth cliff is real. Spilling a KV block out of HBM and reloading it later costs roughly 200–5000 µs depending on the tier it landed in: about 200 µs from DRAM and up to about 5000 µs from NVMe. For a decode step with a 1 ms deadline, a single bad eviction decision can consume most of the budget or blow it entirely. LRU has no way to know that; it sees only a cold access timestamp.

[Figure: HBM 3,350 GB/s (miss penalty ≈ 0 µs) → DRAM 200 GB/s (~200 µs reload) → CXL 50 GB/s (~800 µs reload) → NVMe 7 GB/s (~5000 µs reload). Memory bandwidth drops as tier depth increases.]

Memory tier bandwidth cliff — eviction across tiers is not symmetric
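One rough way to reason about these reload penalties is to treat them as bandwidth-bound transfers. The sketch below does that with the tier bandwidths quoted above. The 40 MB transfer size is an assumption chosen for illustration (the project does not state it), fixed per-request latency is ignored, and the helper names are hypothetical.

```python
# Tier bandwidths quoted in the article, in bytes per second.
TIER_BANDWIDTH_BPS = {
    "HBM": 3_350e9,
    "DRAM": 200e9,
    "CXL": 50e9,
    "NVMe": 7e9,
}


def reload_time_us(size_bytes: float, tier: str) -> float:
    """Bandwidth-only transfer time in microseconds (no fixed latency)."""
    return size_bytes / TIER_BANDWIDTH_BPS[tier] * 1e6


def fits_deadline(size_bytes: float, tier: str, deadline_us: float) -> bool:
    """True if a reload from `tier` finishes within the decode deadline."""
    return reload_time_us(size_bytes, tier) <= deadline_us


size = 40e6  # hypothetical 40 MB reload
for tier in TIER_BANDWIDTH_BPS:
    t = reload_time_us(size, tier)
    ok = fits_deadline(size, tier, deadline_us=1_000)
    print(f"{tier:5s} {t:8.1f} us  fits 1 ms deadline: {ok}")
```

Under this assumption a DRAM or CXL reload still fits inside a 1 ms decode budget, while an NVMe reload does not. That asymmetry is what a deadline-aware policy is meant to exploit.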

The MemoryIntent schema: giving blocks a voice

The project's most important contribution is MemoryIntent, the dataclass in schema.py that every KV block carries through its lifecycle. Instead of tracking a block by a raw address and an LRU timestamp, it captures:

# src/kv_memory_intent/schema.py (abridged)
@dataclass(slots=True)
class MemoryIntent:
    object_id:      str          # e.g. "req-007:block:14"
    request_id:     str
    phase:          Phase        # PREFILL | DECODE | VERIFY | DONE …
    priority:       Priority     # COLD | WARM | HOT | DECODE_CRITICAL
    deadline_us:    int | None   # microsecond deadline from now
    slack_us:       int | None   # time before deadline becomes critical
    request_priority: int        # 0–100 business priority
    recompute_cost_us: int       # cost to recompute if evicted
    target_decode_step: int      # which step needs this block
    pin_requested:  bool         # DECODE_CRITICAL auto-pins
    compression_ok: bool
    recompute_ok:   bool
    prefetch_ok:    bool

Two scoring methods live on the schema itself — effective_deadline_score() measures urgency to protect, and eviction_risk_score() combines urgency with cost of removal. This means the data object carries its own risk calculus, not just raw fields.

Priority taxonomy

The four priority levels map directly to what the serving engine knows about a block at any moment:

COLD · WARM · HOT · DECODE_CRITICAL

DECODE_CRITICAL automatically sets pin_requested=True in __post_init__, so a policy cannot accidentally evict a block that a decode step is actively depending on. The one exception is when every block in HBM is pinned and the policy has no other choice.

[Figure: KV block lifecycle. States shown: ALLOCATED (WARM / PREFILL), ACCESSED (recency ↑), DECODE_CRITICAL (pin_requested = True, deadline_us set), COMMITTED (is_committed=T), FREED (evicted from live), SPILLED (→ DRAM tier), PREFETCHED (DRAM → HBM). Events drive the simulator's state machine.]

MemoryIntentEvent types and the transitions they drive in KVMemorySimulator

Five policies, one ladder

The codebase implements an ascending ladder of eviction intelligence. Each policy is a subclass of PlacementPolicy, overriding choose_victim() and optionally should_prefetch() and explain_victim_choice(). The ladder is designed to be pedagogically clear: each rung adds exactly one new dimension of information.

Columns: Policy | Access history | Predicted hotness | Priority / phase | Deadline / slack | Prefetch
LRU
HotCold
PredictiveHotness: ✓ + reuse window
IntentAware: partial, limited
DeadlineAware ★: partial, ✓✓
DeadlineAware scoring — how it works

DeadlineAwarePolicy extends IntentAwarePolicy and computes a floating-point eviction score for each candidate block. Higher score means "more evictable." Key contributions:

# Abridged excerpt: `score` accumulates per block; higher = more evictable.

# Deadline nearness: the closer to missing, the more we protect
if block.deadline_us is not None:        # deadline_us may be None (see schema)
    if block.deadline_us <= 1_000:
        score -= 320.0   # under 1 ms: heavily shield
    elif block.deadline_us <= 5_000:
        score -= 140.0   # under 5 ms: strong shield
    else:
        score -= 25.0    # more distant: mild shield

# Slack is even stronger when available
if block.slack_us is not None:
    if block.slack_us <= 500:
        score -= 260.0
    elif block.slack_us <= 2_000:
        score -= 120.0

# Phase and priority
if block.phase == Phase.DECODE:
    score -= 110.0
if block.priority == Priority.DECODE_CRITICAL:
    score -= 400.0

# Cold DONE blocks are preferred victims
if block.phase == Phase.DONE:
    score += 40.0
if block.priority == Priority.COLD and block.deadline_us is None:
    score += 35.0

The explain_victim_choice() method on every policy produces a human-readable string logging why a specific block was chosen for eviction — enabling offline audit of every decision the simulator made, down to the individual step.

Simulated p99 decode latency under HBM pressure, relative to LRU (lower = better): LRU 100%, HotCold ~86%, Predictive ~75%, IntentAware ~55%, Deadline ★ ~36%. Note: values are illustrative from prototype simulation, not production benchmarks.

Relative p99 latency by policy under deadline_pressure workload profile (simulated)

Architecture: observability-first

One of the project's sharpest design choices is the separation between observation and actuation. Phase one — where the project sits now — is purely observational. The serving engine doesn't change. No vLLM patches. No kernel modifications. Just trace ingestion and offline policy replay.
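Offline replay needs very little machinery. The sketch below is a toy two-tier replay loop under stated assumptions: block-granular accesses, an LRU-managed HBM of fixed size, a flat reload penalty for spilled blocks, and nearest-rank percentiles. It is not the project's KVMemorySimulator, and all names and latency constants are hypothetical.

```python
import math
from collections import OrderedDict


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def replay_lru(accesses: list[str], hbm_blocks: int,
               hit_us: float = 50.0, reload_us: float = 200.0) -> dict:
    """Replay a block-access trace against an LRU-managed HBM."""
    hbm: OrderedDict[str, None] = OrderedDict()   # insertion order = recency
    spilled: set[str] = set()                     # blocks evicted to DRAM
    latencies: list[float] = []
    misses = 0
    for block_id in accesses:
        if block_id in hbm:
            hbm.move_to_end(block_id)             # refresh recency on a hit
            latencies.append(hit_us)
            continue
        if block_id in spilled:                   # miss: reload from DRAM
            spilled.discard(block_id)
            misses += 1
            latencies.append(hit_us + reload_us)
        else:                                     # first touch: plain allocation
            latencies.append(hit_us)
        if len(hbm) >= hbm_blocks:
            victim, _ = hbm.popitem(last=False)   # least recently used
            spilled.add(victim)
        hbm[block_id] = None
    return {
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
        "misses": misses,
    }


trace = ["a", "b", "c", "a", "b", "c"]
print(replay_lru(trace, hbm_blocks=2))
print(replay_lru(trace, hbm_blocks=3))
```

Because the loop consumes only a recorded trace, different policies can be compared on identical inputs without touching a serving engine.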

INPUTS
  Serving logs / request traces (trace_importer.py · adapters/)
  Synthetic workload generator (simulator.py → 5 profiles)
  KV footprint estimator (kv_estimator.py · ModelKVConfig)
MemoryIntentEvent (schema.py + events.py) → JSONL trace stream (step · event_type · intent)
POLICIES + SIMULATION
  Policy engine (policies.py): LRU · HotCold · Predictive · IntentAware · DeadlineAware
  KVMemorySimulator: HBM + DRAM tiers · miss / spill / prefetch
  SimulationResult: p50 / p95 / p99 · decode_critical_misses
  kvmi CLI: compare · inspect · estimate-kv · import-request-trace · mock-vllm
No runtime modification to vLLM or any serving engine required.

Full system architecture of kv_deadline_scheduler

Intent is emitted per KV block lifecycle, not per page fault. This keeps the overhead low and the design portable across HBM, DRAM, CXL, and NVMe backends without any hot-path callbacks.
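As an illustration of lifecycle-granular emission, the sketch below writes one JSONL record per transition. The three top-level keys (step, event_type, intent) follow the trace stream described in the architecture figure. The `IntentEmitter` class and the contents of the intent payload are hypothetical.

```python
import io
import json


class IntentEmitter:
    """Hypothetical emitter: one JSONL line per lifecycle transition."""

    def __init__(self, stream) -> None:
        self.stream = stream
        self.step = 0

    def advance(self) -> None:
        """Move to the next simulated step."""
        self.step += 1

    def emit(self, event_type: str, intent: dict) -> None:
        record = {"step": self.step, "event_type": event_type, "intent": intent}
        self.stream.write(json.dumps(record) + "\n")


buffer = io.StringIO()
emitter = IntentEmitter(buffer)
emitter.emit("ALLOCATED", {"object_id": "req-007:block:14", "phase": "PREFILL"})
emitter.advance()
emitter.emit("FREED", {"object_id": "req-007:block:14", "phase": "DONE"})
print(buffer.getvalue(), end="")
```

However many times the block is read in between, the stream here holds two records for it. That is the overhead argument in miniature.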

External trace import — no serving engine access needed

If you have request logs from any OpenAI-compatible serving proxy, the trace_importer.py adapter converts them into approximate KV block lifecycle events. A typical import-and-replay run looks like this:

# Import real request logs and replay them
kvmi import-request-trace \
  --requests examples/sample_request_trace.jsonl \
  --model llama-3-8b \
  --out imported_trace.jsonl \
  --logical-block-mb 1

kvmi compare \
  --trace imported_trace.jsonl \
  --hbm-mb 4096 \
  --dram-mb 65536

The importer estimates KV block footprint from token counts and model configuration (ModelKVConfig in kv_estimator.py), reconstructs approximate PREFILL → DECODE → FREED lifecycle events, and replays them through all five policies simultaneously, emitting a comparison table of p50/p95/p99 and miss counts.
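The conversion from a request record to lifecycle events can be approximated as below. This is a sketch under stated assumptions, not the project's trace_importer.py: the request field names are hypothetical, the per-token footprint assumes a Llama-3-8B shape in FP16, and blocks are freed all at once at the end of the request.

```python
import math

# Assumed: 2 (K and V) * 32 layers * 8 KV heads * 128 head_dim * 2 bytes (FP16).
KV_BYTES_PER_TOKEN = 131_072


def blocks_needed(tokens: int, block_mb: int = 1) -> int:
    """Number of logical blocks required to hold `tokens` worth of KV cache."""
    return math.ceil(tokens * KV_BYTES_PER_TOKEN / (block_mb * 2**20))


def request_to_events(request: dict, block_mb: int = 1) -> list[dict]:
    """Expand one request log record into ALLOCATED and FREED block events."""
    rid = request["request_id"]
    prompt_blocks = blocks_needed(request["prompt_tokens"], block_mb)
    total_tokens = request["prompt_tokens"] + request["generated_tokens"]
    total_blocks = blocks_needed(total_tokens, block_mb)
    events = []
    for i in range(total_blocks):
        # Blocks that hold prompt tokens belong to prefill, the rest to decode.
        phase = "PREFILL" if i < prompt_blocks else "DECODE"
        events.append({"event_type": "ALLOCATED",
                       "object_id": f"{rid}:block:{i}", "phase": phase})
    for i in range(total_blocks):
        events.append({"event_type": "FREED",
                       "object_id": f"{rid}:block:{i}", "phase": "DONE"})
    return events


sample = {"request_id": "req-001", "prompt_tokens": 16, "generated_tokens": 8}
for event in request_to_events(sample):
    print(event)
```

The output is approximate by construction. Request logs record token counts, not block addresses, so block boundaries and timing are inferred rather than observed.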

Five synthetic workload profiles

The generate_synthetic_kv_workload() function in simulator.py produces realistic-feeling event streams from five named profiles. Each profile tunes the mix of long-context requests, speculative decode drafts, high-priority SLA-bound requests, and deadline density.

| Profile | Long-context | Deadline density | Reuse scale | Designed for |
| --- | --- | --- | --- | --- |
| balanced | 30% | 25% | 1.0× | General mixed workload |
| deadline_pressure | 40% | 65% | 0.7× | Stress-test SLA compliance |
| rag_mixed_priority | 50% | 20% | 1.6× | Retrieval-augmented pipelines |
| speculative_decode | 30% | 30% | 0.9× | Draft / verify token patterns |
| long_context_extreme | 75% | 15% | 2.8× | 128k+ context stress test |

Synthetic workload profile parameter summary

Each profile drives different eviction decisions. In long_context_extreme, most blocks have no deadline at all (they're deep context that may never be accessed again), making them excellent eviction candidates even for LRU. In deadline_pressure, 65% of steps involve a block within 1 ms of its decode deadline, and this is exactly where LRU fails catastrophically: it evicts whichever block has gone longest without an access, even when that block is about to be needed by a deadline-bound decode step.
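A seeded generator along these lines shows how the profile parameters could shape a workload. The parameter values are taken from the profile table. The `Profile` type, the output fields, and the sampling scheme are assumptions, not the behaviour of generate_synthetic_kv_workload().

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:  # hypothetical container for the table's parameters
    long_context: float       # fraction of long-context requests
    deadline_density: float   # fraction of decode steps carrying a deadline
    reuse_scale: float        # carried for completeness; unused in this sketch


PROFILES = {
    "balanced": Profile(0.30, 0.25, 1.0),
    "deadline_pressure": Profile(0.40, 0.65, 0.7),
    "rag_mixed_priority": Profile(0.50, 0.20, 1.6),
    "speculative_decode": Profile(0.30, 0.30, 0.9),
    "long_context_extreme": Profile(0.75, 0.15, 2.8),
}


def generate_requests(profile_name: str, requests: int, decode_steps: int,
                      seed: int = 0) -> list[dict]:
    """Sample per-request context length class and per-step deadlines."""
    profile = PROFILES[profile_name]
    rng = random.Random(seed)   # seeded so a run can be reproduced exactly
    out = []
    for i in range(requests):
        long_ctx = rng.random() < profile.long_context
        deadlines = [
            1_000 if rng.random() < profile.deadline_density else None
            for _ in range(decode_steps)
        ]
        out.append({"request_id": f"req-{i:03d}",
                    "long_context": long_ctx,
                    "step_deadline_us": deadlines})
    return out


reqs = generate_requests("deadline_pressure", requests=16, decode_steps=256)
steps = [d for r in reqs for d in r["step_deadline_us"]]
share = sum(d is not None for d in steps) / len(steps)
print(f"steps with a deadline: {share:.0%}")
```

Seeding matters more than it looks: policy comparisons are only meaningful when every policy sees exactly the same event stream.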

Getting started in 60 seconds

The package is a standard Python 3.11+ project with zero runtime dependencies beyond the stdlib. Install it, run the tests, and immediately compare all five policies on a synthetic trace:

# Install
git clone https://github.com/manishklach/kv_deadline_scheduler
cd kv_deadline_scheduler
pip install -e .
pytest

# Estimate KV footprint for a Llama-3-8B 128k context request
kvmi estimate-kv \
  --model llama-3-8b \
  --prompt-tokens 128000 \
  --generated-tokens 1000

# Run all 5 policies against the deadline_pressure synthetic profile
kvmi compare \
  --profile deadline_pressure \
  --hbm-mb 128 \
  --dram-mb 2048 \
  --requests 16 \
  --decode-steps 256

# Import real request logs and replay
kvmi import-request-trace \
  --requests examples/sample_request_trace.jsonl \
  --model llama-3-8b \
  --out imported_trace.jsonl \
  --logical-block-mb 1

kvmi compare \
  --trace imported_trace.jsonl \
  --hbm-mb 4096 \
  --dram-mb 65536

Decision logs are written as JSONL via KVMemorySimulator.write_decision_log(), giving you a step-by-step record of every eviction: which block was chosen, why, whether a DECODE_CRITICAL block was avoided, and what the competing candidates were.
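Because the log is plain JSONL, auditing it takes a few lines. The sketch below counts evictions per policy and flags any that hit a DECODE_CRITICAL block. The record fields used here (policy, step, victim, victim_priority, reason) are assumptions about the log layout, not a documented schema.

```python
import io
import json
from collections import Counter


def audit_decision_log(stream) -> tuple[Counter, list[dict]]:
    """Return (evictions per policy, records that evicted a critical block)."""
    per_policy: Counter = Counter()
    critical_evictions: list[dict] = []
    for line in stream:
        if not line.strip():
            continue                      # tolerate blank lines
        record = json.loads(line)
        per_policy[record["policy"]] += 1
        if record.get("victim_priority") == "DECODE_CRITICAL":
            critical_evictions.append(record)
    return per_policy, critical_evictions


# Hypothetical log content for illustration.
sample_log = io.StringIO(
    '{"policy": "lru", "step": 12, "victim": "req-007:block:14", '
    '"victim_priority": "DECODE_CRITICAL", "reason": "oldest access"}\n'
    '{"policy": "deadline_aware", "step": 12, "victim": "req-002:block:3", '
    '"victim_priority": "COLD", "reason": "request done"}\n'
)
counts, flagged = audit_decision_log(sample_log)
print(dict(counts))
print([r["victim"] for r in flagged])
```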

The road ahead

The project README is admirably honest: current results are simulated and prototype-oriented. The numbers show policy differences, not production speedups. The roadmap has three natural phases:

Phase 1 (done ✓): schema + policy ladder; simulator + synthetic workloads; external trace import + CLI.
Phase 2 (in progress): OpenAI proxy log adapter; Prometheus GPU telemetry; calibrate against real p99.
Phase 3 (future): advisory scheduler (soft hints); runtime actuation (pin / spill); CXL / NVMe tiering hooks.

Three-phase project roadmap: observe → calibrate → actuate

The key architectural bet is that the same MemoryIntent ABI survives all three phases. Phase 1 declares intent in offline traces. Phase 2 validates the model against real GPU memory telemetry. Phase 3 plugs the policy decisions back in — first as soft advisory hints, then as hard placement decisions. The schema doesn't change; only the enforcement layer is added.

What's strong now

The MemoryIntent schema is production-quality — rich fields, validated invariants, and built-in risk scoring. The policy ladder is pedagogically clean and the decision logs enable full audit. Zero runtime dependencies means easy evaluation.

What's next

Magic numbers in scoring weights need documented rationale. The simulator's recency decay is event-count-based rather than step-based. Calibration against real GPU memory telemetry is the critical gap between prototype and production.

Why it matters

As LLMs push to 1M+ token contexts, KV cache is no longer a background concern — it's the primary bottleneck. A scheduling framework that understands deadlines is the right abstraction at the right time.

Who should try it

Inference platform engineers running vLLM or similar; researchers studying KV reuse and eviction under pressure; anyone who has seen p99 latency spike and suspected a bad eviction decision was to blame.