The problem with treating HBM like a page cache
When a long-context LLM is serving dozens of concurrent requests, its GPU high-bandwidth memory (HBM) fills up fast. A single Llama-3-8B request with 128k tokens occupies roughly 16 GB of KV-cache in FP16 (32 layers × 8 KV heads × 128-dim heads × 2 tensors × 2 bytes ≈ 128 KiB per token), so a handful of such requests exhausts even an 80 GB A100. Under pressure, something has to be evicted.
Classical eviction policies (LRU, clock, FIFO) treat HBM blocks the way a kernel page allocator treats DRAM pages: as interchangeable, anonymous chunks of bytes. Under LRU, the least recently used block gets the boot. But a KV-cache block is not anonymous. It belongs to a request, it encodes a specific sequence position, it will be needed at a specific decode step, and evicting it mid-decode causes a latency penalty that can be 10–100× the cost of an ordinary cache miss.
"Generic memory tiering asks: Is this page hot?
KV Deadline Scheduler asks: Which KV block belongs to decode-critical request-state, how close
is it to missing its deadline, and what is the cost of evicting it?"
— from the project README
This is the central insight of kv_deadline_scheduler: you can do dramatically better than LRU by propagating a small amount of semantic intent — deadline, priority, phase, slack — downward from the serving layer to the memory manager.
The bandwidth cliff is real. Evicting a KV block from HBM to a slower tier (DRAM, CXL, or NVMe) and then reloading it costs roughly 200–5000 µs depending on the tier. For a decode step with a 1 ms deadline, a single bad eviction decision can blow the budget entirely. LRU has no way to know that; it just sees a cold access timestamp.
Figure: Memory tier bandwidth cliff (eviction across tiers is not symmetric)
The MemoryIntent schema: giving blocks a voice
The project's most important contribution is MemoryIntent, the dataclass in
schema.py that every KV block carries through its lifecycle. Instead of tracking
a block by a raw address and an LRU timestamp, it captures:
    # src/kv_memory_intent/schema.py (abridged)
    # Phase and Priority are defined elsewhere in the project (not shown).
    from dataclasses import dataclass

    @dataclass(slots=True)
    class MemoryIntent:
        object_id: str              # e.g. "req-007:block:14"
        request_id: str
        phase: Phase                # PREFILL | DECODE | VERIFY | DONE …
        priority: Priority          # COLD | WARM | HOT | DECODE_CRITICAL
        deadline_us: int | None     # microsecond deadline from now
        slack_us: int | None        # time before deadline becomes critical
        request_priority: int       # 0–100 business priority
        recompute_cost_us: int      # cost to recompute if evicted
        target_decode_step: int     # which step needs this block
        pin_requested: bool         # DECODE_CRITICAL auto-pins
        compression_ok: bool
        recompute_ok: bool
        prefetch_ok: bool
Two scoring methods live on the schema itself — effective_deadline_score() measures
urgency to protect, and eviction_risk_score() combines urgency with cost of removal.
This means the data object carries its own risk calculus, not just raw fields.
Priority taxonomy
The four priority levels (COLD, WARM, HOT, DECODE_CRITICAL) map directly to what the serving engine knows about a block at any moment.
DECODE_CRITICAL automatically sets pin_requested=True in
__post_init__, making it impossible for a policy to accidentally evict a block
that a decode step is actively depending on — unless every block in HBM is pinned and the
policy has no other choice.
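The auto-pin invariant and its fallback can be illustrated with a minimal sketch. The class `IntentSketch`, its reduced field set, the enum values, and the helper `eviction_candidates` are hypothetical; only the priority names, `pin_requested`, and the use of `__post_init__` come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    # Level names follow the article; the values are an assumption.
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"
    DECODE_CRITICAL = "decode_critical"


@dataclass
class IntentSketch:
    """Hypothetical, reduced stand-in for MemoryIntent."""
    object_id: str
    priority: Priority
    pin_requested: bool = False

    def __post_init__(self) -> None:
        # DECODE_CRITICAL blocks are pinned regardless of what the caller passed.
        if self.priority is Priority.DECODE_CRITICAL:
            self.pin_requested = True


def eviction_candidates(blocks):
    """Prefer unpinned blocks; fall back to all blocks only if everything is pinned."""
    unpinned = [b for b in blocks if not b.pin_requested]
    return unpinned or list(blocks)


critical = IntentSketch("req-007:block:14", Priority.DECODE_CRITICAL)
warm = IntentSketch("req-007:block:2", Priority.WARM)
print([b.object_id for b in eviction_candidates([critical, warm])])
```

With one pinned and one unpinned block, only the unpinned block is offered to the policy. If every block were pinned, the helper would return the full list, which matches the "no other choice" case.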
Figure: MemoryIntentEvent types and the transitions they drive in KVMemorySimulator
Five policies, one ladder
The codebase implements an ascending ladder of eviction intelligence. Each policy is a subclass of
PlacementPolicy, overriding choose_victim() and optionally
should_prefetch() and explain_victim_choice().
The ladder is designed to be pedagogically clear: each rung adds exactly one new dimension of information.
| Policy | Access history | Predicted hotness | Priority / phase | Deadline / slack | Prefetch |
|---|---|---|---|---|---|
| LRU | ✓ | - | - | - | - |
| HotCold | ✓ | ✓ | - | - | - |
| PredictiveHotness | ✓ | ✓ + reuse window | - | - | - |
| IntentAware | ✓ | partial | ✓ | limited | - |
| DeadlineAware ★ | ✓ | partial | ✓ | ✓✓ | ✓ |
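To make the subclassing pattern concrete, here is a minimal sketch of the bottom rung of the ladder. The method names `choose_victim()`, `should_prefetch()`, and `explain_victim_choice()` come from the article; their signatures, the `BlockView` record, and the class bodies are assumptions and may differ from the project's code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class BlockView:
    """Hypothetical minimal view of a resident KV block."""
    object_id: str
    last_access_step: int


class PlacementPolicy(ABC):
    """Method names follow the article; signatures are assumptions."""

    @abstractmethod
    def choose_victim(self, resident: list[BlockView]) -> BlockView:
        ...

    def should_prefetch(self, block: BlockView) -> bool:
        # Default: policies without deadline information never prefetch.
        return False

    def explain_victim_choice(self, victim: BlockView) -> str:
        return f"{type(self).__name__} evicted {victim.object_id}"


class LRUPolicy(PlacementPolicy):
    def choose_victim(self, resident: list[BlockView]) -> BlockView:
        # Only access history is consulted: the oldest access loses.
        return min(resident, key=lambda b: b.last_access_step)

    def explain_victim_choice(self, victim: BlockView) -> str:
        return f"LRU: {victim.object_id} last accessed at step {victim.last_access_step}"


resident = [BlockView("req-1:block:0", 3), BlockView("req-2:block:5", 9)]
policy = LRUPolicy()
victim = policy.choose_victim(resident)
print(policy.explain_victim_choice(victim))
```

Each higher rung would override `choose_victim()` with a key that reads one more field, which is what keeps the policies directly comparable.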
DeadlineAware scoring — how it works
DeadlineAwarePolicy extends IntentAwarePolicy and computes a floating-point
eviction score for each candidate block. Higher score means "more evictable." Key contributions:
    # Deadline nearness: the closer to missing, the more we protect
    if block.deadline_us is not None:
        if block.deadline_us <= 1_000:
            score -= 320.0   # within 1 ms: heavily shield
        elif block.deadline_us <= 5_000:
            score -= 140.0   # within 5 ms: strong shield
        else:
            score -= 25.0    # more distant: mild shield

    # Slack adds further protection when available
    if block.slack_us is not None:
        if block.slack_us <= 500:
            score -= 260.0
        elif block.slack_us <= 2_000:
            score -= 120.0

    # Phase and priority
    if block.phase == Phase.DECODE: score -= 110.0
    if block.priority == Priority.DECODE_CRITICAL: score -= 400.0

    # Cold DONE blocks are preferred victims
    if block.phase == Phase.DONE: score += 40.0
    if block.priority == Priority.COLD and block.deadline_us is None: score += 35.0
The explain_victim_choice() method on every policy produces a human-readable string
logging why a specific block was chosen for eviction — enabling offline audit of every decision
the simulator made, down to the individual step.
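The scoring rules and the explanation string can be combined in one self-contained sketch. The weights are the ones quoted in the article; the `Block` record, the starting score of 0.0, the reduced `Phase` enum, and the function `eviction_score` are assumptions made for illustration, not the project's implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Phase(Enum):
    PREFILL = auto()
    DECODE = auto()
    DONE = auto()


class Priority(Enum):
    COLD = auto()
    WARM = auto()
    HOT = auto()
    DECODE_CRITICAL = auto()


@dataclass
class Block:
    """Hypothetical reduced block record."""
    object_id: str
    phase: Phase
    priority: Priority
    deadline_us: Optional[int] = None
    slack_us: Optional[int] = None


def eviction_score(block: Block) -> tuple[float, list[str]]:
    """Return (score, reasons); a higher score means more evictable."""
    score = 0.0  # assumed base score
    reasons: list[str] = []

    def adjust(delta: float, why: str) -> None:
        nonlocal score
        score += delta
        reasons.append(f"{why} ({delta:+.0f})")

    if block.deadline_us is not None:
        if block.deadline_us <= 1_000:
            adjust(-320.0, "deadline within 1 ms")
        elif block.deadline_us <= 5_000:
            adjust(-140.0, "deadline within 5 ms")
        else:
            adjust(-25.0, "distant deadline")
    if block.slack_us is not None:
        if block.slack_us <= 500:
            adjust(-260.0, "slack within 500 us")
        elif block.slack_us <= 2_000:
            adjust(-120.0, "slack within 2 ms")
    if block.phase is Phase.DECODE:
        adjust(-110.0, "decode phase")
    if block.priority is Priority.DECODE_CRITICAL:
        adjust(-400.0, "decode-critical")
    if block.phase is Phase.DONE:
        adjust(40.0, "request done")
    if block.priority is Priority.COLD and block.deadline_us is None:
        adjust(35.0, "cold, no deadline")
    return score, reasons


hot = Block("req-1:block:9", Phase.DECODE, Priority.DECODE_CRITICAL, deadline_us=800, slack_us=300)
cold = Block("req-2:block:0", Phase.DONE, Priority.COLD)
victim = max([hot, cold], key=lambda b: eviction_score(b)[0])
print(victim.object_id, "; ".join(eviction_score(victim)[1]))
```

Recording each contribution alongside the running total is one simple way to get an auditable explanation for free: the string is assembled from the same branches that produced the score, so the two cannot drift apart.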
Figure: Relative p99 latency by policy under the deadline_pressure workload profile (simulated)
Architecture: observability-first
One of the project's sharpest design choices is the separation between observation and actuation. Phase one — where the project sits now — is purely observational. The serving engine doesn't change. No vLLM patches. No kernel modifications. Just trace ingestion and offline policy replay.
Figure: Full system architecture of kv_deadline_scheduler
Intent is emitted per KV block lifecycle, not per page fault. This keeps the overhead low and the design portable across HBM, DRAM, CXL, and NVMe backends without any hot-path callbacks.
External trace import — no serving engine access needed
If you have request logs from any OpenAI-compatible serving proxy, the
trace_importer.py adapter converts them into approximate KV block lifecycle
events, which can then be replayed through the policies:
    # Import real request logs and replay them
    kvmi import-request-trace \
      --requests examples/sample_request_trace.jsonl \
      --model llama-3-8b \
      --out imported_trace.jsonl \
      --logical-block-mb 1

    kvmi compare \
      --trace imported_trace.jsonl \
      --hbm-mb 4096 \
      --dram-mb 65536
The importer estimates KV block footprint from token counts and model configuration
(ModelKVConfig in kv_estimator.py), reconstructs approximate
PREFILL → DECODE → FREED lifecycle events, and replays them through all five policies
simultaneously, emitting a comparison table of p50/p95/p99 and miss counts.
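The footprint estimate behind that conversion is simple arithmetic, sketched below. `KVConfigSketch` and `kv_blocks` are hypothetical stand-ins (the real `ModelKVConfig` fields are not shown in the article), and the sketch assumes FP16 KV entries and the published Llama-3-8B shape with grouped-query attention.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class KVConfigSketch:
    """Hypothetical stand-in for ModelKVConfig; field names are assumptions."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # FP16/BF16

    def bytes_per_token(self) -> int:
        # The factor of 2 covers the separate K and V tensors.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.bytes_per_element


def kv_blocks(config: KVConfigSketch, tokens: int, logical_block_mb: int = 1) -> int:
    """Number of logical blocks needed to hold `tokens` tokens of KV state."""
    block_bytes = logical_block_mb * 1024 * 1024
    return math.ceil(tokens * config.bytes_per_token() / block_bytes)


# Published Llama-3-8B shape: 32 layers, 8 KV heads (GQA), head dim 128.
llama3_8b = KVConfigSketch(num_layers=32, num_kv_heads=8, head_dim=128)
per_token = llama3_8b.bytes_per_token()       # 131072 bytes = 128 KiB
blocks = kv_blocks(llama3_8b, tokens=129_000)  # 128k prompt + 1k generated
print(per_token, blocks)
```

Under these assumptions a 1 MB logical block holds exactly 8 tokens of KV state, so block counts scale linearly with sequence length; a quantized KV cache or a model without GQA would change the per-token figure accordingly.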
Five synthetic workload profiles
The generate_synthetic_kv_workload() function in simulator.py produces
realistic-feeling event streams from five named profiles. Each profile tunes the mix of
long-context requests, speculative decode drafts, high-priority SLA-bound requests, and deadline density.
Synthetic workload profile parameter summary
Each profile drives different eviction decisions. In long_context_extreme, most blocks
have no deadline at all (they're deep context that may never be accessed again), making them excellent
eviction candidates even for LRU. In deadline_pressure, 65% of steps involve a block
within 1 ms of its decode deadline. This is exactly where LRU fails catastrophically: it evicts
critical blocks simply because they were last touched a few steps ago, with no view of how soon they are needed again.
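The failure mode is easy to reproduce on a toy example. The sketch below is a deliberate simplification rather than the project's `DeadlineAwarePolicy`: `ToyBlock`, both victim functions, and the three sample blocks are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBlock:
    object_id: str
    last_access_step: int
    deadline_us: Optional[int]  # None means no pending decode deadline


def lru_victim(blocks):
    # Oldest access is evicted, whatever its deadline.
    return min(blocks, key=lambda b: b.last_access_step)


def deadline_aware_victim(blocks):
    # Prefer blocks with no deadline, then the most distant deadline,
    # and only then fall back to age.
    def key(b):
        has_deadline = b.deadline_us is not None
        return (has_deadline, -(b.deadline_us or 0), b.last_access_step)
    return min(blocks, key=key)


blocks = [
    ToyBlock("req-A:block:3", last_access_step=10, deadline_us=800),    # needed within 1 ms
    ToyBlock("req-B:block:0", last_access_step=40, deadline_us=None),   # finished request
    ToyBlock("req-C:block:7", last_access_step=25, deadline_us=20_000),
]
print(lru_victim(blocks).object_id, deadline_aware_victim(blocks).object_id)
```

LRU picks `req-A:block:3` because it has the oldest access, even though it is due in under 1 ms; the deadline-aware key picks the block that has no deadline at all.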
Getting started in 60 seconds
The package is a standard Python 3.11+ project with zero runtime dependencies beyond the stdlib. Install it, run the tests, and immediately compare all five policies on a synthetic trace:
    # Install
    git clone https://github.com/manishklach/kv_deadline_scheduler
    cd kv_deadline_scheduler
    pip install -e .
    pytest

    # Estimate KV footprint for a Llama-3-8B 128k context request
    kvmi estimate-kv \
      --model llama-3-8b \
      --prompt-tokens 128000 \
      --generated-tokens 1000

    # Run all 5 policies against the deadline_pressure synthetic profile
    kvmi compare \
      --profile deadline_pressure \
      --hbm-mb 128 \
      --dram-mb 2048 \
      --requests 16 \
      --decode-steps 256

    # Import real request logs and replay
    kvmi import-request-trace \
      --requests examples/sample_request_trace.jsonl \
      --model llama-3-8b \
      --out imported_trace.jsonl \
      --logical-block-mb 1

    kvmi compare \
      --trace imported_trace.jsonl \
      --hbm-mb 4096 \
      --dram-mb 65536
Decision logs are written as JSONL via KVMemorySimulator.write_decision_log(),
giving you a step-by-step record of every eviction: which block was chosen, why,
whether a DECODE_CRITICAL block was avoided, and what the competing candidates were.
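Because the log is plain JSONL, it can be audited with a few lines of stdlib Python. The sketch below counts how often each policy evicted a decode-critical block. The record keys (`step`, `policy`, `victim`, `victim_priority`) and the two sample records are assumptions for illustration; the actual log schema is not shown in the article and should be checked against a real log file.

```python
import io
import json
from collections import Counter

# Two hypothetical log records; the key names are assumptions, not the project's schema.
sample_log = io.StringIO(
    '{"step": 12, "policy": "LRU", "victim": "req-1:block:4", "victim_priority": "DECODE_CRITICAL"}\n'
    '{"step": 12, "policy": "DeadlineAware", "victim": "req-3:block:0", "victim_priority": "COLD"}\n'
)


def critical_evictions(lines) -> Counter:
    """Count evictions of DECODE_CRITICAL blocks per policy."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("victim_priority") == "DECODE_CRITICAL":
            counts[record["policy"]] += 1
    return counts


counts = critical_evictions(sample_log)
print(dict(counts))
```

To run this against a real log, replace `sample_log` with an open file handle and adjust the keys to match the records the simulator actually writes.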
The road ahead
The project README is admirably honest: current results are simulated and prototype-oriented. The numbers show policy differences, not production speedups. The roadmap has three natural phases:
Figure: Three-phase project roadmap: observe → calibrate → actuate
The key architectural bet is that the same MemoryIntent ABI survives all three phases.
Phase 1 declares intent in offline traces. Phase 2 validates the model against real GPU memory
telemetry. Phase 3 plugs the policy decisions back in — first as soft advisory hints, then as hard
placement decisions. The schema doesn't change; only the enforcement layer is added.
What's strong now
The MemoryIntent schema is production-quality — rich fields, validated invariants, and built-in risk scoring. The policy ladder is pedagogically clean and the decision logs enable full audit. Zero runtime dependencies means easy evaluation.
What's next
Magic numbers in scoring weights need documented rationale. The simulator's recency decay is event-count-based rather than step-based. Calibration against real GPU memory telemetry is the critical gap between prototype and production.
Why it matters
As LLMs push to 1M+ token contexts, KV cache is no longer a background concern — it's the primary bottleneck. A scheduling framework that understands deadlines is the right abstraction at the right time.
Who should try it
Inference platform engineers running vLLM or similar; researchers studying KV reuse and eviction under pressure; anyone who has seen p99 latency spike and suspected a bad eviction decision was to blame.