THE PROBLEM LRU CANNOT SOLVE
Long-context LLM inference is a memory scheduling problem wearing a machine learning costume. A single Llama-3-8B request at 128k tokens occupies 32–64 GB of KV cache — more than most A100s hold in total. When dozens of concurrent requests compete for that HBM, something gets evicted.
The classical answer is LRU. Evict the least recently used block. It's simple, it's fast, and it is completely wrong for this workload.
"A KV-cache block is not a page. It belongs to a specific request, encodes a specific sequence position, and will be needed at a specific decode step — usually within the next 1–5 milliseconds. LRU has no way to know any of this."
The result: LRU evicts a block that is needed in 800µs, incurs a 5ms reload penalty from DRAM, and blows the decode deadline entirely. Multiply this by thousands of requests per second and you have a p99 latency disaster hiding inside what looks like a memory capacity problem.
The bandwidth cliff — a 5ms NVMe reload vs a decode deadline of 1ms is not recoverable
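The failure mode can be made concrete with a toy example. The sketch below assumes a hypothetical `Block` record with a last-access timestamp and a time-until-needed field (neither name comes from the project), picks a victim the way LRU would, and computes how far the reload alone overshoots the decode deadline. The 800µs, 5ms and 1ms figures are the ones quoted in the text.

```python
from dataclasses import dataclass

RELOAD_US = 5_000           # reload penalty quoted in the text
DECODE_DEADLINE_US = 1_000  # decode deadline quoted in the text


@dataclass
class Block:
    # Hypothetical toy record, not the project's schema.
    name: str
    last_access_us: int  # the only signal LRU can see
    needed_in_us: int    # known to the serving engine, invisible to LRU


def lru_victim(blocks: list[Block]) -> Block:
    """Return the block with the oldest access timestamp."""
    return min(blocks, key=lambda b: b.last_access_us)


def reload_overrun_us() -> int:
    """How far the reload alone overshoots the decode deadline."""
    return max(0, RELOAD_US - DECODE_DEADLINE_US)


blocks = [
    Block("req-007:block:14", last_access_us=100, needed_in_us=800),
    Block("req-003:block:2", last_access_us=900, needed_in_us=50_000),
    Block("req-011:block:5", last_access_us=950, needed_in_us=120_000),
]

victim = lru_victim(blocks)
print(f"LRU evicts {victim.name}, needed in {victim.needed_in_us} µs; "
      f"reload overshoots the deadline by {reload_overrun_us()} µs")
```

The block LRU selects is the one with the oldest timestamp, which here is also the one needed soonest. Two other blocks would not have been touched for tens of milliseconds, but recency carries no information about that.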
THE IDEA
The serving engine already knows which blocks have deadlines. It knows which request is in decode phase, which blocks will be needed in the next step, how much it would cost to recompute evicted state, and whether a block is part of speculative decoding that might be discarded entirely.
The memory manager knows none of this. It sees pages. Timestamps. Access counts. It has no idea that the block it just evicted was scheduled to serve a decode step in 400µs.
The fix is to propagate that intent downward — from the serving layer to the memory manager. Not as a vague hint, but as a structured, versioned schema that every component in the stack can read and act on.
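One way to picture "structured and versioned" is a self-describing record that a lower layer can reject when it does not understand the version. The sketch below is an illustration under stated assumptions: the `schema_version` field, the JSON-lines encoding and the `encode_intent` / `decode_intent` helpers are hypothetical and are not taken from the project.

```python
import json

SCHEMA_VERSION = 1  # hypothetical version tag, not from the project


def encode_intent(object_id: str, priority: str, deadline_us: int) -> str:
    """Serialize one intent record as a single JSON line."""
    record = {
        "schema_version": SCHEMA_VERSION,
        "object_id": object_id,
        "priority": priority,
        "deadline_us": deadline_us,
    }
    return json.dumps(record, sort_keys=True)


def decode_intent(line: str) -> dict:
    """Parse a record, refusing versions this reader does not know."""
    record = json.loads(line)
    version = record.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return record


line = encode_intent("req-007:block:14", "DECODE_CRITICAL", 400)
print(decode_intent(line)["priority"])
```

Failing loudly on an unknown version is a deliberate choice in this sketch: a memory manager that silently misreads a deadline field is worse off than one that falls back to its default policy.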
"Generic memory tiering asks: Is this page hot? KV Deadline Scheduler asks: Which block belongs to decode-critical request-state, how close is it to missing its deadline, and what is the cost of evicting it?"
This is kv_deadline_scheduler: a full-stack research system that tests this idea at every layer of the software stack, from Python simulation through Linux kernel RFC patches.
THE SCHEMA
The core contribution is MemoryIntent, a dataclass that every KV block carries through its lifecycle. It turns an anonymous page into a first-class scheduling object.
# Every KV block carries this. Anonymous pages carry nothing.
@dataclass(slots=True)
class MemoryIntent:
    object_id: str          # "req-007:block:14"
    phase: Phase            # PREFILL | DECODE | VERIFY | DONE
    priority: Priority      # COLD | WARM | HOT | DECODE_CRITICAL
    deadline_us: int        # µs until this block is needed
    slack_us: int           # µs before deadline becomes critical
    recompute_cost_us: int  # cost to regenerate if evicted
    pin_requested: bool     # DECODE_CRITICAL auto-pins (invariant)

    def eviction_risk_score(self) -> float:
        # higher = more dangerous to evict
        return self.effective_deadline_score() * self.recompute_cost_us
The __post_init__ invariant is the key safety property: DECODE_CRITICAL blocks automatically set pin_requested=True, making it impossible for any policy to accidentally evict a block that a decode step is actively depending on.
Each lifecycle event is a MemoryIntentEvent fed into the simulator's event loop
THE RESULTS
All five policies were benchmarked across all five workload profiles at 12× HBM pressure (1,536 MB workload vs 128 MB HBM).
The headline numbers, starting with the deadline_pressure profile (★ marks the best policy for each profile):
| Profile | Policy | p99 latency | dc_miss_rate | Evictions |
|---|---|---|---|---|
| deadline_pressure | lru | 5.25ms | 0.04494 | 2,104 |
| deadline_pressure | deadline ★ | 0.25ms | 0.00281 | 2,020 |
| long_context_extreme | lru | 5.25ms | 0.01823 | 3,782 |
| long_context_extreme | intent ★ | 0.25ms | 0.00104 | 3,716 |
| rag_mixed_priority | lru | 5.25ms | 0.03168 | 2,249 |
| rag_mixed_priority | deadline ★ | 0.25ms | 0.00347 | 2,186 |
These are simulation results, not production GPU benchmarks. The mechanism is real; the magnitudes are illustrative. Real HBM numbers require a vLLM integration with live telemetry — which is the next step.
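For readers who want to check the arithmetic, the short script below recomputes the pressure ratio from the sweep parameters quoted in this post and derives improvement factors from the table's values. It uses only numbers that already appear here. The `improvement` helper and the dictionary layout are just scaffolding for the example.

```python
# Sweep parameters quoted in this post.
requests, blocks_per_req, block_size_mb, hbm_mb = 64, 24, 1, 128
workload_mb = requests * blocks_per_req * block_size_mb
pressure = workload_mb / hbm_mb

# (lru, starred policy) pairs copied from the results table.
rows = {
    "deadline_pressure": {"p99_ms": (5.25, 0.25), "dc_miss_rate": (0.04494, 0.00281)},
    "long_context_extreme": {"p99_ms": (5.25, 0.25), "dc_miss_rate": (0.01823, 0.00104)},
    "rag_mixed_priority": {"p99_ms": (5.25, 0.25), "dc_miss_rate": (0.03168, 0.00347)},
}


def improvement(pair: tuple[float, float]) -> float:
    """Ratio of the LRU value to the starred policy's value."""
    baseline, improved = pair
    return baseline / improved


print(f"workload = {workload_mb} MB, pressure = {pressure:.0f}x")
for profile, metrics in rows.items():
    print(profile,
          f"p99 {improvement(metrics['p99_ms']):.1f}x lower,",
          f"dc_miss_rate {improvement(metrics['dc_miss_rate']):.1f}x lower")
```

In simulation, that works out to a 21× lower p99 in all three profiles and roughly 16×, 17.5× and 9× fewer decode-critical misses, while eviction counts barely move. The policies evict about as often as LRU; they evict different blocks.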
LRU, HotCold and PredictiveHotness are blind to decode deadlines — IntentAware and DeadlineAware protect decode-critical blocks
ALL THE WAY DOWN TO THE KERNEL
The simulator validates the idea. The kernel work makes it real. Five distinct kernel experiment tracks, all running on Linux 6.18, producing committed result files.
Real hardware measurements
7,517 → 11,005 MB/s · 2048 → 4 TLB entries per 128k request · (vs 2% sequential)
RFC patch series — mm/vmscan.c
Adds /sys/kernel/debug/mm_intent/{register,dump,clear} — observability only, zero reclaim change
/* Patch 4 — the key addition to shrink_page_list() */
if (memory_intent_reclaim_enabled()) {
        if (memory_intent_is_pinned(page_to_pfn(page)))
                goto keep_locked;   /* never evict DECODE_CRITICAL */
        prio = memory_intent_get_priority(page_to_pfn(page));
        if (prio == KV_PRIORITY_DECODE_CRITICAL && !sc->may_swap)
                goto keep_locked;
}
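The control flow of that hunk is easier to check as a truth table. What follows is a Python model of the C logic, not kernel code: `keep_page` is a hypothetical name, and the numeric priority values are assumed.

```python
from enum import IntEnum


class KvPriority(IntEnum):
    # Numeric values are assumed for illustration.
    COLD = 0
    WARM = 1
    HOT = 2
    DECODE_CRITICAL = 3


def keep_page(enabled: bool, pinned: bool, prio: KvPriority, may_swap: bool) -> bool:
    """Return True in the cases where the C hunk jumps to keep_locked."""
    if not enabled:
        return False  # feature off: fall through to normal reclaim
    if pinned:
        return True   # pinned pages are never reclaimed
    if prio == KvPriority.DECODE_CRITICAL and not may_swap:
        return True   # unpinned but decode-critical, and swap is not allowed
    return False      # everything else takes the normal reclaim path


for pinned in (False, True):
    for may_swap in (False, True):
        kept = keep_page(True, pinned, KvPriority.DECODE_CRITICAL, may_swap)
        print(f"pinned={pinned} may_swap={may_swap} -> keep={kept}")
```

Read this way, the hunk lets an unpinned decode-critical page fall through to normal reclaim when `sc->may_swap` is set. That case should not arise if every DECODE_CRITICAL block is pinned on the schema side, which is presumably why the auto-pin invariant is described as the key safety property.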
THE FULL STACK
End-to-end system — no vLLM changes required to start generating MemoryIntentEvent traces
WHY THIS MATTERS
The KV cache problem scales with context length, and context lengths are getting longer, fast. GPT-4o, Claude, Gemini, and most other major models now support 100k+ tokens. Kimi K2 runs at 1 million. At these scales, KV cache is not a background concern. It is the primary bottleneck.
The existing approaches — PagedAttention, LMCache, DistKV — solve the capacity problem. They make KV cache fit. What they don't solve is the scheduling problem: given that something must be evicted, which block is safe to lose?
That's the white space this project occupies. And the answer requires intent — knowledge that only the serving engine has, propagated downward to where eviction decisions are made.
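The question "which block is safe to lose?" can be sketched as a ranking problem. The example below is a hedged illustration, not the project's policy code: `Candidate`, `risk` and `safest_victim` are hypothetical names, and the scoring formula (recompute cost divided by time to deadline) is an assumed stand-in for whatever the real policies compute.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    # Hypothetical flattened view of one block's intent.
    object_id: str
    deadline_us: int
    recompute_cost_us: int
    pinned: bool


def risk(c: Candidate) -> float:
    # Assumed scoring: nearer deadlines and costlier recomputes are riskier.
    return c.recompute_cost_us / (1.0 + max(c.deadline_us, 0))


def safest_victim(cands: list[Candidate]) -> Optional[Candidate]:
    """Lowest-risk unpinned block, or None if every block is pinned."""
    evictable = [c for c in cands if not c.pinned]
    return min(evictable, key=risk, default=None)


candidates = [
    Candidate("req-007:block:14", deadline_us=400, recompute_cost_us=2_000, pinned=True),
    Candidate("req-003:block:2", deadline_us=50_000, recompute_cost_us=2_000, pinned=False),
    Candidate("req-011:block:5", deadline_us=120_000, recompute_cost_us=500, pinned=False),
]

choice = safest_victim(candidates)
print(choice.object_id if choice else "nothing evictable")
```

Pinned blocks are filtered out before ranking rather than given a very high score, so no weighting mistake can make one the minimum.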
For inference engineers
If your p99 latency spikes under memory pressure and you don't know why, a bad eviction decision is a likely cause. This project is both a diagnostic and a candidate fix.
For systems researchers
The MemoryIntent schema, kernel module, and RFC patch series are a complete research artifact. Every claim is backed by code and committed result files.
For OS / MM engineers
The 4-patch RFC is formatted for linux-mm submission. The DAMON scheme driver extends DAMON's official extension point. The shrinker module compiles against a 6.x kernel tree.
For the long term
KV cache today. Database buffer pools, ML weight serving, real-time media pipelines tomorrow. Any latency-critical memory workload has this problem. Memory Intent is the generic fix.
GET STARTED
Zero runtime dependencies. Pure Python 3.11+. One command to see the core result:
# Install
git clone https://github.com/manishklach/kv_deadline_scheduler
cd kv_deadline_scheduler && pip install -e .

# Run the full policy sweep: LRU vs DeadlineAware under 12× HBM pressure
python examples/benchmark_sweep.py \
    --seed 42 --requests 64 --blocks-per-req 24 \
    --block-size-mb 1 --hbm-mb 128 --dram-mb 4096

# Compare policies on a real request trace
kvmi import-request-trace --requests examples/sample_request_trace.jsonl \
    --model llama-3-8b --out trace.jsonl
kvmi compare --trace trace.jsonl --hbm-mb 4096