Part 2: KV Cache Is Not Anonymous Memory

THE PROBLEM LRU CANNOT SOLVE

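Long-context LLM inference is a memory scheduling problem wearing a machine learning costume. A single Llama-3-8B request at 128k tokens occupies roughly 16 GB of KV cache in FP16 (32 layers, 8 KV heads, head dimension 128), and about four times that for a model of the same shape without grouped-query attention. A few such requests fill an 80 GB A100 before weights are counted. When dozens of concurrent requests compete for that HBM, something gets evicted.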
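The footprint depends strongly on the attention layout and the element size. A back-of-envelope sketch, assuming Llama-3-8B's published shape (32 layers, head dimension 128, 8 KV heads under grouped-query attention) and 2-byte FP16 elements. The function name is illustrative and not part of the project:

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one request: one K and one V vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token


GIB = 1024 ** 3
tokens = 128 * 1024  # "128k" context

# Grouped-query attention: 8 KV heads shared by 32 query heads.
gqa_bytes = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)
# Same shape without GQA: one KV head per query head.
mha_bytes = kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128)

print(f"8 KV heads (GQA):     {gqa_bytes / GIB:.0f} GiB")
print(f"32 KV heads (no GQA): {mha_bytes / GIB:.0f} GiB")
```

Quantizing the cache to 8-bit elements would halve both figures; the eviction problem remains once enough requests are in flight.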

The classical answer is LRU. Evict the least recently used block. It's simple, it's fast, and it is completely wrong for this workload.

"A KV-cache block is not a page. It belongs to a specific request, encodes a specific sequence position, and will be needed at a specific decode step — usually within the next 1–5 milliseconds. LRU has no way to know any of this."

The result: LRU evicts a block that is needed in 800µs, incurs a 5ms reload penalty from DRAM, and blows the decode deadline entirely. Multiply this by thousands of requests per second and you have a p99 latency disaster hiding inside what looks like a memory capacity problem.
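A toy sketch of that failure, using the 800µs deadline and 5ms reload from the paragraph above. The block ids and the other two deadlines are invented for illustration. It assumes the block needed soonest is also the one touched longest ago, which can happen when a request has been waiting its turn in a round-robin batch:

```python
from collections import OrderedDict

RELOAD_US = 5_000  # reload penalty quoted in the text

# Block id -> microseconds until a decode step needs it (illustrative values).
needed_in_us = {
    "req-1:block:0": 800,
    "req-2:block:0": 50_000,
    "req-3:block:0": 200_000,
}

# Recency order, oldest access first. LRU only ever sees this ordering.
lru_order = OrderedDict.fromkeys(
    ["req-1:block:0", "req-3:block:0", "req-2:block:0"]
)

# LRU evicts the least recently used entry, regardless of when it is needed.
victim, _ = lru_order.popitem(last=False)

# The reload starts when the decode step asks for the block, so the step
# finishes late by whatever part of the reload does not fit before the deadline.
lateness_us = max(0, RELOAD_US - needed_in_us[victim])

print(f"evicted {victim}, decode step late by {lateness_us} µs")
```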

The bandwidth cliff: a 5ms reload against a 1ms decode deadline is not recoverable

THE IDEA

The serving engine already knows which blocks have deadlines. It knows which request is in decode phase, which blocks will be needed in the next step, how much it would cost to recompute evicted state, and whether a block is part of speculative decoding that might be discarded entirely.

The memory manager knows none of this. It sees pages. Timestamps. Access counts. It has no idea that the block it just evicted was scheduled to serve a decode step in 400µs.

The fix is to propagate that intent downward — from the serving layer to the memory manager. Not as a vague hint, but as a structured, versioned schema that every component in the stack can read and act on.
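Once deadlines reach the eviction code, victim selection can use them. The sketch below contrasts an LRU choice with a simple "furthest deadline first" rule. This is a hypothetical heuristic written for illustration, not the project's DeadlineAware policy, and the `Block` type and its field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Block:
    block_id: str
    last_access_us: int      # all that LRU can see
    deadline_us: int         # known only to the serving engine
    recompute_cost_us: int
    pinned: bool = False


def lru_victim(blocks: Sequence[Block]) -> Block:
    """Classic LRU: evict the block with the oldest access time."""
    return min(blocks, key=lambda b: b.last_access_us)


def deadline_aware_victim(blocks: Sequence[Block]) -> Optional[Block]:
    """Evict the unpinned block needed furthest in the future.

    Ties are broken in favor of the block that is cheapest to recompute.
    Returns None when every block is pinned.
    """
    candidates = [b for b in blocks if not b.pinned]
    if not candidates:
        return None
    return max(candidates, key=lambda b: (b.deadline_us, -b.recompute_cost_us))


blocks = [
    Block("req-a:block:3", last_access_us=100, deadline_us=400,
          recompute_cost_us=9_000, pinned=True),
    Block("req-b:block:7", last_access_us=900, deadline_us=80_000,
          recompute_cost_us=2_000),
    Block("req-c:block:1", last_access_us=500, deadline_us=80_000,
          recompute_cost_us=6_000),
]

print("LRU evicts:           ", lru_victim(blocks).block_id)
print("Deadline-aware evicts:", deadline_aware_victim(blocks).block_id)
```

LRU picks the block that a decode step needs in 400µs; the deadline-aware rule skips it and evicts a block that is both far from its deadline and cheap to rebuild.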

"Generic memory tiering asks: Is this page hot? KV Deadline Scheduler asks: Which block belongs to decode-critical request-state, how close is it to missing its deadline, and what is the cost of evicting it?"

This is kv_deadline_scheduler: a full-stack research system that tests this idea at every layer of the software stack, from Python simulation through Linux kernel RFC patches.

THE SCHEMA

The core contribution is MemoryIntent, a dataclass that every KV block carries through its lifecycle. It turns an anonymous page into a first-class scheduling object.

# Every KV block carries this. Anonymous pages carry nothing.
# Excerpt only: the Phase and Priority enums, __post_init__, and
# effective_deadline_score() are not shown here.
from dataclasses import dataclass

@dataclass(slots=True)
class MemoryIntent:
    object_id: str            # "req-007:block:14"
    phase: Phase              # PREFILL | DECODE | VERIFY | DONE
    priority: Priority        # COLD | WARM | HOT | DECODE_CRITICAL
    deadline_us: int          # µs until this block is needed
    slack_us: int             # µs before deadline becomes critical
    recompute_cost_us: int    # cost to regenerate if evicted
    pin_requested: bool       # DECODE_CRITICAL auto-pins (invariant)

    def eviction_risk_score(self) -> float:
        # Higher = more dangerous to evict: urgency scaled by recompute cost.
        return self.effective_deadline_score() * self.recompute_cost_us

The __post_init__ invariant is the key safety property: DECODE_CRITICAL blocks automatically set pin_requested=True, so a policy that honors pins cannot accidentally evict a block that a decode step is actively depending on.

Each lifecycle event is a MemoryIntentEvent fed into the simulator's event loop

THE RESULTS

All five policies were benchmarked across all five workload profiles at 12× HBM pressure (1,536 MB workload vs 128 MB HBM). The headline number from the deadline_pressure profile:

LRU decode-critical miss rate: 4.5%

DeadlineAware decode-critical miss rate: 0.28%

p99 latency improvement: 21× (5.25ms → 0.25ms)

Profile	Policy	p99 latency	dc_miss_rate	Evictions
deadline_pressure	lru	5.25ms	0.04494	2,104
deadline_pressure	deadline ★	0.25ms	0.00281	2,020
long_context_extreme	lru	5.25ms	0.01823	3,782
long_context_extreme	intent ★	0.25ms	0.00104	3,716
rag_mixed_priority	lru	5.25ms	0.03168	2,249
rag_mixed_priority	deadline ★	0.25ms	0.00347	2,186

These are simulation results, not production GPU benchmarks. The mechanism is real; the magnitudes are illustrative. Real HBM numbers require a vLLM integration with live telemetry — which is the next step.
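The headline figures can be re-derived from the table and from the benchmark parameters (64 requests, 24 blocks per request, 1 MB blocks, 128 MB HBM). A short arithmetic check, with variable names chosen only for this sketch:

```python
# Workload size and HBM pressure from the benchmark parameters.
requests, blocks_per_req, block_mb, hbm_mb = 64, 24, 1, 128
workload_mb = requests * blocks_per_req * block_mb
pressure = workload_mb / hbm_mb

# deadline_pressure rows from the results table.
lru = {"p99_ms": 5.25, "dc_miss_rate": 0.04494}
deadline = {"p99_ms": 0.25, "dc_miss_rate": 0.00281}

p99_gain = lru["p99_ms"] / deadline["p99_ms"]
miss_gain = lru["dc_miss_rate"] / deadline["dc_miss_rate"]

print(f"workload: {workload_mb} MB, pressure: {pressure:.0f}x")
print(f"p99 improvement: {p99_gain:.0f}x")
print(f"decode-critical miss rate reduction: {miss_gain:.1f}x")
```

The eviction counts barely move between policies (2,104 vs 2,020), so the gain comes from which blocks are evicted rather than from evicting fewer of them.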

LRU, HotCold and PredictiveHotness are blind to decode deadlines — IntentAware and DeadlineAware protect decode-critical blocks

ALL THE WAY DOWN TO THE KERNEL

The simulator validates the idea. The kernel work makes it real: six distinct kernel experiment tracks (A through F), all running on Linux 6.18 and producing committed result files.

Real hardware measurements

Cache miss rate, KV-random access pattern: 47% (vs 2% sequential)

Sequential throughput gain with THP: 46% (7,517 → 11,005 MB/s)

TLB pressure reduction with 2MB huge pages: 512× (2048 → 4 TLB entries per 128k request)

Track A

DAMON Hotness Monitor

Live KV hotness via DAMON sysfs. HOT regions → nr_accesses≥10. COLD regions → nr_accesses=0. Feeds directly into kvmi compare --trace.
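The two thresholds above translate into a small classifier. This sketch takes access counts that have already been read from DAMON; it does not touch sysfs. The "WARM" label for the band between the two thresholds is an assumption, since the text names only HOT and COLD:

```python
def classify_region(nr_accesses: int) -> str:
    """Map a DAMON access count to a hotness label using the stated thresholds."""
    if nr_accesses < 0:
        raise ValueError("nr_accesses cannot be negative")
    if nr_accesses >= 10:
        return "HOT"
    if nr_accesses == 0:
        return "COLD"
    return "WARM"  # ASSUMPTION: the text does not name this middle band


# (start address, end address, nr_accesses) with made-up sample values.
regions = [
    (0x7F00_0000_0000, 0x7F00_0020_0000, 14),
    (0x7F00_0020_0000, 0x7F00_0040_0000, 3),
    (0x7F00_0040_0000, 0x7F00_0060_0000, 0),
]

labels = [classify_region(n) for _, _, n in regions]
print(labels)
```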

Track B

userfaultfd Migration

HBM→DRAM migration emulation via uffd. Fault handler measures p50/p95/p99 migration latency — directly calibrates miss_penalty_us in the simulator.
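Turning raw fault-handler timings into p50/p95/p99 is a small amount of code. A sketch using the nearest-rank method; the sample latencies are invented, and which percentile should feed miss_penalty_us is a modeling choice this example does not settle:

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up migration latencies in microseconds, including two slow outliers.
latencies_us = [120, 135, 140, 150, 155, 160, 180, 220, 900, 5200]

summary = {f"p{p}": percentile(latencies_us, p) for p in (50, 95, 99)}
print(summary)
```

With only ten samples, p95 and p99 both land on the slowest observation, which is a reminder that tail percentiles need many more measurements to be meaningful.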

Track C

io_uring Async Prefetch

Raw io_uring syscall (no liburing). Submission queue, completion queue, SQE array — all via ctypes. Async KV block prefetch with p50/p95/p99 tracking.

Track D

THP Huge-Page Allocation

madvise(MADV_HUGEPAGE) on KV regions. 2MB pages: 11,005 MB/s seq, 13.27 MPPS random. 4KB pages: 7,517 MB/s, 8.22 MPPS. Confirmed on Linux 6.18 WSL.
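A minimal sketch of the same request from Python, using the standard library's mmap module on Linux. The 8 MiB region size is the one implied by the 2048 → 4 TLB-entry figure; the helper names are illustrative. The kernel may still decline to back the region with huge pages, so the call is treated as a hint:

```python
import mmap

KIB = 1024
MIB = 1024 * KIB


def tlb_entries(region_bytes: int, page_bytes: int) -> int:
    """Pages (and so TLB entries) needed to cover a region, rounded up."""
    return -(-region_bytes // page_bytes)


def alloc_kv_region(size_bytes: int) -> mmap.mmap:
    """Create an anonymous private mapping and ask for transparent huge pages."""
    region = mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # only defined on Linux
    if advice is not None:
        try:
            region.madvise(advice)
        except OSError:
            pass  # THP disabled or unsupported: keep the 4 KiB mapping
    return region


size = 8 * MIB
print("4 KiB pages:", tlb_entries(size, 4 * KIB))
print("2 MiB pages:", tlb_entries(size, 2 * MIB))

region = alloc_kv_region(size)
region[:4] = b"\x01\x02\x03\x04"  # touch the mapping so pages are faulted in
region.close()
```

Huge pages are only used for 2 MiB-aligned spans, and mmap does not guarantee that alignment, so a real allocator would typically over-allocate and align the start address.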

Track E

perf_event_open Counters

Hardware cache counters via perf_event_open. KV-random access: 47.4% miss rate. Evicted KV: 42.0% miss rate. Validates simulator's miss penalty model.

Track F

ioprio Linux I/O Priority

ioprio_set via syscall 251 (the x86-64 number). Decode-critical blocks → IOPRIO_CLASS_RT. Background spill → IOPRIO_CLASS_BE. Max latency 15.5% lower under separation.
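A sketch of that call through ctypes. The constants follow the kernel's ioprio encoding (class in the bits above a 13-bit shift, level 0 to 7 below it). The wrapper function is hypothetical, works only on x86-64 Linux, and returns False rather than raising when the kernel refuses, as it will for the RT class without the required privilege:

```python
import ctypes
import platform

IOPRIO_CLASS_SHIFT = 13
IOPRIO_CLASS_RT = 1       # real-time class: requires privilege
IOPRIO_CLASS_BE = 2       # best-effort class: the default
IOPRIO_WHO_PROCESS = 1
SYS_IOPRIO_SET_X86_64 = 251


def ioprio_value(io_class: int, level: int) -> int:
    """Pack an I/O class and a priority level (0 = highest, 7 = lowest)."""
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return (io_class << IOPRIO_CLASS_SHIFT) | level


def set_io_priority(io_class: int, level: int, pid: int = 0) -> bool:
    """Set the I/O priority of a process (0 = caller). True on success."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return False  # the syscall number differs on other architectures
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.syscall(
        ctypes.c_long(SYS_IOPRIO_SET_X86_64),
        ctypes.c_long(IOPRIO_WHO_PROCESS),
        ctypes.c_long(pid),
        ctypes.c_long(ioprio_value(io_class, level)),
    )
    return rc == 0


print(hex(ioprio_value(IOPRIO_CLASS_RT, 0)))  # decode-critical I/O
print(hex(ioprio_value(IOPRIO_CLASS_BE, 4)))  # background spill
```

A thread that loads decode-critical blocks would call `set_io_priority(IOPRIO_CLASS_RT, 0)` before issuing reads, while the spill thread stays in the best-effort class.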

RFC patch series — mm/vmscan.c

Patch 1: mm: add experimental memory intent debugfs registry
mm/memory_intent.c · include/linux/memory_intent.h · mm/Kconfig
Adds /sys/kernel/debug/mm_intent/{register,dump,clear}. Observability only, zero reclaim change.

Patch 2: mm: proc: expose memory intent for monitored regions
/proc observability for registered intent regions · research-only ABI

Patch 3: mm: damon: report memory intent alongside access frequency
DAMON integration: correlates nr_accesses with registered intent priority · TRACE_EVENT tracepoint

Patch 4: mm: vmscan: experimental intent-aware reclaim, default off
DEFINE_STATIC_KEY_FALSE: zero overhead when disabled · consults intent before eviction · never evicts DECODE_CRITICAL+pinned blocks

/* Patch 4: the key addition to shrink_page_list(); recent 6.x kernels name it shrink_folio_list() */
if (memory_intent_reclaim_enabled()) {
    if (memory_intent_is_pinned(page_to_pfn(page)))
        goto keep_locked;  /* never evict DECODE_CRITICAL */

    prio = memory_intent_get_priority(page_to_pfn(page));
    if (prio == KV_PRIORITY_DECODE_CRITICAL && !sc->may_swap)
        goto keep_locked;
}
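The guard above reduces to a small decision function. The Python model below mirrors its three branches so the behavior can be checked outside the kernel. The enum values and the function name are assumptions, and a True result only means the intent check does not veto the page, since the rest of reclaim still applies its usual tests:

```python
from enum import IntEnum


class KvPriority(IntEnum):
    COLD = 0
    WARM = 1
    HOT = 2
    DECODE_CRITICAL = 3


def intent_allows_reclaim(enabled: bool, pinned: bool,
                          priority: KvPriority, may_swap: bool) -> bool:
    """Model of the intent check: False corresponds to `goto keep_locked`."""
    if not enabled:
        return True   # static key off: reclaim behaves exactly as before
    if pinned:
        return False  # pinned pages are never reclaimed
    if priority == KvPriority.DECODE_CRITICAL and not may_swap:
        return False  # unpinned but critical, and swap is not permitted
    return True


cases = [
    (False, True, KvPriority.DECODE_CRITICAL, False),
    (True, True, KvPriority.DECODE_CRITICAL, True),
    (True, False, KvPriority.DECODE_CRITICAL, False),
    (True, False, KvPriority.DECODE_CRITICAL, True),
    (True, False, KvPriority.COLD, False),
]
results = [intent_allows_reclaim(*case) for case in cases]
print(results)
```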

THE FULL STACK

End-to-end system — no vLLM changes required to start generating MemoryIntentEvent traces

WHY THIS MATTERS

The KV cache problem scales with context length, and context lengths are getting longer, fast. GPT-4o, Claude, Gemini, and every other major model now support 100k+ tokens. Kimi K2 runs at 1 million. At these scales, KV cache is not a background concern. It is the primary bottleneck.

The existing approaches — PagedAttention, LMCache, DistKV — solve the capacity problem. They make KV cache fit. What they don't solve is the scheduling problem: given that something must be evicted, which block is safe to lose?

That's the white space this project occupies. And the answer requires intent — knowledge that only the serving engine has, propagated downward to where eviction decisions are made.

For inference engineers

If your p99 latency spikes under memory pressure and you don't know why, a bad eviction decision is a likely cause. This project offers both a diagnostic and a fix.

For systems researchers

The MemoryIntent schema, kernel module, and RFC patch series are a complete research artifact. Every claim is backed by code and committed result files.

For OS / MM engineers

The 4-patch RFC is formatted for linux-mm submission. The DAMON scheme driver extends DAMON's official extension point. The shrinker module compiles against a 6.x kernel tree.

For the long term

KV cache today. Database buffer pools, ML weight serving, real-time media pipelines tomorrow. Any latency-critical memory workload has this problem. Memory Intent is the generic fix.

GET STARTED

Zero runtime dependencies. Pure Python 3.11+. One command to see the core result:

# Install
git clone https://github.com/manishklach/kv_deadline_scheduler
cd kv_deadline_scheduler && pip install -e .

# Run the full policy sweep — see LRU vs DeadlineAware under 12× HBM pressure
python examples/benchmark_sweep.py \
  --seed 42 --requests 64 --blocks-per-req 24 \
  --block-size-mb 1 --hbm-mb 128 --dram-mb 4096

# Compare policies on a real request trace
kvmi import-request-trace --requests examples/sample_request_trace.jsonl \
  --model llama-3-8b --out trace.jsonl
kvmi compare --trace trace.jsonl --hbm-mb 4096


github.com/manishklach/kv_deadline_scheduler

MIT · Python 3.11+ · Linux 6.x · v0.6.0 · 53 tests passing
