MANISH AI

Part 2 · Systems Research · LLM Inference · Linux Kernel

PART 2: KV CACHE IS NOT ANONYMOUS MEMORY

LRU doesn't know which tensor is about to miss a decode deadline. A new approach — from Python policy simulator to Linux kernel RFC patches — that does.

21× p99 improvement
0.28% decode-critical miss rate (DeadlineAware, deadline_pressure profile)
4 kernel RFC patches
53 tests, all green

THE PROBLEM LRU CANNOT SOLVE

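Long-context LLM inference is a memory scheduling problem wearing a machine learning costume. A single Llama-3-8B request at 128k tokens occupies about 16 GB of KV cache in FP16 (32 layers × 8 KV heads × 128 dimensions × 2 tensors × 2 bytes ≈ 128 KB per token); a same-sized model without grouped-query attention would need 64 GB, more than a 40 GB A100 holds in total. When dozens of concurrent requests compete for that HBM, something gets evicted.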
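The footprint arithmetic can be checked with a few lines of Python. This is a minimal sketch, not the project's kv_estimator.py: `KVShape` and both helper functions are hypothetical names, and the shape values (32 layers, 8 KV heads, head dimension 128, FP16) are the commonly published Llama-3-8B configuration, assumed here rather than taken from the repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVShape:
    """Hypothetical container for the attention shape of a model."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # FP16 / BF16

def kv_bytes_per_token(shape: KVShape) -> int:
    # Factor 2 covers the separate key and value tensors.
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * shape.bytes_per_elem

def kv_bytes(shape: KVShape, tokens: int, batch: int = 1) -> int:
    return kv_bytes_per_token(shape) * tokens * batch

# Assumed shapes: grouped-query attention (8 KV heads) vs full multi-head (32).
llama3_8b = KVShape(layers=32, kv_heads=8, head_dim=128)
no_gqa = KVShape(layers=32, kv_heads=32, head_dim=128)

tokens = 128 * 1024
print(kv_bytes_per_token(llama3_8b) // 1024, "KiB per token")
print(kv_bytes(llama3_8b, tokens) / 2**30, "GiB with GQA")
print(kv_bytes(no_gqa, tokens) / 2**30, "GiB without GQA")
```

Under these assumptions a single 128k-token request costs 16 GiB with grouped-query attention and 64 GiB without it, so a handful of concurrent long-context requests is enough to exhaust any single accelerator.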

The classical answer is LRU. Evict the least recently used block. It's simple, it's fast, and it is completely wrong for this workload.

"A KV-cache block is not a page. It belongs to a specific request, encodes a specific sequence position, and will be needed at a specific decode step — usually within the next 1–5 milliseconds. LRU has no way to know any of this."

The result: LRU evicts a block that is needed in 800µs, incurs a reload penalty of up to 5ms from a slower tier, and blows the decode deadline entirely. Multiply this by thousands of requests per second and you have a p99 latency disaster hiding inside what looks like a memory capacity problem.

MEMORY BANDWIDTH CLIFF: EVICTION COST RISES EXPONENTIALLY WITH TIER DEPTH
Tier | Bandwidth | Reload latency
HBM | 3.35 TB/s | miss ≈ 0µs
DRAM | 200 GB/s | ~200µs
CXL | 50 GB/s | ~800µs
NVMe | 7 GB/s | ~5000µs

The bandwidth cliff — a 5ms NVMe reload vs a decode deadline of 1ms is not recoverable
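To make the cliff concrete, the sketch below asks which tiers a block could be reloaded from without exceeding a given slack. The latencies are the ones listed above; `TIER_RELOAD_US` and `recoverable_tiers` are hypothetical names introduced only for this illustration.

```python
# Reload latencies in microseconds, copied from the tier figures above.
TIER_RELOAD_US = {"HBM": 0, "DRAM": 200, "CXL": 800, "NVMe": 5000}

def recoverable_tiers(slack_us: int) -> list[str]:
    """Tiers a block can be reloaded from without exceeding slack_us."""
    return [tier for tier, reload_us in TIER_RELOAD_US.items() if reload_us <= slack_us]

print(recoverable_tiers(1000))  # block needed within a 1 ms decode deadline
print(recoverable_tiers(400))   # block needed in 400 µs
```

With a 1 ms deadline, a spill to NVMe is never recoverable, and with 400µs of slack even CXL is too slow. An eviction policy that cannot see the slack cannot make this distinction.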

THE IDEA

The serving engine already knows which blocks have deadlines. It knows which request is in decode phase, which blocks will be needed in the next step, how much it would cost to recompute evicted state, and whether a block is part of speculative decoding that might be discarded entirely.

The memory manager knows none of this. It sees pages. Timestamps. Access counts. It has no idea that the block it just evicted was scheduled to serve a decode step in 400µs.

The fix is to propagate that intent downward — from the serving layer to the memory manager. Not as a vague hint, but as a structured, versioned schema that every component in the stack can read and act on.

"Generic memory tiering asks: Is this page hot? KV Deadline Scheduler asks: Which block belongs to decode-critical request-state, how close is it to missing its deadline, and what is the cost of evicting it?"

This is kv_deadline_scheduler: a full-stack research system that tests this idea at every layer of the software stack, from Python simulation through Linux kernel RFC patches.

THE SCHEMA

The core contribution is MemoryIntent, a dataclass that every KV block carries through its lifecycle. It turns an anonymous page into a first-class scheduling object.

# Every KV block carries this. Anonymous pages carry nothing.
# Phase and Priority are enums with the members listed in the comments below.
from dataclasses import dataclass

@dataclass(slots=True)
class MemoryIntent:
    object_id:      str       # "req-007:block:14"
    phase:          Phase     # PREFILL | DECODE | VERIFY | DONE
    priority:       Priority  # COLD | WARM | HOT | DECODE_CRITICAL
    deadline_us:    int       # µs until this block is needed
    slack_us:       int       # µs before deadline becomes critical
    recompute_cost_us: int    # cost to regenerate if evicted
    pin_requested:  bool      # DECODE_CRITICAL auto-pins (invariant)

    def eviction_risk_score(self) -> float:
        # higher = more dangerous to evict
        # effective_deadline_score() is another MemoryIntent method, omitted from this excerpt
        return self.effective_deadline_score() * self.recompute_cost_us

The __post_init__ invariant is the key safety property: DECODE_CRITICAL blocks automatically set pin_requested=True, so a policy that honors pins cannot accidentally evict a block that a decode step is actively depending on.

KV BLOCK LIFECYCLE: EVENTS DRIVE THE STATE MACHINE
States: ALLOCATED (warm / prefill) · ACCESSED (recency ↑) · DECODE_CRITICAL (pin_requested = true, deadline_us set) · COMMITTED (is_committed = T) · FREED (evicted / done) · SPILLED (→ DRAM) · PREFETCHED (← DRAM)

Each lifecycle event is a MemoryIntentEvent fed into the simulator's event loop
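One way to picture the event loop is a replay function that walks a JSONL trace and rejects illegal state changes. The state names come from the lifecycle figure and the `step` and `event_type` fields from the event schema, but the transition table itself is a guess made for illustration, as is the simplification that each event names the state it moves the block into.

```python
import json

# Assumed transition table: which edges exist is hypothetical.
TRANSITIONS = {
    "ALLOCATED": {"ACCESSED", "DECODE_CRITICAL", "SPILLED", "FREED"},
    "ACCESSED": {"ACCESSED", "DECODE_CRITICAL", "SPILLED", "FREED"},
    "DECODE_CRITICAL": {"COMMITTED", "FREED"},  # pinned: no spill edge
    "COMMITTED": {"ACCESSED", "SPILLED", "FREED"},
    "SPILLED": {"PREFETCHED", "FREED"},
    "PREFETCHED": {"ACCESSED", "DECODE_CRITICAL", "FREED"},
    "FREED": set(),
}

def replay(jsonl_lines):
    """Apply events in order and return the final state of every block."""
    state = {}
    for line in jsonl_lines:
        event = json.loads(line)
        block, target = event["object_id"], event["event_type"]
        current = state.get(block)
        if current is None:
            if target != "ALLOCATED":
                raise ValueError(f"{block}: first event must be ALLOCATED")
        elif target not in TRANSITIONS[current]:
            raise ValueError(f"{block}: illegal {current} -> {target} at step {event['step']}")
        state[block] = target
    return state

trace = [
    '{"step": 0, "event_type": "ALLOCATED", "object_id": "req-007:block:14"}',
    '{"step": 1, "event_type": "SPILLED", "object_id": "req-007:block:14"}',
    '{"step": 2, "event_type": "PREFETCHED", "object_id": "req-007:block:14"}',
    '{"step": 3, "event_type": "DECODE_CRITICAL", "object_id": "req-007:block:14"}',
]
print(replay(trace))
```

Leaving SPILLED out of the DECODE_CRITICAL row is the state-machine form of the pin invariant: a trace that spills a decode-critical block fails validation instead of silently costing a reload.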

THE RESULTS

All five policies were benchmarked across all five workload profiles at 12× HBM pressure (1,536 MB workload vs 128 MB HBM). The headline number from the deadline_pressure profile:

4.5% LRU decode-critical miss rate
0.28% DeadlineAware miss rate
21× p99 latency improvement

Profile | Policy | p99 latency | dc_miss_rate | Evictions
deadline_pressure | lru | 5.25ms | 0.04494 | 2,104
deadline_pressure | deadline ★ | 0.25ms | 0.00281 | 2,020
long_context_extreme | lru | 5.25ms | 0.01823 | 3,782
long_context_extreme | intent ★ | 0.25ms | 0.00104 | 3,716
rag_mixed_priority | lru | 5.25ms | 0.03168 | 2,249
rag_mixed_priority | deadline ★ | 0.25ms | 0.00347 | 2,186

These are simulation results, not production GPU benchmarks. The mechanism is real; the magnitudes are illustrative. Real HBM numbers require a vLLM integration with live telemetry — which is the next step.
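The headline ratios follow directly from the deadline_pressure rows. The short check below recomputes them; the workload size uses the 64 requests × 24 blocks × 1 MB configuration from the benchmark command in the Get Started section.

```python
# Figures copied from the deadline_pressure rows of the results table.
lru_p99_ms, deadline_p99_ms = 5.25, 0.25
lru_miss, deadline_miss = 0.04494, 0.00281

# Workload size implied by the sweep: 64 requests x 24 blocks x 1 MB each.
workload_mb = 64 * 24 * 1
hbm_mb = 128

p99_factor = lru_p99_ms / deadline_p99_ms
miss_factor = round(lru_miss / deadline_miss, 1)
pressure = workload_mb / hbm_mb

print(p99_factor)   # p99 improvement factor
print(miss_factor)  # decode-critical miss-rate reduction factor
print(pressure)     # HBM pressure factor
```

The 21× figure is the p99 latency ratio. The decode-critical miss rate itself drops by about 16×, and 1,536 MB of blocks against 128 MB of HBM gives the 12× pressure.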

DECODE-CRITICAL MISS RATE, deadline_pressure PROFILE (LOWER IS BETTER)
LRU 4.49% · HotCold 4.49% · Predictive 4.49% · IntentAware 0.28% · Deadline ★ 0.28%

LRU, HotCold and PredictiveHotness are blind to decode deadlines — IntentAware and DeadlineAware protect decode-critical blocks
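The gap between the deadline-blind and deadline-aware policies can be reproduced with a toy cache. This is not the project's KVMemorySimulator: `simulate` is a hypothetical function, the trace is synthetic, and the "aware" policy is reduced to one rule, never pick a decode-critical block as the victim while an unpinned one exists.

```python
from collections import OrderedDict

def simulate(trace, capacity, pin_aware):
    """Count misses on decode-critical accesses.

    trace: iterable of (block_id, decode_critical) pairs.
    pin_aware=False is plain LRU; True skips decode-critical blocks
    when choosing a victim.
    """
    resident = OrderedDict()  # block_id -> decode_critical, oldest first
    dc_misses = 0
    for block, critical in trace:
        if block in resident:
            resident.move_to_end(block)
            resident[block] = critical
            continue
        if critical:
            dc_misses += 1
        if len(resident) >= capacity:
            victim = next(
                (b for b, pinned in resident.items() if not (pin_aware and pinned)),
                next(iter(resident)),  # everything pinned: fall back to LRU
            )
            del resident[victim]
        resident[block] = critical
    return dc_misses

# Two decode-critical blocks reused every round, with a prefill scan of
# four fresh blocks in between. HBM holds four blocks.
trace = []
for round_no in range(10):
    trace.append(("decode-0", True))
    trace.append(("decode-1", True))
    trace.extend((f"prefill-{round_no}-{i}", False) for i in range(4))

lru_misses = simulate(trace, capacity=4, pin_aware=False)
aware_misses = simulate(trace, capacity=4, pin_aware=True)
print(lru_misses, aware_misses)
```

Plain LRU misses on every decode access, 20 in total, because each prefill scan pushes both decode blocks out before they are reused. The pin-aware variant only pays the two compulsory first-touch misses. Any recency-based policy fails the same way here, since the scan blocks are always more recent than the decode blocks.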

ALL THE WAY DOWN TO THE KERNEL

The simulator validates the idea. The kernel work makes it real. Six distinct kernel experiment tracks, all running on Linux 6.18, producing committed result files.

Real hardware measurements

47% cache miss rate, KV-random access pattern (vs 2% sequential)
46% sequential throughput gain with THP (7,517 → 11,005 MB/s)
512× TLB pressure reduction with 2MB huge pages (2048 → 4 TLB entries / 128k request)
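The huge-page figures are plain page-count arithmetic, and the allocation hint is available from Python's standard library. In the sketch below, `pages_needed` and `alloc_kv_region` are hypothetical helpers, and the 8 MiB region size is an assumption chosen because it reproduces the 2048 → 4 entry counts. `mmap.MADV_HUGEPAGE` is only defined on Linux builds of Python, so the hint is applied conditionally.

```python
import mmap

KIB, MIB = 1024, 1024 * 1024

def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages, and so TLB entries, needed to map the region."""
    return -(-region_bytes // page_bytes)  # ceiling division

region = 8 * MIB  # assumed region size
small_pages = pages_needed(region, 4 * KIB)
huge_pages = pages_needed(region, 2 * MIB)
print(small_pages, huge_pages, small_pages // huge_pages)

# Throughput gain from the measured sequential numbers above.
gain_pct = round((11005 / 7517 - 1) * 100)
print(gain_pct)

def alloc_kv_region(size: int) -> mmap.mmap:
    """Anonymous mapping with a transparent-huge-page hint where supported."""
    buf = mmap.mmap(-1, size)
    if hasattr(mmap, "MADV_HUGEPAGE"):
        try:
            buf.madvise(mmap.MADV_HUGEPAGE)
        except OSError:
            pass  # THP disabled or unsupported; 4 KB pages still work
    return buf

buf = alloc_kv_region(region)
buf[:4] = b"kv00"  # touch the mapping so the first page is actually faulted in
first_bytes = bytes(buf[:4])
buf.close()
```

The hint is advisory: the kernel may still back the region with 4 KB pages if no contiguous 2 MB range is free, so measured gains depend on fragmentation at allocation time.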
Track A: DAMON Hotness Monitor
Live KV hotness via DAMON sysfs. HOT regions → nr_accesses≥10. COLD regions → nr_accesses=0. Feeds directly into kvmi compare --trace.

Track B: userfaultfd Migration
HBM→DRAM migration emulation via uffd. Fault handler measures p50/p95/p99 migration latency, which directly calibrates miss_penalty_us in the simulator.

Track C: io_uring Async Prefetch
Raw io_uring syscall (no liburing). Submission queue, completion queue, SQE array, all via ctypes. Async KV block prefetch with p50/p95/p99 tracking.

Track D: THP Huge-Page Allocation
madvise(MADV_HUGEPAGE) on KV regions. 2MB pages: 11,005 MB/s seq, 13.27 MPPS random. 4KB pages: 7,517 MB/s, 8.22 MPPS. Confirmed on Linux 6.18 WSL.

Track E: perf_event_open Counters
Hardware cache counters via perf_event_open. KV-random access: 47.4% miss rate. Evicted KV: 42.0% miss rate. Validates simulator's miss penalty model.

Track F: ioprio Linux I/O Priority
ioprio_set via syscall 251. Decode-critical blocks → IOPRIO_CLASS_RT. Background spill → IOPRIO_CLASS_BE. Max latency 15.5% lower under separation.
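The I/O priority separation in Track F can be sketched with ctypes alone. The constants follow the kernel's ioprio interface (class in the upper bits, shifted by 13; classes RT = 1 and BE = 2), and 251 is the x86-64 syscall number, so the call is not portable to other architectures. `ioprio_value` and `set_io_priority` are hypothetical helper names, not functions from the repository.

```python
import ctypes
import os

SYS_IOPRIO_SET = 251   # x86-64 only; other architectures use different numbers
IOPRIO_WHO_PROCESS = 1
IOPRIO_CLASS_RT, IOPRIO_CLASS_BE = 1, 2
IOPRIO_CLASS_SHIFT = 13

def ioprio_value(io_class: int, level: int) -> int:
    """Pack class and level (0 = highest, 7 = lowest) like IOPRIO_PRIO_VALUE."""
    if not 0 <= level <= 7:
        raise ValueError("level must be in 0..7")
    return (io_class << IOPRIO_CLASS_SHIFT) | level

def set_io_priority(io_class: int, level: int, pid: int = 0) -> None:
    """Set the I/O priority of pid (0 = caller). Linux x86-64 only."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.syscall(SYS_IOPRIO_SET, IOPRIO_WHO_PROCESS, pid,
                      ioprio_value(io_class, level))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

rt_top = ioprio_value(IOPRIO_CLASS_RT, 0)
be_low = ioprio_value(IOPRIO_CLASS_BE, 7)
print(rt_top, be_low)
```

A prefetch worker serving decode-critical blocks would call `set_io_priority(IOPRIO_CLASS_RT, 0)` and a background spill worker `set_io_priority(IOPRIO_CLASS_BE, 7)`. The real-time class normally requires elevated privileges, so the first call fails with a permission error for an unprivileged process.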

RFC patch series — mm/vmscan.c

01  mm: add experimental memory intent debugfs registry
    mm/memory_intent.c · include/linux/memory_intent.h · mm/Kconfig
    Adds /sys/kernel/debug/mm_intent/{register,dump,clear}; observability only, zero reclaim change
02  mm: proc: expose memory intent for monitored regions
    /proc observability for registered intent regions · research-only ABI
03  mm: damon: report memory intent alongside access frequency
    DAMON integration: correlates nr_accesses with registered intent priority · TRACE_EVENT tracepoint
04  mm: vmscan: experimental intent-aware reclaim, default off
    DEFINE_STATIC_KEY_FALSE: zero overhead when disabled · consults intent before eviction · never evicts DECODE_CRITICAL+pinned blocks
/* Patch 4: the key addition to shrink_page_list() */
if (memory_intent_reclaim_enabled()) {
    /* Pinned pages are kept unconditionally: never evict DECODE_CRITICAL */
    if (memory_intent_is_pinned(page_to_pfn(page)))
        goto keep_locked;

    /* Unpinned decode-critical pages are kept when this scan may not swap */
    prio = memory_intent_get_priority(page_to_pfn(page));
    if (prio == KV_PRIORITY_DECODE_CRITICAL && !sc->may_swap)
        goto keep_locked;
}
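The decision the excerpt encodes is small enough to model as a pure function, which makes its truth table easy to test outside the kernel. This is a Python model of the C logic above, not code from the patch: `keep_page` and `KVPriority` are hypothetical names and the numeric priority values are assumed.

```python
from enum import IntEnum

class KVPriority(IntEnum):
    COLD = 0
    WARM = 1
    HOT = 2
    DECODE_CRITICAL = 3

def keep_page(enabled: bool, pinned: bool, prio: KVPriority, may_swap: bool) -> bool:
    """True when the excerpt would jump to keep_locked for this page."""
    if not enabled:
        return False  # static key off: reclaim proceeds unchanged
    if pinned:
        return True   # pinned pages are never reclaimed
    return prio == KVPriority.DECODE_CRITICAL and not may_swap

cases = [
    (True, True, KVPriority.COLD, True),
    (True, False, KVPriority.DECODE_CRITICAL, False),
    (True, False, KVPriority.DECODE_CRITICAL, True),
    (False, True, KVPriority.DECODE_CRITICAL, False),
]
results = [keep_page(*case) for case in cases]
print(results)
```

The third case is the one to watch: an unpinned decode-critical page is still reclaimable when the scan is allowed to swap, so protection in that path depends on the pin having been registered.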

THE FULL STACK

FULL SYSTEM ARCHITECTURE
Serving Engine: vLLM · TRT-LLM · SGLang (VLLMIntentAdapter, zero changes)
External Telemetry: OpenAI proxy logs · Prometheus GPU (openai_proxy_adapter · prometheus_adapter)
KV Estimator: token counts → block footprint (ModelKVConfig · kv_estimator.py)
MemoryIntentEvent JSONL: step · event_type · MemoryIntent (schema.py · events.py · trace_importer.py)
Policy Engine: LRU · HotCold · Predictive · Intent · Deadline (policies.py · KVMemorySimulator)
SimulationResult: p50 / p95 / p99 · dc_miss_rate · evictions
Linux Kernel Tracks: DAMON · userfaultfd · perf_event · THP · io_uring · ioprio (kv_intent_shrinker.ko · kv_damon_scheme.c)
RFC: mm/memory_intent.c · mm/vmscan.c (4 patches)

End-to-end system — no vLLM changes required to start generating MemoryIntentEvent traces

WHY THIS MATTERS

The KV cache problem scales with context length, and context lengths are growing fast. GPT-4o, Claude, Gemini, and most other major models now support 100k+ tokens, and some advertise context windows of 1 million. At these scales, KV cache is not a background concern. It is the primary bottleneck.

The existing approaches — PagedAttention, LMCache, DistKV — solve the capacity problem. They make KV cache fit. What they don't solve is the scheduling problem: given that something must be evicted, which block is safe to lose?

That's the white space this project occupies. And the answer requires intent — knowledge that only the serving engine has, propagated downward to where eviction decisions are made.

For inference engineers

If your p99 latency spikes under memory pressure and you don't know why, a bad eviction decision is a likely cause. This project provides both the diagnostic and the fix.

For systems researchers

The MemoryIntent schema, kernel module, and RFC patch series are a complete research artifact. Every claim is backed by code and committed result files.

For OS / MM engineers

The 4-patch RFC is formatted for linux-mm submission. The DAMON scheme driver extends DAMON's official extension point. The shrinker module compiles against a 6.x kernel tree.

For the long term

KV cache today. Database buffer pools, ML weight serving, real-time media pipelines tomorrow. Any latency-critical memory workload has this problem. Memory Intent is the generic fix.

GET STARTED

Zero runtime dependencies. Pure Python 3.11+. One command to see the core result:

# Install
git clone https://github.com/manishklach/kv_deadline_scheduler
cd kv_deadline_scheduler && pip install -e .

# Run the full policy sweep — see LRU vs DeadlineAware under 12× HBM pressure
python examples/benchmark_sweep.py \
  --seed 42 --requests 64 --blocks-per-req 24 \
  --block-size-mb 1 --hbm-mb 128 --dram-mb 4096

# Compare policies on a real request trace
kvmi import-request-trace --requests examples/sample_request_trace.jsonl \
  --model llama-3-8b --out trace.jsonl
kvmi compare --trace trace.jsonl --hbm-mb 4096
github.com/manishklach/kv_deadline_scheduler
MIT · Python 3.11+ · Linux 6.x · v0.6.0 · 53 tests passing