KV-Direct: Why Distributed KV Cache Needs a Linux Fast Path

In this essay

Thesis
What Mooncake gets right
Why generic I/O loses
KV-Direct proposal
Prototype plan
What to measure

Distributed KV cache changes the problem shape. Once the industry stops treating KV as an ephemeral byproduct of one GPU server and starts treating it as a reusable cluster object, the next constraint is the operating system path that has to fetch, batch, steer, and land that object.

That is why this is now a Linux essay as much as a serving essay. In agentic inference, the expensive thing is often not recomputing one prompt once. It is repeatedly restoring long, fragmented prefix state under latency pressure while the model is waiting for the next useful block to become visible in the right destination arena.

Core claim: distributed KV cache solves placement, naming, and reuse. The next material win is a kernel fast path that treats KV restore as a first-class inference data movement problem rather than a pile of unrelated file reads and transport completions.

That distinction matters because decode does not care that an SSD completed a request in isolation. Decode cares that the useful KV range became available, on time, on the right CPU or GPU path, without extra page-cache churn, registration churn, or scheduler noise.

1. What Mooncake Store gets right

Mooncake Store is directionally right because it acknowledges a fact that local KV caches do a poor job of hiding: long-context systems are no longer dominated by one request staying attached to one worker. Prefill can happen on one machine, decode on another, and future turns may revisit the same system prompt, tool trace, or retrieved corpus over and over.

Reuse

KV becomes an addressable object

Instead of assuming every worker has to reconstruct the same prefix, Mooncake identifies blocks by content-derived identity and makes them discoverable across nodes.
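To make "content-derived identity" concrete, here is a minimal sketch of one way to derive block IDs from token content. It is not Mooncake's actual scheme: the FNV-1a hash, the tiny block size, and the names kv_prefix_block_ids and KV_BLOCK_TOKENS are all assumptions for illustration. Each block's ID folds in the previous block's ID, so an ID names the whole prefix up to and including that block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KV_BLOCK_TOKENS 4               /* illustrative; real blocks are larger */
#define FNV_OFFSET 0xcbf29ce484222325ULL
#define FNV_PRIME  0x100000001b3ULL

/* 64-bit FNV-1a over a byte range, continuing from state h. */
uint64_t fnv1a64(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/*
 * Write one ID per full block of tokens into out[] and return the count.
 * The hash covers the host-endian bytes of the parent ID and the tokens;
 * a real scheme would pin a byte order and likely use a stronger hash.
 */
size_t kv_prefix_block_ids(const uint32_t *tokens, size_t ntokens,
                           uint64_t *out, size_t max_out)
{
    assert(out != NULL || max_out == 0);
    uint64_t parent = FNV_OFFSET;
    size_t nblocks = ntokens / KV_BLOCK_TOKENS;
    if (nblocks > max_out)
        nblocks = max_out;
    for (size_t b = 0; b < nblocks; b++) {
        uint64_t h = fnv1a64(FNV_OFFSET, &parent, sizeof(parent));
        h = fnv1a64(h, tokens + b * KV_BLOCK_TOKENS,
                    KV_BLOCK_TOKENS * sizeof(uint32_t));
        out[b] = h;
        parent = h;
    }
    return nblocks;
}
```

Two requests that share a prefix produce the same leading IDs, which is what lets a second worker discover blocks it never computed.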

Movement

Data no longer has to stay local

Once KV state can live in remote DRAM, SSD, or another worker, the serving layer gets a new degree of freedom: fetch versus recompute becomes a runtime choice.

This is a big architectural change. It turns KV cache into part of the memory hierarchy. And once something becomes part of the memory hierarchy, the mundane things matter again: queue steering, completion locality, direct I/O, pinned memory lifetime, and whether the next block lands where the consumer can actually use it.

2. Why generic file and network I/O starts to lose

The first instinct is to say: "We already have read(), mmap(), io_uring, RDMA, and GPUDirect-style paths. Why invent another abstraction?" The answer is that all of those mechanisms are lower-level than the policy problem we now care about.

Generic Linux I/O assumes that if bytes eventually arrive, the contract is mostly satisfied. Distributed KV restore has a stricter contract:

many small or medium KV blocks belong to one logical prefix restore;
some of those blocks are decode-critical while others can lag;
the destination is often a long-lived structured arena, not an arbitrary buffer;
the useful metric is not request completion but decode readiness.
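The difference between request completion and decode readiness fits in a few lines. This is a userspace sketch, assuming a hypothetical KV_F_DECODE_CRITICAL flag value and a hypothetical kv_restore_job tracker: the job counts every block completion but signals readiness only once, when the last decode-critical block lands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KV_F_DECODE_CRITICAL (1u << 0)  /* hypothetical flag value */

/* One logical prefix restore made of many block transfers. */
struct kv_restore_job {
    uint32_t critical_total;  /* blocks decode cannot start without */
    uint32_t critical_done;
    uint32_t total;
    uint32_t done;
};

/* Account for one block that will be part of this restore. */
void kv_job_add(struct kv_restore_job *j, uint32_t flags)
{
    j->total++;
    if (flags & KV_F_DECODE_CRITICAL)
        j->critical_total++;
}

/*
 * Record one block completion. Returns true exactly once: on the
 * completion that makes the job decode-ready. Non-critical blocks
 * never trigger a wakeup on their own.
 */
bool kv_job_complete(struct kv_restore_job *j, uint32_t flags)
{
    assert(j->done < j->total);
    j->done++;
    if (!(flags & KV_F_DECODE_CRITICAL))
        return false;
    assert(j->critical_done < j->critical_total);
    j->critical_done++;
    return j->critical_done == j->critical_total;
}

/* Decode may start even while lagging blocks are still in flight. */
bool kv_job_decode_ready(const struct kv_restore_job *j)
{
    return j->critical_done == j->critical_total;
}
```

A consumer would wake on the single true return from kv_job_complete() instead of on every completion, which is the "enough useful blocks are ready" contract in miniature.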

Figure 1. The hard part is not issuing I/O. It is preserving usefulness across the whole path.

Layer	Generic behavior	Why it wastes time for KV restore
Page cache	Optimizes for broad file reuse and buffered ownership	KV is already a managed cache, so double-caching burns memory and muddies locality
Submission granularity	Many individual reads or messages	Prefix restore becomes descriptor-heavy and completion-heavy before devices saturate
Completion semantics	Request-oriented	The runtime wants "enough useful blocks are ready," not thousands of tiny unrelated done signals
Wakeup policy	Generic scheduler resumption	Wrong-core wakeups and remote memory ownership widen p99 restore latency
Registration	Often rediscovered or rebuilt at flow boundaries	Repeated pin/map work becomes real CPU cost in stable serving pipelines

3. The missing layer is semantic, not just faster transport

It is tempting to say RDMA already solved the movement problem. RDMA solved the transport mechanics beautifully. It did not solve the end-to-end restore contract above transport. That contract includes deadline hints, useful-range tracking, destination arena identity, completion steering, and a way to express that one batch of blocks belongs to one logical decode-critical restore.

Another way to say it: the stack has enough mechanisms and not enough intent. What is missing is a place where the runtime can tell Linux what kind of movement this is.

That is the gap I am calling KV-Direct. Not necessarily as a brand-new monolithic subsystem on day one, but as a kernel-facing interface shape that lets the runtime describe KV movement in a way the stack can optimize coherently.

4. Proposal: KV-Direct as a batched, deadline-aware restore path

KV-Direct is a simple principle: make KV restore look less like thousands of tiny normal I/O operations and more like one inference restore transaction that carries enough structure for Linux to do the right thing. In practice, the runtime should be able to say:

restore this prefix
these blocks belong together
these ranges are decode-critical
this destination is a registered arena
this work can bypass the page cache
this completion should mean useful progress, not just byte arrival

Interface

Batched object movement

Submit one batch describing many KV blocks so the system can co-schedule, co-account, and co-complete them as one restore job.

Memory

Registered KV arenas

Give the kernel stable destination handles for CPU or GPU-visible arenas so repeated restore does not re-discover layout on every turn.
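A registered arena is essentially a long-lived (handle, base, size) record that later transfers refer to by handle. The sketch below models that in userspace only. The table, the handle encoding (slot index plus one), and the function names are assumptions for illustration; a real implementation would also pin memory and deal with GPU mappings, which this does not attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KV_MAX_ARENAS 8

struct kv_arena {
    void    *base;
    uint64_t size;
    bool     in_use;
};

struct kv_arena_table {
    struct kv_arena slot[KV_MAX_ARENAS];
};

/* Register once. Returns a nonzero handle, or 0 on bad input or a full table. */
uint64_t kv_arena_register(struct kv_arena_table *t, void *base, uint64_t size)
{
    assert(t != NULL);
    if (base == NULL || size == 0)
        return 0;
    for (uint64_t i = 0; i < KV_MAX_ARENAS; i++) {
        if (!t->slot[i].in_use) {
            t->slot[i] = (struct kv_arena){ base, size, true };
            return i + 1;
        }
    }
    return 0;
}

/* Resolve (handle, offset, len) to an address, or NULL if it does not fit. */
void *kv_arena_resolve(const struct kv_arena_table *t, uint64_t handle,
                       uint64_t offset, uint32_t len)
{
    assert(t != NULL);
    if (handle == 0 || handle > KV_MAX_ARENAS)
        return NULL;
    const struct kv_arena *a = &t->slot[handle - 1];
    if (!a->in_use || len == 0)
        return NULL;
    /* Written to avoid overflow in offset + len. */
    if (offset > a->size || len > a->size - offset)
        return NULL;
    return (unsigned char *)a->base + offset;
}
```

The point of the handle is that per-restore work shrinks to a bounds check and an add; the expensive discovery happens once at registration.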

Policy

Deadline and usefulness hints

Differentiate decode-critical restore from background spill, speculative prefetch, and recompute-acceptable movement.
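One simple policy these hints enable is class-then-deadline ordering: decode-critical blocks go first, earliest deadline first within a class. The sketch below is one possible ordering, not a claim about what a kernel scheduler should do. The flag values, the class ranking, and the kv_req struct are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag values. */
#define KV_F_PREFETCH        (1u << 0)
#define KV_F_DECODE_CRITICAL (1u << 1)
#define KV_F_SPILL           (1u << 2)
#define KV_F_RECOMPUTE_OK    (1u << 3)

struct kv_req {
    uint32_t flags;
    uint64_t deadline_ns;
    uint16_t block_id;
};

/* Lower rank is issued first. The ranking itself is an assumption. */
static int kv_class_rank(uint32_t flags)
{
    if (flags & KV_F_DECODE_CRITICAL) return 0;
    if (flags & KV_F_RECOMPUTE_OK)    return 1;
    if (flags & KV_F_PREFETCH)        return 2;
    return 3;  /* spill and anything unclassified */
}

/* Order by class, then earliest deadline, then block order. */
static int kv_req_cmp(const void *pa, const void *pb)
{
    const struct kv_req *a = pa, *b = pb;
    int ra = kv_class_rank(a->flags), rb = kv_class_rank(b->flags);
    if (ra != rb)
        return ra < rb ? -1 : 1;
    if (a->deadline_ns != b->deadline_ns)
        return a->deadline_ns < b->deadline_ns ? -1 : 1;
    return (a->block_id > b->block_id) - (a->block_id < b->block_id);
}

/* Sort a batch in place into issue order. */
void kv_order_for_issue(struct kv_req *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);
    qsort(reqs, n, sizeof(*reqs), kv_req_cmp);
}
```

Without the flags, the spill request with the earliest deadline would be indistinguishable from the block decode is stalled on.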

Storage

Page-cache bypass by default

SSD-backed KV should usually be treated as structured backing storage for a cache, not as ordinary file data hoping to benefit from generic buffered reuse.
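Bypassing the page cache is not free: direct I/O generally requires the file offset, the length, and the buffer address to be aligned, typically to the device's logical block size. The helper below is a sketch that widens an arbitrary KV block extent to an aligned read. The names and the struct are assumptions, and the alignment value must come from the actual device.

```c
#include <assert.h>
#include <stdint.h>

/* Aligned read that covers one KV block stored at an arbitrary offset. */
struct kv_dio_extent {
    uint64_t file_off;  /* aligned offset to start reading at */
    uint64_t io_len;    /* aligned number of bytes to read */
    uint64_t skip;      /* bytes between file_off and the block itself */
};

/* `align` must be a power of two, for example the logical block size. */
struct kv_dio_extent kv_dio_align(uint64_t off, uint64_t len, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(len > 0);
    uint64_t start = off & ~(align - 1);
    uint64_t end = (off + len + align - 1) & ~(align - 1);
    struct kv_dio_extent e = { start, end - start, off - start };
    return e;
}
```

Any nonzero skip or excess io_len is read amplification, which argues for laying out SSD-backed KV blocks on aligned boundaries in the first place so that the restore can land directly in the arena.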

The important point is not that Linux needs to learn transformer math. It does not. It only needs enough semantic structure to preserve locality, reduce completion noise, and avoid generic behavior that made sense for files but not for reusable KV state.

5. An interface sketch

This is not a final ABI proposal. It is a shape sketch for what a useful restore descriptor could look like if we wanted to build the simplest possible first-class contract between inference runtimes and Linux.

struct kv_move_desc {
    __u64 kv_hash;        // prefix or block identity
    __u64 src_handle;     // SSD extent, DRAM object, or remote memory handle
    __u64 dst_handle;     // registered GPU/CPU arena
    __u64 dst_offset;     // offset within the destination arena
    __u32 len;            // bytes in this KV block
    __u16 layer_id;       // optional layout hint
    __u16 block_id;       // logical ordering within the prefix
    __u32 flags;          // PREFETCH, DECODE_CRITICAL, SPILL, RECOMPUTE_OK, BYPASS_PAGE_CACHE
    __u64 deadline_ns;    // when decode wants it
    __u32 locality_hint;  // target NUMA or completion domain
    __u32 batch_id;       // group completion / accounting
};

Then the runtime submits many of them together, ideally through an io_uring-compatible model so the first prototype can reuse existing infrastructure rather than waiting for an all-new subsystem.

for (int i = 0; i < nblocks; i++) {
    batch[i].kv_hash       = prefix_blocks[i].hash;
    batch[i].src_handle    = prefix_blocks[i].backing_handle;
    batch[i].dst_handle    = gpu_kv_arena;
    batch[i].dst_offset    = prefix_blocks[i].dst_offset;
    batch[i].len           = prefix_blocks[i].len;
    batch[i].layer_id      = prefix_blocks[i].layer_id;
    batch[i].block_id      = prefix_blocks[i].block_id;
    batch[i].flags         = KV_F_DECODE_CRITICAL | KV_F_BYPASS_PAGE_CACHE;
    batch[i].deadline_ns   = decode_deadline_ns;
    batch[i].locality_hint = target_cpu_or_numa;
    batch[i].batch_id      = prefix_batch_id;
}

submit_kv_batch(ring, batch, nblocks);
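submit_kv_batch() is left undefined here on purpose. One thing a prototype version might do before issuing any I/O is merge descriptors that are adjacent in both source and destination, so that a prefix restore turns into fewer, larger transfers. The sketch below assumes src_handle has already been resolved to a byte offset within one backing file and that the input is sorted by source offset. The kv_extent struct and kv_coalesce name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One transfer after handles have been resolved to plain offsets. */
struct kv_extent {
    uint64_t src_off;  /* offset in the backing file */
    uint64_t dst_off;  /* offset in the destination arena */
    uint64_t len;
};

/*
 * Merge neighbours in place when the next extent starts exactly where
 * the previous one ends on both sides. Returns the new extent count.
 */
size_t kv_coalesce(struct kv_extent *e, size_t n)
{
    if (n == 0)
        return 0;
    assert(e != NULL);
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        struct kv_extent *last = &e[out];
        if (last->src_off + last->len == e[i].src_off &&
            last->dst_off + last->len == e[i].dst_off) {
            last->len += e[i].len;   /* extend the current run */
        } else {
            e[++out] = e[i];         /* start a new run */
        }
    }
    return out + 1;
}
```

The ratio of input descriptors to output extents is a cheap early signal of how much descriptor and completion overhead batching can remove for a given on-disk layout.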

The first implementation could be modest:

linked io_uring operations with registered buffers and direct I/O;
tracepoints that let us observe useful restore rather than just request completion;
completion steering and locality hints layered on top of existing block and scheduler plumbing.

Only later, if the win is real, would it make sense to carve out a true IORING_OP_KV_MOVE-style primitive or device-specific acceleration path.
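If the descriptor sketched above ever did harden into an ABI, its layout would need care. As written, flags ends at byte 44 and deadline_ns wants 8-byte alignment, so most 64-bit ABIs insert four bytes of invisible padding there, while some 32-bit ABIs do not. The userspace mirror below makes that padding explicit with a reserved field and checks the layout at compile time. The _v0 suffix, the reserved field, and kv_desc_check are assumptions, and stdint types stand in for the kernel's __u64 family.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the descriptor sketch with explicit padding. */
struct kv_move_desc_v0 {
    uint64_t kv_hash;
    uint64_t src_handle;
    uint64_t dst_handle;
    uint64_t dst_offset;
    uint32_t len;
    uint16_t layer_id;
    uint16_t block_id;
    uint32_t flags;
    uint32_t reserved;       /* explicit padding, expected to be zero */
    uint64_t deadline_ns;
    uint32_t locality_hint;
    uint32_t batch_id;
};

_Static_assert(sizeof(struct kv_move_desc_v0) == 64,
               "descriptor should be exactly 64 bytes");
_Static_assert(offsetof(struct kv_move_desc_v0, deadline_ns) == 48,
               "deadline_ns should sit at a fixed offset");

/* Basic sanity checks. Returns 0 if the descriptor looks usable. */
int kv_desc_check(const struct kv_move_desc_v0 *d, uint64_t arena_size)
{
    assert(d != NULL);
    if (d->reserved != 0)
        return -1;  /* keeps the field free for future use */
    if (d->len == 0)
        return -2;
    if (d->dst_offset > arena_size || d->len > arena_size - d->dst_offset)
        return -3;  /* block would land outside the arena */
    return 0;
}
```

Rejecting nonzero reserved bits from day one is what lets the field be given a meaning later without breaking older callers.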

6. What a realistic first Linux prototype should include

A lot of good systems ideas die because they jump straight to "new subsystem" before proving the bottleneck. I would start much smaller and much more ruthlessly measurable.

Prototype piece	Minimal version	Why it is enough to learn something
Submission model	Batched descriptors above io_uring	Shows whether descriptor aggregation and grouped completion already cut CPU overhead
Storage mode	Direct I/O into registered host buffers	Separates page-cache effects from genuine restore-path behavior
Completion policy	Consumer-affine CQ processing and wakeups	Tests whether locality is a first-order contributor to p99
Arena model	Long-lived destination handles with fixed layout metadata	Makes registration and placement reuse visible
Observability	Trace from device completion to userspace-ready to decode-start proxy	Lets us measure useful restore instead of admiring raw bandwidth

Failure mode to avoid: proving only that NVMe or RDMA can move bytes quickly in isolation. We already know that. The question is whether the full restore path gets decode-ready sooner with lower CPU cost and tighter tails.

7. What to measure if we actually want to prove the idea

A convincing benchmark would restore synthetic KV objects from NVMe or remote DRAM into a registered arena and compare four modes: normal buffered file I/O, mmap, direct I/O, and a batched KV-Direct-style path.

Metric	Why it matters
p50/p99 useful_restore_ns	Measures when the needed KV range becomes usable, not just when a read completes
cycles_per_gb_restored	Shows whether the host CPU becomes the hidden bottleneck
completion_cpu != consumer_cpu ratio	Captures locality misses that inflate wakeup and cache costs
descriptors_per_prefix	Exposes tiny-I/O amplification and completion noise
page_cache_bytes_polluted	Tests whether the system is pointlessly double-caching backing KV
decode-start proxy or TTFT proxy	Translates kernel wins into something the serving stack actually feels
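For the p50/p99 rows, it helps to fix the percentile definition up front so that runs are comparable. The sketch below uses the nearest-rank method over raw useful_restore_ns samples. The function name is an assumption, and other percentile definitions (interpolated, histogram-based) would give slightly different numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile. Sorts `samples` in place and returns the
 * smallest sample such that at least pct percent of samples are <= it.
 * pct must be in 1..100.
 */
uint64_t kv_percentile_ns(uint64_t *samples, size_t n, unsigned pct)
{
    assert(samples != NULL && n > 0);
    assert(pct > 0 && pct <= 100);
    qsort(samples, n, sizeof(*samples), cmp_u64);
    size_t rank = (n * pct + 99) / 100;  /* ceil(n * pct / 100) */
    return samples[rank - 1];
}
```

With only a handful of samples the p99 is simply the maximum, which is a reminder that tail claims need enough restores per mode to mean anything.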

tracepoints worth joining:
  block_rq_issue
  block_rq_complete
  nvme completion hooks (for example, nvme_complete_rq)
  io_uring CQE visibility (for example, io_uring_complete)
  sched_wakeup
  sched_switch
  DMA / registration events when available
  runtime-level "decode resumed" marker

derived signals:
  useful_restore_ns
  locality_miss_rate
  deferred_completion_rate
  registration_churn_per_prefix
  page_cache_side_effect_bytes

That last signal, page_cache_side_effect_bytes, is important. A surprising amount of I/O work looks fine until you ask what it did to the rest of the machine. KV is already a cache. If your restore path pollutes host memory and steals locality from other work, you have only moved the pain around.
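Once the tracepoints are joined into one record per prefix restore, several of the derived signals above reduce to simple arithmetic. The sketch below assumes that join has already happened offline. The record layout and the names kv_restore_event and kv_summarize are hypothetical, and it reports a mean only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One joined record per prefix restore (hypothetical layout). */
struct kv_restore_event {
    uint64_t submit_ns;          /* runtime submitted the restore */
    uint64_t decode_resumed_ns;  /* runtime-level "decode resumed" marker */
    uint32_t completion_cpu;     /* CPU that processed the completion */
    uint32_t consumer_cpu;       /* CPU the consumer runs on */
    uint32_t registrations;      /* pin/map operations seen during restore */
};

struct kv_restore_summary {
    uint64_t mean_useful_restore_ns;
    double   locality_miss_rate;
    double   registration_churn_per_prefix;
};

/* Fold n joined records into the derived signals. */
struct kv_restore_summary kv_summarize(const struct kv_restore_event *ev,
                                       size_t n)
{
    assert(ev != NULL && n > 0);
    uint64_t total_ns = 0, misses = 0, regs = 0;
    for (size_t i = 0; i < n; i++) {
        assert(ev[i].decode_resumed_ns >= ev[i].submit_ns);
        total_ns += ev[i].decode_resumed_ns - ev[i].submit_ns;
        misses += ev[i].completion_cpu != ev[i].consumer_cpu;
        regs += ev[i].registrations;
    }
    struct kv_restore_summary s;
    s.mean_useful_restore_ns = total_ns / n;
    s.locality_miss_rate = (double)misses / (double)n;
    s.registration_churn_per_prefix = (double)regs / (double)n;
    return s;
}
```

Note that useful_restore_ns here spans submit to decode resumed, not submit to I/O completion, which is the whole distinction this essay is arguing for.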

8. Why this matters beyond Mooncake

I used Mooncake because it makes the problem legible, but the argument is broader than one project. Any future where inference systems spill, prefetch, disaggregate, or remotely restore KV state will run into the same question: how many times do we force Linux to rediscover the shape of a restore that the runtime already understands?

The systems opportunity here is modest but real. We do not need a magical kernel that understands LLMs. We need a kernel path that stops pretending every restore is just one more generic file or network operation.

9. Closing thought

Mooncake optimizes where KV lives. Linux can still optimize how KV moves.

That is the opportunity. Once KV cache becomes part of the distributed memory hierarchy, the OS data path matters again. The next win is not necessarily a better model or a better SSD. It may be a better restore contract between the inference runtime and Linux.

If distributed KV cache is the storage story of long-context inference, KV-Direct is the natural kernel sequel.