MANISH AI
Linux · vLLM · Mooncake · KV Cache

KV-Direct: Why Distributed KV Cache Needs a Linux Fast Path

Mooncake Store shows that KV cache is becoming cluster memory. The next bottleneck is not only where KV lives, but how efficiently Linux restores fragmented KV blocks across SSD, DRAM, RDMA, and GPU memory when decode is waiting.

17 min read · Focus: io_uring / NVMe / RDMA / GPU locality · Thesis: placement is solved before movement is

Where the next savings hide

Mooncake / vLLM: prefix hash, placement, block reuse, prefill-decode decoupling
KV-Direct control hints: deadline, locality, page-cache bypass, registered destination arena
Linux fast path: batched io_uring, blk completion steering, NUMA-aware wakeups, direct storage path
Useful bytes in the right place: GPU HBM or CPU DRAM ready for decode, without extra detours

Distributed KV cache changes the problem shape. Once the industry stops treating KV as an ephemeral byproduct of one GPU server and starts treating it as a reusable cluster object, the next constraint is the operating system path that has to fetch, batch, steer, and land that object.

That is why this is now a Linux essay as much as a serving essay. In agentic inference, the expensive thing is often not recomputing one prompt once. It is repeatedly restoring long, fragmented prefix state under latency pressure while the model is waiting for the next useful block to become visible in the right destination arena.

Core claim: distributed KV cache solves placement, naming, and reuse. The next material win is a kernel fast path that treats KV restore as a first-class inference data movement problem rather than a pile of unrelated file reads and transport completions.

That distinction matters because decode does not care that an SSD completed a request in isolation. Decode cares that the useful KV range became available, on time, on the right CPU or GPU path, without extra page-cache churn, registration churn, or scheduler noise.

1. What Mooncake Store gets right

Mooncake Store is directionally right because it acknowledges something local KV caches hide poorly: long-context systems are no longer dominated by one request staying attached to one worker. Prefill can happen on one machine, decode on another, and future turns may revisit the same system prompt, tool trace, or retrieved corpus over and over.

Reuse

KV becomes an addressable object

Instead of assuming every worker has to reconstruct the same prefix, Mooncake identifies blocks by content-derived identity and makes them discoverable across nodes.
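To make "content-derived identity" concrete, here is a minimal sketch of chained block IDs, where each block's ID depends on its own tokens and on every token before it. The FNV-1a hash, the 16-token block size, and the names kv_block_ids and fnv1a_extend are my assumptions for illustration. This is not Mooncake's or vLLM's actual hashing scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KV_BLOCK_TOKENS 16                     // assumed tokens per KV block
#define FNV64_OFFSET    0xcbf29ce484222325ULL  // FNV-1a 64-bit offset basis
#define FNV64_PRIME     0x100000001b3ULL       // FNV-1a 64-bit prime

// Continue an FNV-1a hash from state h over len bytes of data.
static uint64_t fnv1a_extend(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV64_PRIME;
    }
    return h;
}

// Write one ID per full block of tokens into out[]. Because the hash state
// is carried across blocks, out[b] names the whole prefix up to and
// including block b. Returns the number of IDs written.
static size_t kv_block_ids(const uint32_t *tokens, size_t ntokens,
                           uint64_t *out, size_t max_out)
{
    assert(tokens != NULL && out != NULL);
    size_t nblocks = ntokens / KV_BLOCK_TOKENS;
    if (nblocks > max_out)
        nblocks = max_out;

    uint64_t h = FNV64_OFFSET;
    for (size_t b = 0; b < nblocks; b++) {
        h = fnv1a_extend(h, tokens + b * KV_BLOCK_TOKENS,
                         KV_BLOCK_TOKENS * sizeof(uint32_t));
        out[b] = h;
    }
    return nblocks;
}
```

Two requests that share their first 32 tokens get identical IDs for their first two blocks and diverge at the first block that differs, which is what lets a store answer "do you already hold this prefix?" with a lookup. A production scheme would want a collision-resistant hash rather than FNV-1a.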

Movement

Data no longer has to stay local

Once KV state can live in remote DRAM, SSD, or another worker, the serving layer gets a new degree of freedom: fetch versus recompute becomes a runtime choice.
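The fetch-versus-recompute choice reduces to comparing two estimated times. The sketch below is a deliberately crude cost model: a fixed fetch latency plus bytes over bandwidth, against tokens over prefill throughput. The struct, the function names, and every number in the usage note are hypothetical placeholders, not measurements from Mooncake or any real system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct kv_restore_cost {
    double fetch_latency_s;       // fixed cost to start a restore
    double fetch_bytes_per_s;     // sustained bandwidth of the backing tier
    double prefill_tokens_per_s;  // throughput if we recompute instead
};

// Estimated time to fetch `bytes` of KV from the backing tier.
static double kv_fetch_seconds(const struct kv_restore_cost *c, uint64_t bytes)
{
    return c->fetch_latency_s + (double)bytes / c->fetch_bytes_per_s;
}

// Estimated time to rebuild the KV for `tokens` tokens by running prefill.
static double kv_recompute_seconds(const struct kv_restore_cost *c,
                                   uint64_t tokens)
{
    return (double)tokens / c->prefill_tokens_per_s;
}

// True when fetching the prefix is estimated to beat recomputing it.
static bool kv_should_fetch(const struct kv_restore_cost *c,
                            uint64_t tokens, uint64_t kv_bytes_per_token)
{
    assert(c->fetch_bytes_per_s > 0.0 && c->prefill_tokens_per_s > 0.0);
    return kv_fetch_seconds(c, tokens * kv_bytes_per_token)
         < kv_recompute_seconds(c, tokens);
}
```

With made-up inputs of 1 ms fetch latency, 1 GB/s bandwidth, 1,000 tokens/s prefill, and 100 KB of KV per token, a one-token prefix is cheaper to recompute while a 1,000-token prefix is cheaper to fetch. The fixed latency term is the one a restore fast path attacks directly, which is why the break-even point moves when the OS path improves.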

This is a big architectural change. It turns KV cache into part of the memory hierarchy. And once something becomes part of the memory hierarchy, the mundane things matter again: queue steering, completion locality, direct I/O, pinned memory lifetime, and whether the next block lands where the consumer can actually use it.

2. Why generic file and network I/O starts to lose

The first instinct is to say: "We already have read(), mmap(), io_uring, RDMA, and GPUDirect-style paths. Why invent another abstraction?" The answer is that all of those mechanisms are lower-level than the policy problem we now care about.

Generic Linux I/O assumes that if bytes eventually arrive, the contract is mostly satisfied. Distributed KV restore has a stricter contract:

  • many small or medium KV blocks belong to one logical prefix restore;
  • some of those blocks are decode-critical while others can lag;
  • the destination is often a long-lived structured arena, not an arbitrary buffer;
  • the useful metric is not request completion but decode readiness.

[Figure: "A normal I/O completion is not the same as decode readiness." KV restore pays for every handoff between media completion and consumer usability.]

NVMe / DRAM (block ready) → IRQ + blk path (which CPU sees it?) → io_uring / CQE (when is userspace told?) → Arena landing (right NUMA or GPU path?) → Decode can consume the useful range (this is the event that matters)

KV-Direct goal: collapse these boundaries into a batched, locality-aware path whose completions describe useful restore progress.

Figure 1. The hard part is not issuing I/O. It is preserving usefulness across the whole path.

| Layer | Generic behavior | Why it wastes time for KV restore |
| --- | --- | --- |
| Page cache | Optimizes for broad file reuse and buffered ownership | KV is already a managed cache, so double-caching burns memory and muddies locality |
| Submission granularity | Many individual reads or messages | Prefix restore becomes descriptor-heavy and completion-heavy before devices saturate |
| Completion semantics | Request-oriented | The runtime wants "enough useful blocks are ready," not thousands of tiny unrelated done signals |
| Wakeup policy | Generic scheduler resumption | Wrong-core wakeups and remote memory ownership widen p99 restore latency |
| Registration | Often rediscovered or rebuilt at flow boundaries | Repeated pin/map work becomes real CPU cost in stable serving pipelines |
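
The completion-semantics row is the easiest one to make concrete. Below is a minimal userspace sketch of a tracker that absorbs per-block completions and signals exactly once, when the last decode-critical block of a batch has landed. The struct and function names are hypothetical; nothing like this exists in io_uring today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct kv_batch_tracker {
    uint32_t critical_pending;  // decode-critical blocks not yet landed
    uint32_t total_pending;     // all blocks not yet landed
    uint64_t decode_ready_ns;   // when the last critical block landed; 0 = not yet
};

static void kv_batch_init(struct kv_batch_tracker *t,
                          uint32_t ncritical, uint32_t nbackground)
{
    assert(ncritical > 0);  // a batch with nothing critical has no readiness event
    t->critical_pending = ncritical;
    t->total_pending = ncritical + nbackground;
    t->decode_ready_ns = 0;
}

// Record one block completion. Returns true exactly once: on the completion
// that makes the batch decode-ready. Background completions never signal.
static bool kv_batch_complete_one(struct kv_batch_tracker *t,
                                  bool critical, uint64_t now_ns)
{
    assert(t->total_pending > 0);
    t->total_pending--;
    if (!critical)
        return false;

    assert(t->critical_pending > 0);
    if (--t->critical_pending == 0) {
        t->decode_ready_ns = now_ns;
        return true;
    }
    return false;
}
```

The consumer is woken once per batch instead of once per block, and decode_ready_ns is the timestamp that matters for tail latency. Background blocks can keep trickling in after decode has resumed.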

3. The missing layer is semantic, not just faster transport

It is tempting to say RDMA already solved the movement problem. RDMA solved the transport mechanics beautifully. It did not solve the end-to-end restore contract above transport. That contract includes deadline hints, useful-range tracking, destination arena identity, completion steering, and a way to express that one batch of blocks belongs to one logical decode-critical restore.

Another way to say it: the stack has enough mechanisms and not enough intent. What is missing is a place where the runtime can tell Linux what kind of movement this is.

That is the gap I am calling KV-Direct. Not necessarily as a brand-new monolithic subsystem on day one, but as a kernel-facing interface shape that lets the runtime describe KV movement in a way the stack can optimize coherently.

4. Proposal: KV-Direct as a batched, deadline-aware restore path

KV-Direct is a simple principle: make KV restore look less like thousands of tiny normal I/O operations and more like one inference restore transaction that carries enough structure for Linux to do the right thing.

restore this prefix
these blocks belong together
these ranges are decode-critical
this destination is a registered arena
this work can bypass the page cache
this completion should mean useful progress, not just byte arrival

Interface

Batched object movement

Submit one batch describing many KV blocks so the system can co-schedule, co-account, and co-complete them as one restore job.

Memory

Registered KV arenas

Give the kernel stable destination handles for CPU or GPU-visible arenas so repeated restore does not re-discover layout on every turn.

Policy

Deadline and usefulness hints

Differentiate decode-critical restore from background spill, speculative prefetch, and recompute-acceptable movement.
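
As a rough illustration of what these hints buy, here is a sketch that orders a batch so decode-critical work goes first and, within each class, the earliest deadline wins. The flag bit values, the kv_req struct, and the three-level ranking are my assumptions. Only the flag names echo the ones used elsewhere in this essay.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical flag bits; the values are placeholders.
enum {
    KV_F_DECODE_CRITICAL = 1u << 0,
    KV_F_PREFETCH        = 1u << 1,
    KV_F_SPILL           = 1u << 2,
};

struct kv_req {
    uint32_t flags;
    uint64_t deadline_ns;
    uint16_t block_id;
};

// Lower rank is served first: critical, then prefetch/unflagged, then spill.
static int kv_class_rank(uint32_t flags)
{
    if (flags & KV_F_DECODE_CRITICAL)
        return 0;
    if (flags & KV_F_SPILL)
        return 2;
    return 1;
}

// qsort comparator: class rank first, then earliest deadline.
static int kv_req_cmp(const void *pa, const void *pb)
{
    const struct kv_req *a = pa, *b = pb;
    int ra = kv_class_rank(a->flags), rb = kv_class_rank(b->flags);
    if (ra != rb)
        return ra - rb;
    return (a->deadline_ns > b->deadline_ns) - (a->deadline_ns < b->deadline_ns);
}

static void kv_order_batch(struct kv_req *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);
    qsort(reqs, n, sizeof(*reqs), kv_req_cmp);
}
```

A runtime can do this ordering itself today. The argument for passing the hints down is that the block layer and the scheduler make their own ordering decisions later, and without the hints they cannot tell a spill from a stalled decode.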

Storage

Page-cache bypass by default

SSD-backed KV should usually be treated as structured backing storage for a cache, not as ordinary file data hoping to benefit from generic buffered reuse.
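
For reference, this is roughly what page-cache bypass looks like with today's interfaces: open with O_DIRECT, keep offset, length, and buffer aligned, and fall back to a buffered read plus a drop hint when the filesystem refuses direct I/O. The helper names and the 4096-byte alignment are assumptions. The real requirement depends on the device's logical block size.

```c
#define _GNU_SOURCE            // exposes O_DIRECT in <fcntl.h>
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define KV_IO_ALIGN 4096       // assumed to cover the logical block size

// Allocate a buffer aligned for direct I/O; returns NULL on failure.
static void *kv_alloc_aligned(size_t len)
{
    void *p = NULL;
    if (posix_memalign(&p, KV_IO_ALIGN, len) != 0)
        return NULL;
    return p;
}

// Read len bytes at offset into buf, bypassing the page cache when possible.
// Returns bytes read or -1; *used_direct reports which path ran.
static ssize_t kv_read_block(const char *path, off_t offset,
                             void *buf, size_t len, int *used_direct)
{
    // O_DIRECT needs the file offset, length, and buffer address aligned.
    assert(offset % KV_IO_ALIGN == 0 && len % KV_IO_ALIGN == 0);
    assert((uintptr_t)buf % KV_IO_ALIGN == 0);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        ssize_t n = pread(fd, buf, len, offset);
        close(fd);
        if (n >= 0) {
            *used_direct = 1;
            return n;
        }
    }

    // Fallback (for example, a filesystem that rejects O_DIRECT): buffered
    // read, then hint that the cached pages will not be needed again.
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, len, offset);
    (void)posix_fadvise(fd, offset, (off_t)len, POSIX_FADV_DONTNEED);
    close(fd);
    *used_direct = 0;
    return n;
}
```

This opens and closes the file per block only to keep the sketch short. A restore path would hold the descriptor open and issue the reads through io_uring. The point is that bypass is already expressible per file; what is missing is a way to say it once for a whole restore.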

The important point is not that Linux needs to learn transformer math. It does not. It only needs enough semantic structure to preserve locality, reduce completion noise, and avoid generic behavior that made sense for files but not for reusable KV state.

5. An interface sketch

This is not a final ABI proposal. It is a shape sketch for what a useful restore descriptor could look like if we wanted to build the simplest possible first-class contract between inference runtimes and Linux.

#include <linux/types.h>  // __u16, __u32, __u64

struct kv_move_desc {
    __u64 kv_hash;        // prefix or block identity
    __u64 src_handle;     // SSD extent, DRAM object, or remote memory handle
    __u64 dst_handle;     // registered GPU/CPU arena
    __u64 dst_offset;     // offset within the destination arena
    __u32 len;            // bytes in this KV block
    __u16 layer_id;       // optional layout hint
    __u16 block_id;       // logical ordering within the prefix
    __u32 flags;          // PREFETCH, DECODE_CRITICAL, SPILL, RECOMPUTE_OK, BYPASS_PAGE_CACHE
    __u32 reserved;       // explicit padding so deadline_ns is 8-byte aligned; must be zero
    __u64 deadline_ns;    // when decode wants it
    __u32 locality_hint;  // target NUMA or completion domain
    __u32 batch_id;       // group completion / accounting
};

Then the runtime submits many of them together, ideally through an io_uring-compatible model so the first prototype can reuse existing infrastructure rather than waiting for an all-new subsystem.

// Describe every block of one prefix restore. All entries share a batch_id
// so they can be completed and accounted for as a single job.
for (int i = 0; i < nblocks; i++) {
    batch[i].kv_hash       = prefix_blocks[i].hash;
    batch[i].src_handle    = prefix_blocks[i].backing_handle;
    batch[i].dst_handle    = gpu_kv_arena;
    batch[i].dst_offset    = prefix_blocks[i].dst_offset;
    batch[i].len           = prefix_blocks[i].len;
    batch[i].layer_id      = prefix_blocks[i].layer_id;
    batch[i].block_id      = prefix_blocks[i].block_id;
    batch[i].flags         = KV_F_DECODE_CRITICAL | KV_F_BYPASS_PAGE_CACHE;
    batch[i].deadline_ns   = decode_deadline_ns;
    batch[i].locality_hint = target_cpu_or_numa;
    batch[i].batch_id      = prefix_batch_id;
}

// submit_kv_batch() is a placeholder for the eventual submission call,
// not an existing io_uring API.
submit_kv_batch(ring, batch, nblocks);
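
One thing a batch makes possible that individual reads do not is merging. The sketch below coalesces neighbouring blocks whose source and destination ranges are both contiguous, so fewer descriptors reach the device. The kv_extent struct and kv_coalesce are hypothetical, and the input is assumed to be sorted by source offset already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct kv_extent {
    uint64_t src_off;  // offset in the backing object
    uint64_t dst_off;  // offset in the destination arena
    uint32_t len;      // bytes
};

// Merge neighbours in place when both source and destination are contiguous
// and the merged length stays within max_len. Returns the new extent count.
static size_t kv_coalesce(struct kv_extent *e, size_t n, uint32_t max_len)
{
    assert(e != NULL || n == 0);
    if (n == 0)
        return 0;

    size_t out = 0;  // index of the extent currently being grown
    for (size_t i = 1; i < n; i++) {
        struct kv_extent *cur = &e[out];
        bool src_adjacent = cur->src_off + cur->len == e[i].src_off;
        bool dst_adjacent = cur->dst_off + cur->len == e[i].dst_off;
        bool fits = (uint64_t)cur->len + e[i].len <= max_len;

        if (src_adjacent && dst_adjacent && fits)
            cur->len += e[i].len;
        else
            e[++out] = e[i];
    }
    return out + 1;
}
```

Three adjacent 4 KiB blocks followed by one block stored elsewhere collapse from four I/Os to two. How often this fires depends on how the store lays blocks out, which is a reason to count descriptors per prefix before and after rather than assume the win.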

The first implementation could be modest:

  • linked io_uring operations with registered buffers and direct I/O;
  • tracepoints that let us observe useful restore rather than just request completion;
  • completion steering and locality hints layered on top of existing block and scheduler plumbing.
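
The locality bullet can be approximated from userspace before touching the kernel: pin the thread that reaps completions to the CPU where the consumer runs, and count how often the two disagree. This is a minimal sketch using sched_setaffinity and sched_getcpu. The helper names are hypothetical, and it covers only the userspace half, not IRQ or block-layer completion placement.

```c
#define _GNU_SOURCE   // sched_setaffinity, sched_getcpu, CPU_SET
#include <assert.h>
#include <sched.h>

// Pin the calling thread to a single CPU. Returns 0 on success, -1 on error.
static int kv_pin_self_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  // pid 0 = calling thread
}

// 1 when a completion was reaped on a different CPU than the consumer's.
static int kv_locality_miss(int completion_cpu, int consumer_cpu)
{
    return completion_cpu != consumer_cpu;
}
```

A completion thread would call kv_pin_self_to_cpu once before entering its reap loop and record sched_getcpu() alongside each completion. If io_uring's SQPOLL mode is in use, the submission-side kernel thread can be pinned separately with IORING_SETUP_SQ_AFF and sq_thread_cpu.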

Only later, if the win is real, would it make sense to carve out a true IORING_OP_KV_MOVE-style primitive or device-specific acceleration path.

6. What a realistic first Linux prototype should include

A lot of good systems ideas die because they jump straight to "new subsystem" before proving the bottleneck. I would start much smaller and much more ruthlessly measurable.

| Prototype piece | Minimal version | Why it is enough to learn something |
| --- | --- | --- |
| Submission model | Batched descriptors above io_uring | Shows whether descriptor aggregation and grouped completion already cut CPU overhead |
| Storage mode | Direct I/O into registered host buffers | Separates page-cache effects from genuine restore-path behavior |
| Completion policy | Consumer-affine CQ processing and wakeups | Tests whether locality is a first-order contributor to p99 |
| Arena model | Long-lived destination handles with fixed layout metadata | Makes registration and placement reuse visible |
| Observability | Trace from device completion to userspace-ready to decode-start proxy | Lets us measure useful restore instead of admiring raw bandwidth |
Failure mode to avoid: proving only that NVMe or RDMA can move bytes quickly in isolation. We already know that. The question is whether the full restore path gets decode-ready sooner with lower CPU cost and tighter tails.
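
The arena-model row in the table above can be prototyped in userspace first. Here is a minimal sketch of a registry that hands out small integer handles for long-lived arenas and resolves (handle, offset, length) into a bounds-checked pointer. Every name is hypothetical. A real path would also pin the memory and register it with the transport, for example via io_uring_register_buffers or ibv_reg_mr, which this sketch does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KV_MAX_ARENAS 8

struct kv_arena {
    void  *base;
    size_t len;
    int    numa_node;  // placement metadata recorded once at registration
    bool   in_use;
};

static struct kv_arena kv_arenas[KV_MAX_ARENAS];

// Register a long-lived arena once. Returns a handle >= 0, or -1 if full.
static int kv_arena_register(void *base, size_t len, int numa_node)
{
    assert(base != NULL && len > 0);
    for (int h = 0; h < KV_MAX_ARENAS; h++) {
        if (!kv_arenas[h].in_use) {
            kv_arenas[h] = (struct kv_arena){ base, len, numa_node, true };
            return h;
        }
    }
    return -1;
}

// Resolve a destination range to a pointer, or NULL if the handle is
// unknown or the range does not fit inside the arena.
static void *kv_arena_resolve(int handle, uint64_t offset, uint64_t len)
{
    if (handle < 0 || handle >= KV_MAX_ARENAS || !kv_arenas[handle].in_use)
        return NULL;
    const struct kv_arena *a = &kv_arenas[handle];
    if (offset > a->len || len > a->len - offset)  // overflow-safe bounds check
        return NULL;
    return (char *)a->base + offset;
}
```

Registration happens once per arena rather than once per restore, which is the reuse the table row is trying to make visible. Counting calls to kv_arena_register per prefix gives a crude registration-churn signal.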

7. What to measure if we actually want to prove the idea

A convincing benchmark would restore synthetic KV objects from NVMe or remote DRAM into a registered arena and compare four modes: normal buffered file I/O, mmap, direct I/O, and a batched KV-Direct-style path.

| Metric | Why it matters |
| --- | --- |
| p50/p99 useful_restore_ns | Measures when the needed KV range becomes usable, not just when a read completes |
| cycles_per_gb_restored | Shows whether the host CPU becomes the hidden bottleneck |
| completion_cpu != consumer_cpu ratio | Captures locality misses that inflate wakeup and cache costs |
| descriptors_per_prefix | Exposes tiny-I/O amplification and completion noise |
| page_cache_bytes_polluted | Tests whether the system is pointlessly double-caching backing KV |
| decode-start proxy or TTFT proxy | Translates kernel wins into something the serving stack actually feels |
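
The first and third rows reduce to simple arithmetic over per-restore samples. A minimal sketch, assuming the harness records a submit timestamp, a "useful range ready" timestamp, and the two CPU IDs for each restore. The struct layout, the function names, and the nearest-rank percentile definition are my choices, not an established format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct kv_restore_sample {
    uint64_t submit_ns;       // batch submitted
    uint64_t useful_ns;       // needed KV range became usable
    int      completion_cpu;  // CPU that reaped the completion
    int      consumer_cpu;    // CPU the decode consumer runs on
};

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

// Nearest-rank percentile (pct in 1..100) of useful_ns - submit_ns.
// scratch must have room for n values.
static uint64_t kv_useful_restore_pct(const struct kv_restore_sample *s,
                                      size_t n, unsigned pct, uint64_t *scratch)
{
    assert(n > 0 && pct >= 1 && pct <= 100);
    for (size_t i = 0; i < n; i++)
        scratch[i] = s[i].useful_ns - s[i].submit_ns;
    qsort(scratch, n, sizeof(*scratch), cmp_u64);
    size_t rank = (n * pct + 99) / 100;  // ceil(n * pct / 100), 1-based
    return scratch[rank - 1];
}

// Fraction of restores whose completion ran on a different CPU than the consumer.
static double kv_locality_miss_ratio(const struct kv_restore_sample *s, size_t n)
{
    assert(n > 0);
    size_t misses = 0;
    for (size_t i = 0; i < n; i++)
        misses += (s[i].completion_cpu != s[i].consumer_cpu);
    return (double)misses / (double)n;
}
```

Running the same reduction over each I/O mode gives directly comparable p50 and p99 figures. The timestamps themselves should come from one monotonic clock so that kernel-side and runtime-side events can be joined.
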
tracepoints worth joining:
  block_rq_issue
  block_rq_complete
  nvme completion hooks (e.g. nvme_complete_rq)
  io_uring CQE visibility (e.g. io_uring_complete)
  sched_wakeup
  sched_switch
  DMA / registration events when available
  runtime-level "decode resumed" marker

derived signals:
  useful_restore_ns
  locality_miss_rate
  deferred_completion_rate
  registration_churn_per_prefix
  page_cache_side_effect_bytes

The last of those signals, page_cache_side_effect_bytes, is important. A surprising amount of I/O work looks fine until you ask what it did to the rest of the machine. KV is already a cache. If your restore path pollutes host memory and steals locality from other work, you have only moved the pain around.
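
One way to put a number on page-cache side effects without kernel changes is to map the backing file and ask mincore() which pages are resident, before and after a restore. This is a minimal sketch; the function name is hypothetical. As far as I know, recent kernels report accurate residency only for files the caller owns or can write, so treat the result as an estimate.

```c
#define _GNU_SOURCE            // mincore()
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Bytes of `path` currently resident in the page cache, rounded up to whole
// pages. Returns -1 on error.
static long long kv_page_cache_resident_bytes(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + (size_t)page - 1) / (size_t)page;

    // Mapping the file does not fault pages in; mincore only inspects state.
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after close
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    long long resident = -1;
    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            if (vec[i] & 1)  // low bit set means the page is resident
                resident += page;
    }
    free(vec);
    munmap(map, len);
    return resident;
}
```

The difference between two calls bracketing a restore approximates how many bytes that restore left behind in the page cache. A direct I/O path should leave the number roughly unchanged, while a buffered path should grow it by about the restored size.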

8. Why this matters beyond Mooncake

I used Mooncake because it makes the problem legible, but the argument is broader than one project. Any future where inference systems spill, prefetch, disaggregate, or remotely restore KV state will run into the same question: how many times do we force Linux to rediscover the shape of a restore that the runtime already understands?

The systems opportunity here is modest but real. We do not need a magical kernel that understands LLMs. We need a kernel path that stops pretending every restore is just one more generic file or network operation.

9. Closing thought

Mooncake optimizes where KV lives. Linux can still optimize how KV moves.

That is the opportunity. Once KV cache becomes part of the distributed memory hierarchy, the OS data path matters again. The next win is not necessarily a better model or a better SSD. It may be a better restore contract between the inference runtime and Linux.

If distributed KV cache is the storage story of long-context inference, KV-Direct is the natural kernel sequel.