KV-Direct: Why Distributed KV Cache Needs a Linux Fast Path
Mooncake Store shows that KV cache is becoming cluster memory. The next bottleneck is not only where KV lives, but how efficiently Linux restores fragmented KV blocks across SSD, host DRAM, RDMA-reachable remote memory, and GPU memory when decode is waiting.
Where the next savings hide
Distributed KV cache changes the problem shape. Once the industry stops treating KV as an ephemeral byproduct of one GPU server and starts treating it as a reusable cluster object, the next constraint is the operating system path that has to fetch, batch, steer, and land that object.
That is why this is now a Linux essay as much as a serving essay. In agentic inference, the expensive thing is often not recomputing one prompt once. It is repeatedly restoring long, fragmented prefix state under latency pressure while the model is waiting for the next useful block to become visible in the right destination arena.
That distinction matters because decode does not care that an SSD completed a request in isolation. Decode cares that the useful KV range became available, on time, on the right CPU or GPU path, without extra page-cache churn, registration churn, or scheduler noise.
1. What Mooncake Store gets right
Mooncake Store is directionally right because it acknowledges something that purely local KV caches obscure: long-context systems are no longer dominated by one request staying attached to one worker. Prefill can happen on one machine, decode on another, and future turns may revisit the same system prompt, tool trace, or retrieved corpus over and over.
KV becomes an addressable object
Instead of assuming every worker has to reconstruct the same prefix, Mooncake identifies blocks by content-derived identity and makes them discoverable across nodes.
Data no longer has to stay local
Once KV state can live in remote DRAM, SSD, or another worker, the serving layer gets a new degree of freedom: fetch versus recompute becomes a runtime choice.
This is a big architectural change. It turns KV cache into part of the memory hierarchy. And once something becomes part of the memory hierarchy, the mundane things matter again: queue steering, completion locality, direct I/O, pinned memory lifetime, and whether the next block lands where the consumer can actually use it.
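The fetch-versus-recompute choice described above can be sketched as a small cost comparison. This is a minimal illustration, not Mooncake's policy: the struct `kv_cost_model`, the function `kv_should_fetch`, and the linear cost model are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cost inputs; none of these names come from Mooncake. */
struct kv_cost_model {
    double fetch_fixed_ns;          /* per-restore setup cost: lookup, queueing */
    double fetch_bytes_per_ns;      /* sustained bandwidth of the backing tier */
    double recompute_ns_per_token;  /* prefill cost to rebuild one prefix token */
};

/* Return true when restoring stored KV is expected to beat recomputing it. */
bool kv_should_fetch(const struct kv_cost_model *m,
                     uint64_t kv_bytes, uint64_t prefix_tokens)
{
    assert(m->fetch_bytes_per_ns > 0.0);
    double fetch_ns = m->fetch_fixed_ns +
                      (double)kv_bytes / m->fetch_bytes_per_ns;
    double recompute_ns = (double)prefix_tokens * m->recompute_ns_per_token;
    return fetch_ns < recompute_ns;
}
```

A real runtime would likely feed this from measured tail latency rather than averages, since the point of the essay is that the restore path's p99 is what decode feels.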
2. Why generic file and network I/O starts to lose
The first instinct is to say: "We already have read(), mmap(), io_uring, RDMA, and GPUDirect-style paths. Why invent another abstraction?" The answer is that all of those mechanisms are lower-level than the policy problem we now care about.
Generic Linux I/O assumes that if bytes eventually arrive, the contract is mostly satisfied. Distributed KV restore has a stricter contract:
- many small or medium KV blocks belong to one logical prefix restore;
- some of those blocks are decode-critical while others can lag;
- the destination is often a long-lived structured arena, not an arbitrary buffer;
- the useful metric is not request completion but decode readiness.
| Layer | Generic behavior | Why it wastes time for KV restore |
|---|---|---|
| Page cache | Optimizes for broad file reuse and buffered ownership | KV is already a managed cache, so double-caching burns memory and muddies locality |
| Submission granularity | Many individual reads or messages | Prefix restore becomes descriptor-heavy and completion-heavy before devices saturate |
| Completion semantics | Request-oriented | The runtime wants "enough useful blocks are ready," not thousands of tiny unrelated done signals |
| Wakeup policy | Generic scheduler resumption | Wrong-core wakeups and remote memory ownership widen p99 restore latency |
| Registration | Often rediscovered or rebuilt at flow boundaries | Repeated pin/map work becomes real CPU cost in stable serving pipelines |
3. The missing layer is semantic, not just faster transport
It is tempting to say RDMA already solved the movement problem. RDMA solved the transport mechanics beautifully. It did not solve the end-to-end restore contract above transport. That contract includes deadline hints, useful-range tracking, destination arena identity, completion steering, and a way to express that one batch of blocks belongs to one logical decode-critical restore.
That is the gap I am calling KV-Direct. Not necessarily as a brand-new monolithic subsystem on day one, but as a kernel-facing interface shape that lets the runtime describe KV movement in a way the stack can optimize coherently.
4. Proposal: KV-Direct as a batched, deadline-aware restore path
KV-Direct is a simple principle: make KV restore look less like thousands of tiny normal I/O operations and more like one inference restore transaction that carries enough structure for Linux to do the right thing.
- restore this prefix
- these blocks belong together
- these ranges are decode-critical
- this destination is a registered arena
- this work can bypass the page cache
- this completion should mean useful progress, not just byte arrival
Batched object movement
Submit one batch describing many KV blocks so the system can co-schedule, co-account, and co-complete them as one restore job.
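As a rough sketch of what "co-complete" could mean in userspace, the tracker below counts outstanding blocks per batch and signals once, when the last decode-critical block lands. The names `kv_batch_state`, `kv_batch_init`, and `kv_batch_complete_one` are hypothetical, and the code is single-threaded for clarity; a real completion path would need atomics or per-queue ownership.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-batch state: one instance per logical prefix restore. */
struct kv_batch_state {
    uint32_t critical_pending;  /* decode-critical blocks not yet landed */
    uint32_t total_pending;     /* all blocks not yet landed */
};

void kv_batch_init(struct kv_batch_state *b, uint32_t critical, uint32_t total)
{
    assert(critical <= total);
    b->critical_pending = critical;
    b->total_pending = total;
}

/*
 * Record one block completion. Returns true only for the completion that
 * lands the last decode-critical block, which is when decode can resume.
 */
bool kv_batch_complete_one(struct kv_batch_state *b, bool was_critical)
{
    assert(b->total_pending > 0);
    b->total_pending--;
    if (!was_critical)
        return false;
    assert(b->critical_pending > 0);
    b->critical_pending--;
    return b->critical_pending == 0;
}
```

The design choice is that non-critical completions never wake the consumer; they only update accounting.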
Registered KV arenas
Give the kernel stable destination handles for CPU or GPU-visible arenas so repeated restore does not re-discover layout on every turn.
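One thing a stable arena handle makes cheap is validation: the layout is known once, so each restore only needs a range check. The sketch below assumes a hypothetical `struct kv_arena` with a fixed size and a power-of-two block alignment; nothing here is an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical registered arena: set up once, reused across restores. */
struct kv_arena {
    uint64_t handle;       /* opaque id the runtime would pass as dst_handle */
    uint64_t size;         /* bytes in the arena */
    uint32_t block_align;  /* required alignment of block offsets, power of two */
};

/* Check that [offset, offset + len) lies inside the arena and is aligned. */
bool kv_arena_range_ok(const struct kv_arena *a, uint64_t offset, uint32_t len)
{
    assert(a->block_align != 0 &&
           (a->block_align & (a->block_align - 1)) == 0);
    if (len == 0 || offset > a->size)
        return false;
    if ((uint64_t)len > a->size - offset)  /* overflow-safe bounds check */
        return false;
    return (offset & (uint64_t)(a->block_align - 1)) == 0;
}
```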
Deadline and usefulness hints
Differentiate decode-critical restore from background spill, speculative prefetch, and recompute-acceptable movement.
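A minimal way to act on such hints is to order pending movement by class and then by deadline. The sketch below does that with `qsort`; the enum `kv_move_class`, its ordering, and the struct `kv_request` are assumptions for illustration, not part of any proposed ABI.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical movement classes, most urgent first (ordering is an assumption). */
enum kv_move_class {
    KV_CLASS_DECODE_CRITICAL = 0,
    KV_CLASS_PREFETCH = 1,
    KV_CLASS_RECOMPUTE_OK = 2,
    KV_CLASS_SPILL = 3
};

struct kv_request {
    enum kv_move_class cls;
    uint64_t deadline_ns;  /* absolute deadline; earlier is more urgent */
    uint32_t block_id;
};

/* qsort comparator: order by class first, then by earliest deadline. */
static int kv_request_cmp(const void *pa, const void *pb)
{
    const struct kv_request *a = pa, *b = pb;
    if (a->cls != b->cls)
        return a->cls < b->cls ? -1 : 1;
    if (a->deadline_ns != b->deadline_ns)
        return a->deadline_ns < b->deadline_ns ? -1 : 1;
    return 0;
}

/* Sort requests in place so the most urgent movement is issued first. */
void kv_order_requests(struct kv_request *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);
    qsort(reqs, n, sizeof(*reqs), kv_request_cmp);
}
```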
Page-cache bypass by default
SSD-backed KV should usually be treated as structured backing storage for a cache, not as ordinary file data hoping to benefit from generic buffered reuse.
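Bypassing the page cache on Linux usually means opening the backing file with `O_DIRECT`, which typically requires the file offset, the length, and the buffer address to be aligned to the device's logical block size. The helper below is a sketch of the offset and length part only; `kv_dio_window` and `kv_dio_expand` are hypothetical names, and the buffer itself would come from something like `posix_memalign`.

```c
#include <assert.h>
#include <stdint.h>

/* Aligned window that covers a requested byte range. */
struct kv_dio_window {
    uint64_t offset;  /* aligned start, at or before the requested offset */
    uint64_t len;     /* aligned length covering the whole request */
    uint64_t skip;    /* bytes to skip in the buffer to reach the payload */
};

/*
 * Expand [off, off + len) outward to multiples of `align`, which must be a
 * power of two (for example the logical block size of the device).
 */
struct kv_dio_window kv_dio_expand(uint64_t off, uint64_t len, uint64_t align)
{
    struct kv_dio_window w;
    assert(align != 0 && (align & (align - 1)) == 0);
    w.offset = off & ~(align - 1);
    w.skip = off - w.offset;
    w.len = (w.skip + len + align - 1) & ~(align - 1);
    return w;
}
```

If KV blocks are laid out on aligned boundaries to begin with, `skip` is always zero and no bytes are read twice, which is one argument for treating SSD-backed KV as structured backing storage.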
The important point is not that Linux needs to learn transformer math. It does not. It only needs enough semantic structure to preserve locality, reduce completion noise, and avoid generic behavior that made sense for files but not for reusable KV state.
5. An interface sketch
This is not a final ABI proposal. It is a shape sketch for what a useful restore descriptor could look like if we wanted to build the simplest possible first-class contract between inference runtimes and Linux.
struct kv_move_desc {
    __u64 kv_hash;        // prefix or block identity
    __u64 src_handle;     // SSD extent, DRAM object, or remote memory handle
    __u64 dst_handle;     // registered GPU/CPU arena
    __u64 dst_offset;     // offset within the destination arena
    __u32 len;            // bytes in this KV block
    __u16 layer_id;       // optional layout hint
    __u16 block_id;       // logical ordering within the prefix
    __u32 flags;          // PREFETCH, DECODE_CRITICAL, SPILL, RECOMPUTE_OK, BYPASS_PAGE_CACHE
    __u64 deadline_ns;    // when decode wants it
    __u32 locality_hint;  // target NUMA or completion domain
    __u32 batch_id;       // group completion / accounting
};
Then the runtime submits many of them together, ideally through an io_uring-compatible model so the first prototype can reuse existing infrastructure rather than waiting for an all-new subsystem.
// Fill one descriptor per KV block of the prefix being restored.
for (int i = 0; i < nblocks; i++) {
    batch[i].kv_hash = prefix_blocks[i].hash;
    batch[i].src_handle = prefix_blocks[i].backing_handle;
    batch[i].dst_handle = gpu_kv_arena;
    batch[i].dst_offset = prefix_blocks[i].dst_offset;
    batch[i].len = prefix_blocks[i].len;
    batch[i].layer_id = prefix_blocks[i].layer_id;
    batch[i].block_id = prefix_blocks[i].block_id;
    batch[i].flags = KV_F_DECODE_CRITICAL | KV_F_BYPASS_PAGE_CACHE;
    batch[i].deadline_ns = decode_deadline_ns;
    batch[i].locality_hint = target_cpu_or_numa;
    batch[i].batch_id = prefix_batch_id;
}
// submit_kv_batch() is a placeholder for the batched submission call.
submit_kv_batch(ring, batch, nblocks);
The first implementation could be modest:
- linked io_uring operations with registered buffers and direct I/O;
- tracepoints that let us observe useful restore rather than just request completion;
- completion steering and locality hints layered on top of existing block and scheduler plumbing.
Only later, if the win is real, would it make sense to carve out a true IORING_OP_KV_MOVE-style primitive or device-specific acceleration path.
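One layout detail in `struct kv_move_desc` is worth checking before any prototype. As sketched, `flags` ends at byte 44, so on common 64-bit ABIs the compiler would insert 4 bytes of implicit padding before the 8-byte `deadline_ns`, while some 32-bit ABIs would not. The variant below makes that padding explicit with a `reserved` field and verifies the layout at compile time. The `reserved` field, the `_padded` name, and the use of `<stdint.h>` types in place of `__u64` are assumptions added for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same fields as the descriptor sketch, with the padding made explicit. */
struct kv_move_desc_padded {
    uint64_t kv_hash;
    uint64_t src_handle;
    uint64_t dst_handle;
    uint64_t dst_offset;
    uint32_t len;
    uint16_t layer_id;
    uint16_t block_id;
    uint32_t flags;
    uint32_t reserved;      /* hypothetical explicit pad, expected to be zero */
    uint64_t deadline_ns;
    uint32_t locality_hint;
    uint32_t batch_id;
};

/* Compile-time checks that the layout has no hidden holes. */
_Static_assert(offsetof(struct kv_move_desc_padded, deadline_ns) == 48,
               "deadline_ns should start at byte 48");
_Static_assert(sizeof(struct kv_move_desc_padded) == 64,
               "descriptor should be exactly 64 bytes");

/* Zero a descriptor so reserved bytes never carry stale data. */
void kv_move_desc_clear(struct kv_move_desc_padded *d)
{
    assert(d != NULL);
    memset(d, 0, sizeof(*d));
}
```

A 64-byte descriptor also happens to match a common cache-line size, which may matter if descriptors are written by one CPU and consumed by another.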
6. What a realistic first Linux prototype should include
A lot of good systems ideas die because they jump straight to "new subsystem" before proving the bottleneck. I would start much smaller and make every step ruthlessly measurable.
| Prototype piece | Minimal version | Why it is enough to learn something |
|---|---|---|
| Submission model | Batched descriptors above io_uring | Shows whether descriptor aggregation and grouped completion already cut CPU overhead |
| Storage mode | Direct I/O into registered host buffers | Separates page-cache effects from genuine restore-path behavior |
| Completion policy | Consumer-affine CQ processing and wakeups | Tests whether locality is a first-order contributor to p99 |
| Arena model | Long-lived destination handles with fixed layout metadata | Makes registration and placement reuse visible |
| Observability | Trace from device completion to userspace-ready to decode-start proxy | Lets us measure useful restore instead of admiring raw bandwidth |
7. What to measure if we actually want to prove the idea
A convincing benchmark would restore synthetic KV objects from NVMe or remote DRAM into a registered arena and compare four modes: normal buffered file I/O, mmap, direct I/O, and a batched KV-Direct-style path.
| Metric | Why it matters |
|---|---|
| p50/p99 useful_restore_ns | Measures when the needed KV range becomes usable, not just when a read completes |
| cycles_per_gb_restored | Shows whether the host CPU becomes the hidden bottleneck |
| completion_cpu != consumer_cpu ratio | Captures locality misses that inflate wakeup and cache costs |
| descriptors_per_prefix | Exposes tiny-I/O amplification and completion noise |
| page_cache_bytes_polluted | Tests whether the system is pointlessly double-caching backing KV |
| decode-start proxy or TTFT proxy | Translates kernel wins into something the serving stack actually feels |
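Two of the metrics in the table reduce to simple post-processing over collected samples. The sketch below computes a nearest-rank percentile over restore latencies and a locality miss rate from per-completion CPU ids. The function names and the choice of the nearest-rank method are assumptions; other percentile definitions would give slightly different values on small sample sets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int u64_cmp(const void *pa, const void *pb)
{
    uint64_t a = *(const uint64_t *)pa, b = *(const uint64_t *)pb;
    return (a > b) - (a < b);
}

/* Nearest-rank percentile (pct in 1..100) over n samples; sorts in place. */
uint64_t kv_percentile_ns(uint64_t *samples, size_t n, unsigned pct)
{
    assert(n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof(*samples), u64_cmp);
    size_t rank = (n * pct + 99) / 100;  /* ceil(n * pct / 100), 1-based */
    return samples[rank - 1];
}

/* Fraction of completions handled on a CPU other than the consumer's. */
double kv_locality_miss_rate(const int *completion_cpu,
                             const int *consumer_cpu, size_t n)
{
    assert(n > 0);
    size_t misses = 0;
    for (size_t i = 0; i < n; i++)
        if (completion_cpu[i] != consumer_cpu[i])
            misses++;
    return (double)misses / (double)n;
}
```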
tracepoints worth joining:
- block_rq_issue
- block_rq_complete
- nvme completion hooks (for example, nvme_complete_rq)
- io_uring CQE visibility (for example, io_uring_complete)
- sched_wakeup
- sched_switch
- DMA / registration events when available
- runtime-level "decode resumed" marker
derived signals:
- useful_restore_ns
- locality_miss_rate
- deferred_completion_rate
- registration_churn_per_prefix
- page_cache_side_effect_bytes
The last of those signals, page-cache side effects, is important. A surprising amount of I/O work looks fine until you ask what it did to the rest of the machine. KV is already a cache. If your restore path pollutes host memory and steals locality from other work, you have only moved the pain around.
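Of the derived signals, `useful_restore_ns` is the one that most needs a precise definition. One possible definition, assumed here rather than taken from any existing tool, is the time from batch submission until the last decode-critical block is visible in the destination arena, ignoring non-critical blocks that finish later. The struct `kv_block_event` and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical joined trace record for one restored block. */
struct kv_block_event {
    uint64_t ready_ns;  /* when the block became visible to the consumer */
    int critical;       /* nonzero if the block was decode-critical */
};

/*
 * useful_restore_ns under the assumed definition: latest ready time among
 * decode-critical blocks, minus the submit time. Returns 0 if the batch had
 * no critical blocks.
 */
uint64_t kv_useful_restore_ns(uint64_t submit_ns,
                              const struct kv_block_event *ev, size_t n)
{
    assert(ev != NULL || n == 0);
    uint64_t latest = submit_ns;
    for (size_t i = 0; i < n; i++)
        if (ev[i].critical && ev[i].ready_ns > latest)
            latest = ev[i].ready_ns;
    return latest - submit_ns;
}
```

Comparing this value against the plain "last completion minus submit" time for the same batch shows how much of the measured restore latency was actually on the decode-critical path.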
8. Why this matters beyond Mooncake
I used Mooncake because it makes the problem legible, but the argument is broader than one project. Any future where inference systems spill, prefetch, disaggregate, or remotely restore KV state will run into the same question: how many times do we force Linux to rediscover the shape of a restore that the runtime already understands?
The systems opportunity here is modest but real. We do not need a magical kernel that understands LLMs. We need a kernel path that stops pretending every restore is just one more generic file or network operation.
9. Closing thought
Mooncake optimizes where KV lives. Linux can still optimize how KV moves.
That is the opportunity. Once KV cache becomes part of the distributed memory hierarchy, the OS data path matters again. The next win is not necessarily a better model or a better SSD. It may be a better restore contract between the inference runtime and Linux.
If distributed KV cache is the storage story of long-context inference, KV-Direct is the natural kernel sequel.