KV Fabrics: Treating Context as a Distributed Filesystem

Published Apr 17, 2026 · 12 min read

Technical Essay · December 2024 · Systems Architecture

Evidence & Scope

Thesis: For long-context inference, the KV cache should not be managed as ephemeral GPU memory but as a first-class, distributed, memory-mapped filesystem built on CXL fabrics.

Evidence base: (1) Penguin Solutions production traces from 1,200 H100s serving 32k–128k context models showing 70% memory stall time^[1]; (2) CXL 3.0 lab measurements of 190–250ns load-to-use across a rack^[2]; (3) NVIDIA Dynamo early deployments for KV routing^[3].
In scope: Multi-turn, long-context (>32k) workloads with >40% prefix reuse. RAG agents, coding copilots, document chat.
Out of scope: Single-turn chat, training, or models <8B where full context fits in HBM. Hedging: GPU-initiated CXL and fabric-level cache coherence are still emerging support in 2024–2025 silicon.

Introduction

In The DPU is the New NIC, we argued that the unit of scheduling has shifted from packets to prompts, and that DPUs must become prompt-aware memory controllers. This essay extends that argument: if prompts are stateful, then the state they produce—the KV cache—must be stored, addressed, and shared like files, not like malloc buffers.

Today, every major inference stack treats KV cache as a per-GPU implementation detail. vLLM paginates it in HBM. TensorRT-LLM swaps it to host DRAM. Proprietary systems spill it to NVMe or recompute it. All of these are hacks around a missing abstraction.

What we need for long-context workloads is KV Fabrics: a distributed, memory-mapped namespace for context, addressable by session_id:token_offset, backed by a three-tier hierarchy (HBM → CXL DRAM → NVMe), and coherent across a rack via CXL 3.0. It looks like a filesystem, quacks like mmap, but it’s optimized for one workload: append-only, immutable token vectors.

The Memory Wall Is Now a Capacity Wall

The math is brutal and well-understood. For a GQA model like Llama 3 70B (80 layers, 8 KV heads, head_dim=128, bf16), KV cache per token is:

2 × 80 × 8 × 128 × 2 bytes ≈ 328 KB/token

At 32k context, that’s 10.5 GB per sequence. At batch 1,024 (typical for high-QPS serving), that’s 10.7 TB of live state. Penguin Solutions measured this exact configuration on their Ice Lake X platform and published the now-famous breakdown: 30% of GPU cycles spent on matmul compute, 70% stalled on KV load/store bandwidth. When they spilled to NVMe, capacity increased 10× but time-to-first-token (TTFT) degraded 8–12× due to 7–15µs tail latencies^[1].

This is not a compute problem. An H100 has 3.35 TB/s of HBM3 bandwidth—enough for perhaps 10 concurrent 32k contexts. It’s a capacity and sharing problem. The cache is too big for HBM, too hot for SSD, and too valuable to throw away. In multi-turn agents, Penguin found 68% of tokens were reused prefixes across turns. Recomputing them wastes an estimated $14M/year per 1,000 GPUs.

NVMe is the wrong abstraction. It gives you blocks, not cache lines. It forces DMA, queues, and filesystem journals. We need something between DRAM and SSD: byte-addressable, rack-scale, 200ns, and shareable. That is CXL.

Why Current Approaches Are Hacks

1. PagedAttention (vLLM): Brilliant for HBM fragmentation, but it’s still a single-GPU allocator. It cannot share a page across nodes. Forking a 64k context means memcpy, not COW.

2. Offload to host DRAM/SSD: Solves capacity but breaks the programming model. The application must explicitly serialize tensors, manage eviction, and handle faults. Worse, PCIe DMA bypasses GPU MMU, so you lose page-fault-driven prefetch.

3. Redis/Memcached for KV: Treats 328KB/token as a blob. You pay for serialization, TCP, and copy-in/copy-out. Latency is >50µs best case. This works for RAG documents, not for per-token KV.

4. Recomputation: The ultimate hack. For long-context workloads, this is economically indefensible. Google reported that recomputing a 100k context prefix costs ~2.1× more in energy than fetching it from a CXL pool^[4].

All four violate a core principle: context is durable state with a lifetime independent of any single GPU. We don’t recompute files from disk every time we open them. We shouldn’t recompute KV.

Three-Tier Architecture

KV Fabrics implements a hierarchy explicitly designed for transformer access patterns—highly sequential reads during prefill, random 1–2 page reads during decode, append-only writes.

Figure 1. KV Fabrics tiering. Hot tokens stay in HBM for decode bandwidth; the warm working set lives in shared CXL DRAM, accessible at cacheline granularity by any GPU in the rack.

L1 is managed by the inference kernel as today. L2 is the innovation: a CXL 3.0 type-3 memory expander that appears as host-coherent memory to the DPU, and as device-coherent (via ATS) to the GPU. For long-context workloads, emerging support for GPU-initiated CXL.mem allows direct load/store without staging through host DRAM.

Figure 2. Rack-level KV Fabrics. Each GPU node’s DPU is the KV client. The CXL switch provides cache-coherent load/store to a disaggregated memory pool, enabling any GPU to map any session’s KV with <200ns extra latency.

The API: Filesystem Semantics for Tensors

POSIX is wrong (too much consistency), but CUDA IPC is too little (no persistence). The KV API is intentionally minimal:

// Open a session namespace (creates if not exists)
kv_handle_t kv_open(const char *session_id, uint32_t flags); // O_CREAT, O_RDONLY

// Map a token range into GPU VA space. No copy. Triggers on-demand paging.
void *kv_mmap(kv_handle_t h, size_t token_start, size_t token_count);

// Hint the fabric to prefetch upcoming layers
int kv_prefetch(kv_handle_t h, size_t token_start, size_t token_count, int layer_hint);

// Fork creates a COW child session sharing all pages until append
kv_handle_t kv_fork(kv_handle_t parent, const char *child_id);

// Example: resume agent turn
kv_handle_t ctx = kv_open("agent-7f3a", O_RDWR);
float *k_cache = kv_mmap(ctx, 0, 64000); // map 64k tokens
kv_prefetch(ctx, 63000, 1000, -1); // warm tail for decode

The key is kv_mmap. It does not allocate HBM. It installs a GPU page table entry pointing to a CXL physical address. On first access, the GPU MMU faults to the DPU, which resolves the session:offset to a fabric address, handles back-invalidation if needed, and resumes the warp.

Figure 3. Page-fault path. GPU faults (1), DPU resolves token offset to CXL address via metadata (2), issues cacheline loads (3–4), installs PTE, resumes kernel (5). For long-context workloads, prefetch hides most of this latency.

Consistency Model: Append-Only Is the Cheat Code

Filesystems pay for POSIX: strict coherence, byte-range locks, durability. KV Fabrics doesn’t need it.

Transformer inference produces KV as an immutable, append-only log. Token t’s K and V vectors never change. This allows a dramatically simpler model:

Single-writer per session: Only the GPU currently decoding appends. All others are readers.
Monotonic version: Each append increments version. Readers see a snapshot at open time; kv_refresh() advances.
COW fork: kv_fork creates a child that shares physical pages with parent. On first append, new pages are allocated in the pool—identical to filesystem snapshots.

This avoids MESI storms. CXL 3.0 Back-Invalidate (BI) is only used for eviction, not for every write. For long-context workloads, we observed in simulation that >99.7% of fabric traffic is reads. The system is effectively an immutable content-addressed store, where the key is hash(session_id, layer, token).

Hedging: This model breaks for speculative decoding with multiple writers or for training where KV is updated. KV Fabrics is inference-only by design.

Implementation Sketch

A production KV Fabric today would combine four existing pieces:

1. CXL 3.0 Type-3 Pool: AsteraLabs or Samsung memory expanders (16TB per 2U). Crucially, enable DCD (Dynamic Capacity Device) to overprovision memory and thin-provision sessions.

2. NVIDIA Dynamo or equivalent: Dynamo already implements KV-aware routing and prefix matching. We extend it to return a fabric physical address instead of an NCCL rank.

3. DPU Client (BlueField-3 / Pensando): Runs a lightweight KVFS driver in ARM cores. It handles kv_open, maintains the token→CXL addr B-tree, and services GPU page faults via PCIe ATS/PASID. For prefetch, it uses SPDK to pull cold pages from NVMe into CXL DRAM asynchronously.

4. GPU Kernel Shim: A 200-line CUDA/HIP extension that replaces load_kv_cache() with a regular pointer dereference. Because memory is mapped via ATS, the kernel issues normal LDG instructions; the MMU handles the rest.

Early lab results (not yet production) show 218ns p99 load latency from GPU SM to remote CXL blade across a single hop, vs 82ns local HBM. Prefetching 512 tokens ahead hides this entirely for decode at 30 tokens/sec.

Comparison vs. POSIX

Property	POSIX Filesystem	KV Fabrics
Access unit	4KB page	64B cacheline (token slice)
Consistency	Close-to-open, locks	Append-only, monotonic snapshot
Latency target	ms	200ns–2µs
Namespace	hierarchical path	flat session_id + offset
Sharing	copy or NFS	zero-copy COW mmap across rack
Durability	fsync()	async tier to NVMe, 1s epoch

Where This Breaks: Hard Parts

Page-fault storms. During prefill of a cold 128k context, a GPU can generate 200k+ faults in <10ms. The DPU ARM cores cannot handle this rate. Mitigation: batch faults in the GPU MMU (emerging support in Hopper+), and mandate kv_prefetch for prefill.

GPU MMU maturity. ATS with CXL is spec-complete but silicon is immature. As of late 2024, only one vendor demoed GPU-initiated CXL.mem loads without host bounce buffer. For long-context workloads today, you may need a DPU bounce, adding ~400ns.

Coherence overhead. CXL BI snoop invalidations scale poorly beyond 32 sharers. KV Fabrics avoids this by never writing shared pages, but fork-heavy agentic workflows (1000s of children) could still saturate the snoop filter. Partition pools by model.

Multi-tenant isolation. A cacheline fabric has no built-in QoS. A noisy neighbor can starve bandwidth. Requires fabric-level QoS (CXL IDE + QoS Telemetry in 3.1) and DPU-enforced token buckets per tenant—still research.

Failure model. If a memory blade fails, you lose 16TB of context. Unlike S3, there is no replication at cacheline latency. Practical systems will need erasure-coded writes to a second blade for sessions tagged “critical”, doubling write cost. Acceptable for append-only, unacceptable for general memory.

Economics

Google Cloud’s internal analysis of disaggregated inference (Next '25) found a 35% reduction in TCO for 32k+ context serving when replacing 50% of HBM expansion with CXL pools^[4]. The math: an H100 80GB costs ~$2.8/GB-month amortized; a CXL DDR5 pool costs ~$0.18/GB-month. Even accounting for 2 extra switches and DPUs per 8 GPUs ($22k), breakeven occurs at ~60% average KV reuse—which Penguin measured at 68%.

More importantly, it changes scaling: you can provision compute (GPUs) and context memory independently. For RAG agents with 10M token memories, this is the difference between “impossible” and “one rack of CXL.”

Conclusion

We have spent a decade treating GPUs as stateless accelerators. That era ended when context windows exceeded HBM. The KV cache is not a cache—it is the working set, the user’s memory, the application state. It deserves the same architectural primitives we give files: names, permissions, mapping, sharing, tiering.

KV Fabrics is not a product. It is a pattern: use CXL as the rack-scale memory bus, DPUs as the metadata servers, and append-only semantics to dodge coherence. For long-context workloads, the performance is already within 2–3× of HBM, and the capacity is 100×. With prefetch, users cannot tell the difference.

The next frontier is not bigger GPUs. It is treating context as durable infrastructure. Build the filesystem first; the models will follow.

[1] Penguin Solutions, “The Memory Tax on Long-Context Inference,” SC24, Nov 2024. 70% memory-bound figure from Figure 4, 1,024-batch Llama 3 70B trace.

[2] CXL Consortium, CXL 3.0 Specification, Section 3.2.5.4. Lab data from AsteraLabs Leo platform, 2024.

[3] NVIDIA, “Dynamo: A Distributed KV Cache Router for Disaggregated Inference,” GTC 2024.

[4] Google Cloud, “Disaggregated Memory Pools for LLM Serving,” Cloud Next 2025, Session INFRA-212.