
Why "Disk → RDMA → GPU" Is Still Fragmented Today

We have local direct-storage acceleration and network direct-memory acceleration, but we still do not have a universal GPU-native end-to-end storage fabric. The missing layer is orchestration.

Manish K · April 2026 · ~15 min read · Technical Essay
[Diagram: Ideal vs Real — the fragmented Disk→RDMA→GPU path. Ideal path: NVMe → fabric / RDMA NIC / DPU → GPU HBM, descriptor-driven and GPU-first with no host staging. Real fragmented path: a host seam of registrations, staging, and protocol glue — NVMe → host MR/registration → pinned buffer → NIC/DPU handoff → GPU HBM. Missing primitive: remote storage namespace → fabric → GPU memory with no host seam in the middle.]

There is a seductive idea in modern AI infrastructure: data should move from storage to the GPU almost as if the entire system were one continuous fabric. No wasteful CPU copies. No host-memory bounce buffers. No awkward handoffs between storage, networking, and accelerator memory. Just descriptors, DMA, and compute.

That vision is directionally right. But the current reality is more fragmented than it first appears. Today, we have strong building blocks for local direct-storage access and for direct network-to-GPU transfers, but we still do not have a universal, GPU-native end-to-end storage fabric that makes disk → fabric → GPU feel like one seamless primitive.

The industry has powerful pieces of the puzzle, but the whole path is still stitched together across storage semantics, network semantics, GPU memory constraints, and host-side orchestration.

Control path vs. data path

The cleanest way to reason about these systems is to split them into two planes. The goal is not to eliminate the CPU completely — it is to keep the CPU on the control path and off the data path.

Control path

CPU, drivers, runtimes, and orchestration software set up registrations, descriptors, queue pairs, permissions, mappings, and transport policy. This is where software belongs.

Data path

The bytes themselves move via DMA, PCIe, RDMA, or storage transports. This is where you want the CPU to disappear as much as possible.

Technologies like GPUDirect Storage and GPUDirect RDMA are important precisely because they do not abolish system software — they reduce the need for the CPU to shepherd payload bytes through host memory.
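The two-plane split can be sketched as a toy model. All names here are illustrative, not a real driver API: a control plane does registration once, and the "DMA" is a bulk copy with no per-byte CPU decision making.

```python
# Toy model of the control-path / data-path split.
# All names are illustrative -- this is not a real driver or NIC API.

class ControlPlane:
    """CPU-side setup: registrations and mappings, done once, off the hot path."""
    def __init__(self):
        self.registrations = {}

    def register_buffer(self, name, size):
        # In a real system this would pin pages and program IOMMU/NIC mappings.
        buf = bytearray(size)
        self.registrations[name] = buf
        return buf

def dma_transfer(src, dst):
    """Data path: one bulk move with no per-byte CPU involvement.
    (A slice copy stands in for a hardware DMA engine.)"""
    dst[:len(src)] = src

cp = ControlPlane()
storage = cp.register_buffer("nvme_extent", 8)
gpu_hbm = cp.register_buffer("gpu_hbm", 8)

storage[:] = b"payload!"        # bytes appear at the source
dma_transfer(storage, gpu_hbm)  # one bulk move; the CPU only did setup
print(bytes(gpu_hbm))           # b'payload!'
```

The point of the sketch: the CPU appears only in `ControlPlane`, never inside `dma_transfer`.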

What GPUDirect Storage actually solves

GPUDirect Storage is the strongest current answer for the local-box problem. It changes the path between storage and GPU memory so that data can avoid the classic CPU bounce-buffer route.

Older path:
NVMe SSD → host/page cache → user/runtime buffers → GPU

GDS-style path:
NVMe SSD → GPU memory  (direct DMA)

But there is a subtle architectural condition that matters a lot: GDS helps most when the GPU is the first and/or last meaningful agent to touch the data. If the CPU still needs to parse, inspect, transform, decompress, or reformat the payload before the GPU can use it, a lot of the benefit disappears.

GPUDirect Storage is not merely a transport trick. It is also a test of whether the application itself is GPU-native enough to keep the payload off the host path.
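That condition can be made concrete by counting how many times the payload is materialized on each route. The hop names below are illustrative; the model just shows that a required CPU transform quietly reinserts a host hop into an otherwise direct path.

```python
# Toy copy-count model: bounce-buffer path vs GDS-style direct path.
# Hop names are illustrative stand-ins, not real subsystem identifiers.

def count_copies(path):
    # Each hop (a -> b) materializes the payload once at its destination.
    return len(path) - 1

bounce_path = ["nvme_ssd", "host_page_cache", "user_buffer", "gpu_memory"]
gds_path    = ["nvme_ssd", "gpu_memory"]

def effective_path(cpu_must_transform):
    # If the CPU must parse, decompress, or reformat the payload first,
    # the payload is materialized host-side and the direct benefit erodes.
    if cpu_must_transform:
        return ["nvme_ssd", "host_buffer", "gpu_memory"]
    return gds_path

print(count_copies(bounce_path))            # 3 host-mediated copies
print(count_copies(effective_path(False)))  # 1 direct DMA
print(count_copies(effective_path(True)))   # 2 -- the host hop is back
```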

What GPUDirect RDMA actually solves

GPUDirect RDMA is the network-side analog. If GDS is fundamentally about storage ↔ GPU, then GPUDirect RDMA is fundamentally about NIC/DPU ↔ GPU.

Remote source → RDMA fabric → NIC / DPU → GPU memory

That matters enormously in distributed training and inference, because it lets network traffic land much closer to the GPU without repeatedly bouncing through CPU memory. The direct-network story in modern AI clusters is much stronger today precisely because the system increasingly wants the network endpoint and the accelerator to speak directly, while the CPU mostly programs setup, policy, and recovery.
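The shape of that direct landing can be sketched as a toy one-sided write: the CPU registers a memory region once, and afterward the NIC places bytes straight into it. Everything here is a simplified stand-in (a real path would use RDMA verbs against a GPUDirect-exposed region), and the class and key names are hypothetical.

```python
# Toy model of a one-sided RDMA write landing in GPU-registered memory.
# Names are illustrative; no real verbs API is being modeled here.

class Nic:
    def __init__(self):
        self.mr_table = {}          # rkey -> registered target buffer

    def register_mr(self, rkey, buf):
        # Control path: the CPU programs this mapping once.
        self.mr_table[rkey] = buf

    def rdma_write(self, rkey, offset, payload):
        # Data path: the NIC places bytes directly into the registered
        # region -- no host-memory bounce, no CPU work per transfer.
        buf = self.mr_table[rkey]
        buf[offset:offset + len(payload)] = payload

gpu_hbm = bytearray(16)             # stands in for GPU memory
nic = Nic()
nic.register_mr(rkey=0x42, buf=gpu_hbm)
nic.rdma_write(rkey=0x42, offset=0, payload=b"gradients")
print(bytes(gpu_hbm[:9]))           # b'gradients'
```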

The missing primitive

The real missing primitive is not "fast local storage" or "fast direct network access" in isolation. The missing primitive is:

remote NVMe / storage namespace
   ↓
fabric transport
   ↓
GPU memory on another host
(no ugly host-memory seam in the middle)

Today, the pieces exist but they are still not unified into a single, GPU-first abstraction. We have one family of technologies for local direct-storage access and another for direct network-to-GPU movement, but the overall system still tends to stitch them together through protocol assumptions and host-centric control glue.
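One way to see the gap is as an API shape. Both functions below are hypothetical — no such unified call exists today, which is exactly the point — and `fetch_over_fabric` is a stand-in for the transport:

```python
# Sketch of the stitched path vs the missing primitive, as API shapes.
# All function names are hypothetical illustrations.

def fetch_over_fabric(namespace, offset, length):
    # Stand-in for an NVMe-oF read from a remote namespace.
    return bytes(namespace[offset:offset + length])

def read_remote_to_gpu_today(namespace, offset, length, gpu_buf, host_staging):
    """The stitched path: the transport lands bytes in a host-registered
    buffer, then a second transfer reaches GPU memory."""
    remote = fetch_over_fabric(namespace, offset, length)
    host_staging[:length] = remote           # the host seam
    gpu_buf[:length] = host_staging[:length] # second hop to the GPU
    return 2  # payload materialized twice before the GPU sees it

def read_remote_to_gpu_missing(namespace, offset, length, gpu_buf):
    """The missing primitive: remote namespace -> fabric -> GPU memory,
    descriptor-driven, with no host-memory rendezvous."""
    gpu_buf[:length] = fetch_over_fabric(namespace, offset, length)
    return 1

ns = bytearray(b"checkpoint-shard")
gpu_a, stage = bytearray(16), bytearray(16)
gpu_b = bytearray(16)
print(read_remote_to_gpu_today(ns, 0, 16, gpu_a, stage))    # 2
print(read_remote_to_gpu_missing(ns, 0, 16, gpu_b))         # 1
```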

Why NVMe-over-Fabrics is not the whole answer

NVMe over Fabrics is a critical part of the story because it extends NVMe semantics over network fabrics. But it is not, by itself, the full "remote storage directly into GPU memory" answer.

The reason is conceptual: storage-fabric transports were built in a host-centric world. The buffer and registration models often still assume host-visible, host-registered memory as the primary abstraction. That lineage matters — the protocol language is powerful, but still not inherently GPU-native from end to end.

What NVMe-oF gives you

Transported NVMe semantics across a fabric — essential for disaggregated storage.

What it doesn't give you

A universal guarantee that the destination abstraction is GPU-first rather than host-memory-first.

Why it matters

A host-centric buffer model naturally makes host staging or host registration the seam where fragmentation reappears.

Where staging sneaks back in

Even in highly optimized systems, staging and bounce regions still show up for recurring reasons:

Transport assumptions. The protocol and registration model still thinks first in terms of host memory regions or host-visible buffers, so host-side staging remains the natural rendezvous point.

Application behavior. The CPU still parses, decompresses, frames, serializes, or converts payloads before the GPU can consume them, so even excellent lower layers leave the end-to-end path indirect.

Control-plane admissibility. Registrations, permissions, mappings, or peer compatibility are not ready on the fast path, so the system falls back into staging, bounce buffers, or extra handoff logic.

The real enemy is not just copies. It is copy-worthy seams in the architecture.
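The fallback behavior behind those three failure modes can be sketched in a few lines. The condition names are illustrative, but the structure mirrors what real stacks do: if any fast-path precondition fails, the payload silently routes through a host bounce buffer.

```python
# Sketch of why staging reappears: any failed fast-path precondition
# forces a host bounce. Condition names are illustrative.

def choose_data_path(gpu_mr_registered, transport_gpu_native, cpu_needs_payload):
    if cpu_needs_payload:
        return "host_staging"   # application behavior forces the seam
    if not transport_gpu_native:
        return "host_staging"   # transport assumptions force the seam
    if not gpu_mr_registered:
        return "host_staging"   # control plane wasn't ready on the fast path
    return "direct_to_gpu"

# Only when every condition holds does the payload stay off the host:
print(choose_data_path(True, True, False))   # direct_to_gpu
print(choose_data_path(True, False, False))  # host_staging
print(choose_data_path(False, True, False))  # host_staging
```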

The three best cases today

[Diagram: Best cases today — ① best local-box case: NVMe SSD → GDS/cuFile → GPU HBM (strongest local answer; GPU is first consumer). ② best network-box case: remote source → RDMA fabric → NIC/DPU → GPU HBM (strongest network answer; CPU mostly off the data path). ③ not-yet-unified gap: remote NVMe → NVMe-oF → host orchestration → NIC/DPU → GPU HBM, where abstractions stay layered and host staging persists — the whole path is not yet admissible end-to-end as a single GPU-native fabric primitive.]
Figure 1. The three best cases today. Local GDS and network GPUDirect RDMA are strong individually — the gap is the unified remote-storage-to-GPU path with no host seam.

1. Best local-box case

NVMe SSD → GDS/cuFile path → GPU memory → GPU compute. The strongest local answer available today, especially when the GPU is the first meaningful consumer of the bytes.

2. Best network-box case

Remote source → RDMA fabric → NIC/DPU → GPU memory. The strongest network-side story: once the data is already on the wire, the system can keep the CPU largely out of the payload path.

3. Not-yet-fully-unified case

Remote NVMe → NVMe-oF/RDMA transport → host-side orchestration/registration/possible staging → NIC/DPU/PCIe → GPU memory. This is the important gap — the stack is powerful, but the abstractions are still layered rather than fully unified.

Why DPUs matter so much

The DPU is strategically important because it can act as a data-path orchestrator at the edge of networking and storage, much closer to where bytes first enter or leave the system than the main CPU. A DPU can help terminate transport logic, participate in networking and storage control planes, program mappings, manage queues, and keep the host CPU out of the payload path.

Host-centric model

CPU remains the rendezvous point and repeatedly mediates transport, memory setup, and byte movement across subsystems.

DPU-assisted model

DPU absorbs more of the control and transport orchestration near the fabric edge, helping the payload stay closer to a direct path.

In that sense, the DPU is one of the missing architectural bridges between storage-native and GPU-native worlds.
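The difference between the two models can be reduced to a simple question: how many times does the host CPU touch the payload? The hop lists below are illustrative, not a real system trace.

```python
# Toy comparison: who touches the payload in a host-centric layout
# versus a DPU-assisted one. Hop lists are illustrative.

def host_payload_touches(hops):
    # Count hops where the host CPU materializes or moves payload bytes.
    return sum(1 for owner, touches_payload in hops if owner == "host_cpu" and touches_payload)

host_centric = [
    ("host_cpu", True),    # stages payload in host memory
    ("host_cpu", True),    # copies payload toward the GPU
]
dpu_assisted = [
    ("dpu", False),        # terminates transport, programs mappings (control only)
    ("dma_engine", True),  # payload moves directly fabric -> GPU
]

print(host_payload_touches(host_centric))  # 2
print(host_payload_touches(dpu_assisted))  # 0
```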

Where CXL fits

CXL is not "disk to GPU over RDMA," but it matters because it changes the shape of the memory system around the problem. Once memory becomes more pooled, shared, and fabric-aware, the architecture has a much better chance of reducing rigid host-centric seams.

Treating staging areas as pre-arranged fabric resources with better alignment between endpoints, rather than as ad hoc host buffers, doesn't magically solve the problem, but it clearly pushes the industry toward a more composable, less bounce-heavy memory hierarchy.

The real systems insight

The hardest part here is not inventing yet another faster link. The hardest part is making the whole path admissible. The destination must be registered correctly. The transport semantics must align with the memory model. The consumer must be able to accept the payload in place. The software stack must avoid re-materializing the bytes into host-native forms.
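Admissibility is a property of the whole chain, not of any single link, which a few lines make concrete. The hop names and condition fields are illustrative:

```python
# Sketch: the path is admissible only if every hop accepts the payload
# in place. One misaligned hop degrades the entire chain.
# Hop names and condition fields are illustrative.

def path_is_admissible(hops):
    return all(h["dst_registered"] and h["semantics_match"] and h["in_place_ok"]
               for h in hops)

path = [
    {"name": "nvme->fabric", "dst_registered": True, "semantics_match": True,  "in_place_ok": True},
    {"name": "fabric->nic",  "dst_registered": True, "semantics_match": True,  "in_place_ok": True},
    {"name": "nic->gpu",     "dst_registered": True, "semantics_match": False, "in_place_ok": True},
]
print(path_is_admissible(path))  # False -- one misaligned hop breaks it

path[2]["semantics_match"] = True
print(path_is_admissible(path))  # True -- now the whole chain is direct
```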

The bottleneck is not storage or compute alone. It is data movement orchestration across the whole system.

Today we have local direct-storage acceleration and network direct-memory acceleration, but not yet a universal GPU-native end-to-end storage fabric. The missing layer is the orchestration that makes disk, fabric, NIC/DPU, and GPU memory behave like one continuous path instead of separate subsystems stitched together by host staging.