RDMA · AI Infrastructure · Zero-Copy · Disaggregated Inference

RDMA in the Age of AI

Remote Direct Memory Access is old enough to feel familiar and new enough to matter again. In the AI era, RDMA is no longer just an HPC detail — it sits in the critical path of GPU clusters, KV-cache transfer, and disaggregated inference. The real question isn't "is the link fast?" It's: where do copies still remain, and which software layer has the right to reason about them?

By Manish KL · April 2026 · ~19 min read · Technical Essay
Abstract

This post covers RDMA from past to present: what it is; how the verbs and memory-registration model works; why InfiniBand, iWARP, and RoCE mattered; why cloud RDMA and GPU-aware communication matter again today; how EFA, UCX, and NIXL fit into modern AI stacks; and why "RDMA exists" still does not mean true end-to-end zero-copy. The key modern insight: once RDMA improves the transport path, the next performance frontier is copy-boundary awareness. Not more RDMA, but understanding where copies still hide and which software layer can route around them.

Why RDMA matters again

RDMA has always been about removing software overhead from the data path. That old motivation has not changed. What has changed is where the pressure now comes from. In classic HPC, the goal was tightly coupled distributed compute with low latency and high bandwidth. In modern AI systems, the pressure comes from massive model state, distributed serving, KV-cache transfer, and the cost of moving data across heterogeneous memory and compute domains.

AI has turned memory movement into the main event. Large models don't simply need more FLOPs — they need fewer wasteful copies, fewer kernel crossings, lower host intervention, and better control over where tensors land and how they travel.

RDMA reduces software in the path. AI makes the path itself strategic.
Origin: HPC clusters
Present: GPU inference
Core win: fewer copies
Next frontier: copy-boundary visibility

What RDMA actually is

RDMA — Remote Direct Memory Access — lets one machine access memory on another machine with much less CPU and operating-system involvement than ordinary socket-based data paths. The exact details differ by fabric and software stack, but the core idea is consistent: memory regions are registered, work requests are posted to queue pairs, and an RDMA-capable NIC executes the transfer using hardware offload and direct placement rather than ordinary buffered network I/O.

The practical mental model: register memory, hand descriptors to the RNIC, let the NIC move bytes directly — and avoid dragging the CPU through every hop.
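That mental model can be made concrete with a toy simulation. This is not the real verbs API (production code uses libibverbs in C, with calls like ibv_reg_mr and ibv_post_send); the names MemoryRegion, WorkRequest, and QueuePair below are illustrative stand-ins that mimic the flow: register memory, post a descriptor, let "hardware" execute the transfer and signal a completion queue.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryRegion:
    """A registered (pinned) buffer the RNIC may access directly."""
    key: int
    buf: bytearray

@dataclass
class WorkRequest:
    """Descriptor handed to the NIC: what to move, between which MRs."""
    op: str              # e.g. "RDMA_WRITE"
    local: MemoryRegion
    remote: MemoryRegion

class QueuePair:
    """Toy queue pair: the CPU posts descriptors; 'hardware' drains them."""
    def __init__(self):
        self.send_queue: List[WorkRequest] = []
        self.completion_queue: List[str] = []

    def post_send(self, wr: WorkRequest) -> None:
        # The CPU's only job on the data path: enqueue a descriptor.
        self.send_queue.append(wr)

    def hw_process(self) -> None:
        # Stand-in for the RNIC: moves bytes directly between registered
        # regions, then signals completion. No per-byte CPU involvement.
        while self.send_queue:
            wr = self.send_queue.pop(0)
            wr.remote.buf[:] = wr.local.buf
            self.completion_queue.append(f"{wr.op} done")

# Register memory, post a one-sided write, let the "NIC" execute it.
src = MemoryRegion(key=1, buf=bytearray(b"kv-cache-block"))
dst = MemoryRegion(key=2, buf=bytearray(len(src.buf)))
qp = QueuePair()
qp.post_send(WorkRequest("RDMA_WRITE", src, dst))
qp.hw_process()
```

The point of the sketch is the division of labor: the CPU touches descriptors, not payload bytes, and learns about the transfer only through the completion queue.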
[Figure: RDMA core mechanics, the verbs model. Host A posts work requests over registered, pinned memory regions to a queue pair (send queue, receive queue); the RNIC executes transfers with kernel bypass and signals a completion queue. On Host B, a one-sided RDMA write lands directly in registered memory with the CPU/OS not in the data path. Fabric: IB, RoCE, iWARP, or EFA.]
Figure 1. The RDMA verbs model: queue pairs, registered memory regions, and hardware-executed transfers that bypass the kernel on the data path.

One-sided vs two-sided

RDMA systems usually distinguish between two-sided and one-sided operations. Two-sided send/receive still requires the remote side to have prepared a receive buffer in advance. One-sided read/write lets one side directly read or write registered memory on the peer, with no equivalent remote-side software action in the critical path — part of why RDMA has long been attractive for moving large state objects with lower synchronization overhead than ordinary message passing.
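The distinction can be sketched in the same toy style, again with hypothetical names (Peer, two_sided_send, one_sided_write) rather than real verbs calls: a two-sided operation fails unless the target pre-posted a receive buffer, while a one-sided write lands in registered memory with no remote software action at all.

```python
class Peer:
    """Toy target: two-sided ops need a pre-posted receive buffer;
    one-sided ops land directly in registered memory."""
    def __init__(self, mr_size: int):
        self.registered = bytearray(mr_size)  # registered MR
        self.recv_queue = []                  # pre-posted receive buffers

    def post_recv(self, size: int) -> None:
        self.recv_queue.append(bytearray(size))

def two_sided_send(peer: Peer, payload: bytes) -> bytearray:
    # Fails unless the remote side prepared a buffer first.
    if not peer.recv_queue:
        raise RuntimeError("receiver not ready: no posted receive")
    buf = peer.recv_queue.pop(0)
    buf[:len(payload)] = payload
    return buf

def one_sided_write(peer: Peer, payload: bytes, offset: int = 0) -> None:
    # No remote software action: bytes land in registered memory.
    peer.registered[offset:offset + len(payload)] = payload

peer = Peer(mr_size=16)
one_sided_write(peer, b"state")      # works with no remote preparation
try:
    two_sided_send(peer, b"msg")     # raises: nothing was pre-posted
except RuntimeError as e:
    print(e)
peer.post_recv(16)
two_sided_send(peer, b"msg")         # succeeds once a receive is posted
```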

The historical arc

RDMA did not arrive because of LLMs. It emerged from decades of pressure in high-performance systems: distributed databases, storage targets, MPI-heavy clusters, low-latency trading infrastructure, and large scientific computing environments all wanted lower latency, lower CPU overhead, and better bandwidth efficiency.

[Figure: RDMA, from HPC primitive to AI-critical infrastructure. 1990s–2000s: VIA initiative, InfiniBand spec, HPC roots. 2000s–2010s: iWARP and RoCE, storage networks, databases join. 2010s: GPUDirect RDMA, GPU clusters, UCX emerges. 2020s: cloud RDMA and EFA, large-scale AI, NIXL and NCCL. Now: KV-cache transfer, disaggregated inference, copy-boundary control. Center of gravity: from HPC specialist tool to cloud AI critical path.]
Figure 2. RDMA's center of gravity has moved from classic HPC toward cloud AI, GPU communication, and disaggregated inference.
Technology | What it brought | Why it mattered
InfiniBand | native RDMA fabric, verbs model, mature HPC tooling | low-latency clusters and tightly coupled distributed compute
iWARP | RDMA over TCP/IP networks | extended RDMA semantics into conventional network environments
RoCE | RDMA over Ethernet | brought RDMA ideas into data-center Ethernet ecosystems
GPUDirect RDMA / UCX | GPU-aware data movement, modern comm frameworks | made RDMA relevant to distributed GPU workloads and AI pipelines

Zero-copy is the dream — not the whole truth

RDMA is often associated with "zero-copy," and that intuition is directionally right. If a registered memory region is transferred directly by the NIC into a remote registered region, then the path can avoid a lot of the extra copying that ordinary socket paths incur. But "RDMA exists" is not the same thing as "the full system is end-to-end zero-copy."

In real systems, extra copies can still appear at multiple points:

- registration and staging on the producer side, including bounce buffers between GPU memory and a DMA-able region
- descriptor packing and metadata serialization before the transfer is posted
- host-side landing zones on the consumer, where bytes arrive before reaching their final destination
- framework handoff, where data is rematerialized into the buffers the serving framework actually uses

RDMA reduces copies. It does not make copy analysis optional.

The architecture of an RDMA path in AI serving

The AI-era RDMA question is not only "do we have an RNIC?" It's "where does the tensor begin, where is it registered, where does it land, and what software layers still touch it?" In disaggregated inference, that question becomes especially important because KV-cache movement sits directly in the serving critical path.

[Figure: RDMA in AI serving, where copies actually hide. Producer (prefill node): KV origin in GPU HBM, registration/staging with a possible bounce buffer, descriptor packing and metadata serialization, then the producer RNIC. Transport path: InfiniBand, RoCE, iWARP, or EFA via libfabric/UCX/NIXL — lower-copy movement, but not always end-to-end. Consumer (decode node): RNIC DMA into host memory, a host-side landing zone, framework handoff with possible rematerialization, and a final copy into decode GPU memory. Copies survive at the edges even when the transport path is RDMA-clean.]
Figure 3. Copy boundaries in a real AI serving RDMA path. The transport middle is clean; the edges at both producer and consumer still hide multiple copy opportunities.
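One way to make this concrete is to treat copy depth as a number you can compute per path. The sketch below is a hypothetical accounting model, not a real profiler: the hop names are taken from the stages Figure 3 describes, and the specific counts are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    name: str
    is_copy: bool  # does this hop duplicate the bytes?

def copy_depth(path) -> int:
    """End-to-end copy count for a transfer path."""
    return sum(h.is_copy for h in path)

# Hypothetical KV-cache paths: the RDMA middle is copy-free in both,
# but the edges differ in staging, landing, and handoff behavior.
naive_path = [
    Hop("prefill GPU HBM -> host staging buffer", True),
    Hop("RDMA transport (RNIC to RNIC)", False),
    Hop("host-side landing zone", True),
    Hop("framework rematerialization -> decode GPU", True),
]
tuned_path = [
    Hop("prefill GPU HBM (registered directly)", False),
    Hop("RDMA transport (RNIC to RNIC)", False),
    Hop("decode GPU HBM (direct placement)", False),
]
print(copy_depth(naive_path))  # 3 copies despite an RDMA-clean middle
print(copy_depth(tuned_path))  # 0
```

Both paths "use RDMA," yet one moves every byte three extra times — which is exactly why the transport label alone tells you so little.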

What changed in the cloud era

Historically, RDMA was associated with tightly managed HPC fabrics and special-purpose low-latency clusters. In the cloud era, providers introduced their own lower-latency, hardware-assisted communication models. On AWS, Elastic Fabric Adapter is exposed through libfabric; NVIDIA's inference-oriented transfer work has surfaced as NIXL with EFA support for disaggregated inference and KV-cache movement.

That's a modern signal: RDMA-style thinking is no longer limited to training collectives or MPI clusters. It's now explicit in the inference serving stack. And the moment AWS says "KV cache moves using RDMA," the technical question becomes sharper — it's no longer enough to say the network is good. The right question is: what is the copy depth of the end-to-end path?

Why RDMA matters specifically in AI

AI makes RDMA newly visible because the dominant objects are large and latency budgets are painful. Training has long cared about collective communication and GPU-to-GPU traffic. Inference adds a different pressure: KV cache is not just a background artifact — it's a serving primitive that has to move across systems in a path that directly affects throughput, inter-token latency, and tail behavior.

AI context | Why RDMA helps | What still needs attention
distributed training | lower CPU overhead, lower latency, better bandwidth efficiency for collectives | topology, NIC/GPU affinity, locality-aware placement
KV-cache transfer in inference | lower-copy path for large state movement between prefill and decode | landing zones, framework handoff, tail-latency sensitivity
memory tiering / offload | faster movement between memory or storage tiers | registration overhead, staging buffers, software orchestration
In AI infrastructure, RDMA is not only about low latency. It is about moving large live state objects through the system without wasting host cycles and memory bandwidth on unnecessary copies.

Why UCX, EFA, libfabric, and NIXL matter

Modern AI stacks rarely expose "raw verbs" directly as the user-facing abstraction. Instead, systems are built on communication frameworks and adapters that decide how to exploit the best available hardware resources. UCX is a high-performance communication framework that can target RDMA transports and GPU-aware paths. AWS's EFA is exposed through libfabric. NVIDIA's NIXL is an inference-oriented transfer library for point-to-point data movement in disaggregated serving environments.

[Figure: the AI communication stack, layers above raw RDMA. Top to bottom: application/inference framework (vLLM, TensorRT-LLM, SGLang, custom serving); orchestration/policy layer (NIXL, workload-aware routing, copy-boundary analysis — the strategic moat); communication framework (UCX, libfabric, NCCL, MPI); transport layer (RoCE v2, InfiniBand, iWARP, EFA); hardware/RNIC (ConnectX, EFA NIC, custom silicon).]
Figure 4. The AI communication stack. The strategic moat is shifting upward — from who owns the transport to who owns the policy layer that knows which paths deserve trust for which workloads.

That layering matters because it leads to a subtle but important conclusion: even when an inference system "uses RDMA," the strategic software question may not only be which transport library is underneath. It may also be which layer decides how workloads should be routed across those transport paths, and how copy-heavy versus lower-copy paths should be identified.

The new frontier: copy-boundary awareness

This is where the conversation should go next. Once RDMA improves the transport path, the next performance frontier is not simply "more RDMA." It is understanding where copies still hide and how software can adapt around them.

That means asking questions like:

- Where is a tensor first materialized, and where is it registered?
- How many copies does the full path incur, end to end, counting staging buffers and landing zones?
- Which software layer can actually observe those copy boundaries?
- Which workloads are latency-sensitive enough that they should be routed onto the lowest-copy paths?

New systems question: once RDMA is present, what remains is not just a transport problem. It becomes a copy-boundary analysis problem and, eventually, a policy problem.
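What would a policy layer built on copy-boundary analysis look like? The fragment below is a deliberately simplified sketch of the idea, with made-up path names and numbers: given candidate paths annotated with copy depth and bandwidth, route latency-sensitive workloads onto the lowest-copy path and throughput workloads onto the highest-bandwidth one.

```python
# Hypothetical policy-layer sketch: paths is a list of
# (name, copy_depth, bandwidth_gbps) tuples; all values are illustrative.

def pick_path(paths, latency_sensitive: bool):
    if latency_sensitive:
        # Minimize copies first, then break ties on bandwidth.
        return min(paths, key=lambda p: (p[1], -p[2]))
    # Throughput workloads: bandwidth first, copies as tiebreaker.
    return max(paths, key=lambda p: (p[2], -p[1]))

paths = [
    ("gpu-direct rdma", 0, 100.0),
    ("host-staged rdma", 2, 200.0),
    ("tcp fallback", 4, 40.0),
]
print(pick_path(paths, latency_sensitive=True)[0])   # gpu-direct rdma
print(pick_path(paths, latency_sensitive=False)[0])  # host-staged rdma
```

The interesting part is not the two-line heuristic; it is that the inputs (per-path copy depths) have to come from somewhere. Producing them is the copy-boundary analysis problem this section is arguing for.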

Past, present, and future

In the past, RDMA was a specialist tool for people who lived in low-latency clusters and HPC fabrics. In the present, AI has made it visible again because the software stack is suddenly full of large, live state objects that cannot afford ordinary copy-heavy paths. In the future, the most interesting systems may not simply advertise "we use RDMA." They may expose copy-boundary-aware control planes that know where data lands, how many copies the full path requires, and which workloads should be protected from paths with worse copy behavior.

The future is not only "RDMA everywhere." It is "copy-boundary visibility everywhere."

That is why RDMA deserves a fresh look in the age of AI. It is not a historical relic — it is a foundation. But foundations do not eliminate architectural work above them. Once transport becomes better, the next software moat may move upward: into the policy and orchestration layers that understand data placement, copy depth, and workload sensitivity.