RDMA · AI Infrastructure · Zero-Copy · Disaggregated Inference

RDMA in the Age of AI

Remote Direct Memory Access is old enough to feel familiar and new enough to matter again. In the AI era, RDMA is no longer just an HPC detail — it sits in the critical path of GPU clusters, KV-cache transfer, and disaggregated inference. The real question isn't "is the link fast?" It's: where do copies still remain, and which software layer has the right to reason about them?

By Manish KL · April 2026 · ~19 min read · Technical Essay
Abstract

This post covers RDMA from past to present: what it is; how the verbs and memory-registration model works; why InfiniBand, iWARP, and RoCE mattered; why cloud RDMA and GPU-aware communication matter again today; how EFA, UCX, and NIXL fit into modern AI stacks; and why "RDMA exists" still does not mean true end-to-end zero-copy. The key modern insight: once RDMA improves the transport path, the next performance frontier is copy-boundary awareness. Not more RDMA, but understanding where copies still hide and which software layer can route around them.

Why RDMA matters again

RDMA has always been about removing software overhead from the data path. That old motivation has not changed. What has changed is where the pressure now comes from. In classic HPC, the goal was tightly coupled distributed compute with low latency and high bandwidth. In modern AI systems, the pressure comes from massive model state, distributed serving, KV-cache transfer, and the cost of moving data across heterogeneous memory and compute domains.

AI has turned memory movement into the main event. Large models don't simply need more FLOPs — they need fewer wasteful copies, fewer kernel crossings, lower host intervention, and better control over where tensors land and how they travel.

RDMA reduces software in the path. AI makes the path itself strategic.
Origin: HPC clusters
Present: GPU inference
Core win: fewer copies
Next frontier: copy-boundary visibility

What RDMA actually is

RDMA — Remote Direct Memory Access — lets one machine access memory on another machine with much less CPU and operating-system involvement than ordinary socket-based data paths. The exact details differ by fabric and software stack, but the core idea is consistent: memory regions are registered, work requests are posted to queue pairs, and an RDMA-capable NIC executes the transfer using hardware offload and direct placement rather than ordinary buffered network I/O.

The practical mental model: register memory, hand descriptors to the RNIC, let the NIC move bytes directly — and avoid dragging the CPU through every hop.
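That mental model can be made concrete with a toy simulation. This is not the real verbs API (production code uses libibverbs in C, with calls like ibv_reg_mr and ibv_post_send); the names MemoryRegion, WorkRequest, and QueuePair below are illustrative stand-ins that mimic the flow: register memory, post a descriptor, let "hardware" execute the transfer and signal a completion queue.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryRegion:
    """A registered (pinned) buffer the RNIC may access directly."""
    key: int
    buf: bytearray

@dataclass
class WorkRequest:
    """Descriptor handed to the NIC: what to move, between which MRs."""
    op: str              # e.g. "RDMA_WRITE"
    local: MemoryRegion
    remote: MemoryRegion

class QueuePair:
    """Toy queue pair: the CPU posts descriptors; 'hardware' drains them."""
    def __init__(self):
        self.send_queue: List[WorkRequest] = []
        self.completion_queue: List[str] = []

    def post_send(self, wr: WorkRequest) -> None:
        # The CPU's only job on the data path: enqueue a descriptor.
        self.send_queue.append(wr)

    def hw_process(self) -> None:
        # Stand-in for the RNIC: moves bytes directly between registered
        # regions, then signals completion. No per-byte CPU involvement.
        while self.send_queue:
            wr = self.send_queue.pop(0)
            wr.remote.buf[:] = wr.local.buf
            self.completion_queue.append(f"{wr.op} done")

# Register memory, post a one-sided write, let the "NIC" execute it.
src = MemoryRegion(key=1, buf=bytearray(b"kv-cache-block"))
dst = MemoryRegion(key=2, buf=bytearray(len(src.buf)))
qp = QueuePair()
qp.post_send(WorkRequest("RDMA_WRITE", src, dst))
qp.hw_process()
```

The point of the sketch is the division of labor: the CPU touches descriptors, not payload bytes, and learns about the transfer only through the completion queue.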
[Figure: RDMA core mechanics, the verbs model. Host A posts work requests over registered, pinned memory regions to a queue pair (send queue, receive queue); the RNIC executes transfers with kernel bypass and signals a completion queue. On Host B, a one-sided RDMA write lands directly in registered memory with the CPU/OS not in the data path. Fabric: IB, RoCE, iWARP, or EFA.]
Figure 1. The RDMA verbs model: queue pairs, registered memory regions, and hardware-executed transfers that bypass the kernel on the data path.

One-sided vs two-sided

RDMA systems usually distinguish between two-sided and one-sided operations. Two-sided send/receive still requires the remote side to have prepared a receive buffer in advance. One-sided read/write lets one side directly read or write registered memory on the peer, with no equivalent remote-side software action in the critical path — part of why RDMA has long been attractive for moving large state objects with lower synchronization overhead than ordinary message passing.
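The distinction can be sketched in the same toy style, again with hypothetical names (Peer, two_sided_send, one_sided_write) rather than real verbs calls: a two-sided operation fails unless the target pre-posted a receive buffer, while a one-sided write lands in registered memory with no remote software action at all.

```python
class Peer:
    """Toy target: two-sided ops need a pre-posted receive buffer;
    one-sided ops land directly in registered memory."""
    def __init__(self, mr_size: int):
        self.registered = bytearray(mr_size)  # registered MR
        self.recv_queue = []                  # pre-posted receive buffers

    def post_recv(self, size: int) -> None:
        self.recv_queue.append(bytearray(size))

def two_sided_send(peer: Peer, payload: bytes) -> bytearray:
    # Fails unless the remote side prepared a buffer first.
    if not peer.recv_queue:
        raise RuntimeError("receiver not ready: no posted receive")
    buf = peer.recv_queue.pop(0)
    buf[:len(payload)] = payload
    return buf

def one_sided_write(peer: Peer, payload: bytes, offset: int = 0) -> None:
    # No remote software action: bytes land in registered memory.
    peer.registered[offset:offset + len(payload)] = payload

peer = Peer(mr_size=16)
one_sided_write(peer, b"state")      # works with no remote preparation
try:
    two_sided_send(peer, b"msg")     # raises: nothing was pre-posted
except RuntimeError as e:
    print(e)
peer.post_recv(16)
two_sided_send(peer, b"msg")         # succeeds once a receive is posted
```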

The historical arc

RDMA did not arrive because of LLMs. It emerged from decades of pressure in high-performance systems: distributed databases, storage targets, MPI-heavy clusters, low-latency trading infrastructure, and large scientific computing environments all wanted lower latency, lower CPU overhead, and better bandwidth efficiency.

[Figure: RDMA, from HPC primitive to AI-critical infrastructure. 1990s–2000s: VIA initiative, InfiniBand spec, HPC roots. 2000s–2010s: iWARP and RoCE, storage networks, databases join. 2010s: GPUDirect RDMA, GPU clusters, UCX emerges. 2020s: cloud RDMA and EFA, large-scale AI, NIXL and NCCL. Now: KV-cache transfer, disaggregated inference, copy-boundary control. Center of gravity: from HPC specialist tool to cloud AI critical path.]
Figure 2. RDMA's center of gravity has moved from classic HPC toward cloud AI, GPU communication, and disaggregated inference.
Technology | What it brought | Why it mattered
InfiniBand | native RDMA fabric, verbs model, mature HPC tooling | low-latency clusters and tightly coupled distributed compute
iWARP | RDMA over TCP/IP networks | extended RDMA semantics into conventional network environments
RoCE | RDMA over Ethernet | brought RDMA ideas into data-center Ethernet ecosystems
GPUDirect RDMA / UCX | GPU-aware data movement, modern comm frameworks | made RDMA relevant to distributed GPU workloads and AI pipelines

Zero-copy is the dream — not the whole truth

RDMA is often associated with "zero-copy," and that intuition is directionally right. If a registered memory region is transferred directly by the NIC into a remote registered region, then the path can avoid a lot of the extra copying that ordinary socket paths incur. But "RDMA exists" is not the same thing as "the full system is end-to-end zero-copy."

In real systems, extra copies can still appear at multiple points:

- registration and staging on the producer side, including bounce buffers between GPU memory and a DMA-able region
- descriptor packing and metadata serialization before the transfer is posted
- host-side landing zones on the consumer, where bytes arrive before reaching their final destination
- framework handoff, where data is rematerialized into the buffers the serving framework actually uses

RDMA reduces copies. It does not make copy analysis optional.

The architecture of an RDMA path in AI serving

The AI-era RDMA question is not only "do we have an RNIC?" It's "where does the tensor begin, where is it registered, where does it land, and what software layers still touch it?" In disaggregated inference, that question becomes especially important because KV-cache movement sits directly in the serving critical path.

[Figure: RDMA in AI serving, where copies actually hide. Producer (prefill node): KV origin in GPU HBM, registration/staging with a possible bounce buffer, descriptor packing and metadata serialization, then the producer RNIC. Transport path: InfiniBand, RoCE, iWARP, or EFA via libfabric/UCX/NIXL — lower-copy movement, but not always end-to-end. Consumer (decode node): RNIC DMA into host memory, a host-side landing zone, framework handoff with possible rematerialization, and a final copy into decode GPU memory. Copies survive at the edges even when the transport path is RDMA-clean.]
Figure 3. Copy boundaries in a real AI serving RDMA path. The transport middle is clean; the edges at both producer and consumer still hide multiple copy opportunities.
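One way to make this concrete is to treat copy depth as a number you can compute per path. The sketch below is a hypothetical accounting model, not a real profiler: the hop names are taken from the stages Figure 3 describes, and the specific counts are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    name: str
    is_copy: bool  # does this hop duplicate the bytes?

def copy_depth(path) -> int:
    """End-to-end copy count for a transfer path."""
    return sum(h.is_copy for h in path)

# Hypothetical KV-cache paths: the RDMA middle is copy-free in both,
# but the edges differ in staging, landing, and handoff behavior.
naive_path = [
    Hop("prefill GPU HBM -> host staging buffer", True),
    Hop("RDMA transport (RNIC to RNIC)", False),
    Hop("host-side landing zone", True),
    Hop("framework rematerialization -> decode GPU", True),
]
tuned_path = [
    Hop("prefill GPU HBM (registered directly)", False),
    Hop("RDMA transport (RNIC to RNIC)", False),
    Hop("decode GPU HBM (direct placement)", False),
]
print(copy_depth(naive_path))  # 3 copies despite an RDMA-clean middle
print(copy_depth(tuned_path))  # 0
```

Both paths "use RDMA," yet one moves every byte three extra times — which is exactly why the transport label alone tells you so little.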

What changed in the cloud era

Historically, RDMA was associated with tightly managed HPC fabrics and special-purpose low-latency clusters. In the cloud era, providers introduced their own lower-latency, hardware-assisted communication models. On AWS, Elastic Fabric Adapter is exposed through libfabric; NVIDIA's inference-oriented transfer work has surfaced as NIXL with EFA support for disaggregated inference and KV-cache movement.

That's a modern signal: RDMA-style thinking is no longer limited to training collectives or MPI clusters. It's now explicit in the inference serving stack. And the moment AWS says "KV cache moves using RDMA," the technical question becomes sharper — it's no longer enough to say the network is good. The right question is: what is the copy depth of the end-to-end path?

Why RDMA matters specifically in AI

AI makes RDMA newly visible because the dominant objects are large and latency budgets are painful. Training has long cared about collective communication and GPU-to-GPU traffic. Inference adds a different pressure: KV cache is not just a background artifact — it's a serving primitive that has to move across systems in a path that directly affects throughput, inter-token latency, and tail behavior.

AI context | Why RDMA helps | What still needs attention
distributed training | lower CPU overhead, lower latency, better bandwidth efficiency for collectives | topology, NIC/GPU affinity, locality-aware placement
KV-cache transfer in inference | lower-copy path for large state movement between prefill and decode | landing zones, framework handoff, tail-latency sensitivity
memory tiering / offload | faster movement between memory or storage tiers | registration overhead, staging buffers, software orchestration
In AI infrastructure, RDMA is not only about low latency. It is about moving large live state objects through the system without wasting host cycles and memory bandwidth on unnecessary copies.

Why UCX, EFA, libfabric, and NIXL matter

Modern AI stacks rarely expose "raw verbs" directly as the user-facing abstraction. Instead, systems are built on communication frameworks and adapters that decide how to exploit the best available hardware resources. UCX is a high-performance communication framework that can target RDMA transports and GPU-aware paths. AWS's EFA is exposed through libfabric. NVIDIA's NIXL is an inference-oriented transfer library for point-to-point data movement in disaggregated serving environments.

[Figure: the AI communication stack, layers above raw RDMA. Top to bottom: application/inference framework (vLLM, TensorRT-LLM, SGLang, custom serving); orchestration/policy layer (NIXL, workload-aware routing, copy-boundary analysis — the strategic moat); communication framework (UCX, libfabric, NCCL, MPI); transport layer (RoCE v2, InfiniBand, iWARP, EFA); hardware/RNIC (ConnectX, EFA NIC, custom silicon).]
Figure 4. The AI communication stack. The strategic moat is shifting upward — from who owns the transport to who owns the policy layer that knows which paths deserve trust for which workloads.

That layering matters because it leads to a subtle but important conclusion: even when an inference system "uses RDMA," the strategic software question may not only be which transport library is underneath. It may also be which layer decides how workloads should be routed across those transport paths, and how copy-heavy versus lower-copy paths should be identified.

The new frontier: copy-boundary awareness

This is where the conversation should go next. Once RDMA improves the transport path, the next performance frontier is not simply "more RDMA." It is understanding where copies still hide and how software can adapt around them.

That means asking questions like:

- Where is a tensor first materialized, and where is it registered?
- How many copies does the full path incur, end to end, counting staging buffers and landing zones?
- Which software layer can actually observe those copy boundaries?
- Which workloads are latency-sensitive enough that they should be routed onto the lowest-copy paths?

New systems question: once RDMA is present, what remains is not just a transport problem. It becomes a copy-boundary analysis problem and, eventually, a policy problem.
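What would a policy layer built on copy-boundary analysis look like? The fragment below is a deliberately simplified sketch of the idea, with made-up path names and numbers: given candidate paths annotated with copy depth and bandwidth, route latency-sensitive workloads onto the lowest-copy path and throughput workloads onto the highest-bandwidth one.

```python
# Hypothetical policy-layer sketch: paths is a list of
# (name, copy_depth, bandwidth_gbps) tuples; all values are illustrative.

def pick_path(paths, latency_sensitive: bool):
    if latency_sensitive:
        # Minimize copies first, then break ties on bandwidth.
        return min(paths, key=lambda p: (p[1], -p[2]))
    # Throughput workloads: bandwidth first, copies as tiebreaker.
    return max(paths, key=lambda p: (p[2], -p[1]))

paths = [
    ("gpu-direct rdma", 0, 100.0),
    ("host-staged rdma", 2, 200.0),
    ("tcp fallback", 4, 40.0),
]
print(pick_path(paths, latency_sensitive=True)[0])   # gpu-direct rdma
print(pick_path(paths, latency_sensitive=False)[0])  # host-staged rdma
```

The interesting part is not the two-line heuristic; it is that the inputs (per-path copy depths) have to come from somewhere. Producing them is the copy-boundary analysis problem this section is arguing for.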

Past, present, and future

In the past, RDMA was a specialist tool for people who lived in low-latency clusters and HPC fabrics. In the present, AI has made it visible again because the software stack is suddenly full of large, live state objects that cannot afford ordinary copy-heavy paths. In the future, the most interesting systems may not simply advertise "we use RDMA." They may expose copy-boundary-aware control planes that know where data lands, how many copies the full path requires, and which workloads should be protected from paths with worse copy behavior.

The future is not only "RDMA everywhere." It is "copy-boundary visibility everywhere."

That is why RDMA deserves a fresh look in the age of AI. It is not a historical relic — it is a foundation. But foundations do not eliminate architectural work above them. Once transport becomes better, the next software moat may move upward: into the policy and orchestration layers that understand data placement, copy depth, and workload sensitivity.