The conversation around AI infrastructure still defaults to one question: how much GPU do you have? Agentic systems force a better one: how much coordination overhead can your infrastructure absorb before the model itself stops being the bottleneck? Once that question matters, the stack around the accelerator starts to dominate architectural decisions.
The plain-English version
Imagine a large restaurant.
- The GPU is the main kitchen line. It is incredible at high-volume, repetitive cooking.
- The CPU is the floor manager. It coordinates tables, substitutions, inventory, payment, and staff timing.
- The NIC is the front door.
- A SmartNIC is a front-door team with its own specialists, so the manager does not personally inspect every delivery and every guest.
- A DPU is a whole operations office built into the entrance. It can enforce policy, handle intake, route traffic, isolate problems, and keep the restaurant protected before work reaches the manager.
Traditional inference was usually one large order to the kitchen. Agentic AI is dozens of small decisions: planning, branching, checking tools, consulting memory, fetching data, spawning subtasks, retrying failures, and streaming results back. The kitchen still matters, but the restaurant can now stall because the hallway, front desk, and manager are overloaded.
Why CPUs become the first visible bottleneck
CPUs are the first subsystem people notice because they sit in the path of almost every coordination event. They terminate network connections, run application threads, host sidecars and proxies, batch work, maintain process state, perform auth, drive storage control paths, and wake the GPU runtime at exactly the right moment.
Too many copies
Packets land in host memory, get copied into kernel structures, copied into userspace buffers, and then moved again toward a GPU or another service.
Too many packets
Agent loops turn one model call into dozens of tiny RPCs, storage lookups, retries, and telemetry events, each with per-packet overhead.
Too much host policy
Virtual switching, overlay networking, service mesh work, TLS, and ACLs all consume premium host cycles.
Too much orchestration
Planning, state transitions, context assembly, and tool supervision create bursts of CPU-heavy control work around every inference.
That is why CPU utilization becomes an early warning sign. It is not just that the CPU is “important again.” It is that the host is accidentally doing too much infrastructure labor.
Classic inference vs. agentic request path
Agentic requests create far more control-plane activity, which amplifies CPU, network, and memory-path overhead around the model.
What SmartNICs and DPUs actually are
A traditional NIC is mainly a packet mover with DMA engines and queue machinery. A SmartNIC adds enough programmable acceleration to offload network-intensive functions such as flow steering, switching assistance, overlay handling, or firewall-like tasks. A DPU pushes farther: it becomes a first-class infrastructure processor with its own compute domain.
| Component | Core job | Why it matters in agentic AI |
|---|---|---|
| CPU | General-purpose control, scheduling, syscalls, service logic | Agent runtimes amplify orchestration, retries, security, and state management. |
| GPU | Dense matrix and vector math | Still the main model-compute engine, but easily starved by slow data delivery. |
| SmartNIC | Network data-plane offload | Reduces host packet overhead and frees CPU cycles for application logic. |
| DPU | Infrastructure processing with embedded cores and offloads | Moves switching, isolation, storage, and policy enforcement closer to the wire. |
Red Hat describes SmartNIC hardware offloading as diverting network tasks away from the CPU to a dedicated processor on a NIC, which reduces host CPU load and increases data-processing performance. NVIDIA describes BlueField DPU mode as an embedded Arm system that owns and controls NIC resources and the data path. Those are not cosmetic differences. They change where trust, policy, and packet work live in the system.
Three levels of infrastructure responsibility
The progression is not “faster NIC.” It is “more infrastructure work moved away from the host.”
Where RDMA and zero-copy fit
Once the problem is framed as byte movement rather than only model FLOPS, two ideas become central: avoid unnecessary copies and reduce host mediation on the fast path.
DMA
DMA, direct memory access, lets a device read or write host memory without the CPU copying every byte itself. Modern NICs and accelerators already depend on this. The CPU sets up rings and descriptors; the device moves the payload.
RDMA
RDMA extends the idea over the network. After setup and memory registration, the NIC can access remote memory with minimal CPU intervention. The Linux kernel documents this through the ib_uverbs userspace interface and the libibverbs userspace library from rdma-core.
Zero-copy
Zero-copy usually means avoiding extra hot-path copies between kernel space and userspace, or between one software layer and the next. It rarely means literally zero copies anywhere. The point is to keep the CPU from wasting bandwidth and cache on shuffling payloads around.
Why RDMA and zero-copy reduce host overhead
RDMA reduces CPU mediation. Zero-copy reduces duplicate movement. Offloads reduce the host’s packet and policy burden.
Linux exposes several relevant paths:
- AF_XDP for high-performance packet delivery to userspace, including zero-copy mode when the driver supports it.
- MSG_ZEROCOPY for copy avoidance on send-side socket operations.
- io_uring zero-copy receive, which preserves the kernel TCP stack while placing payload data directly into userspace memory.
- GPUDirect RDMA in supported topologies, where NICs and GPUs can exchange data more directly over PCIe.
How the data path is initialized in Linux
The Linux PCI driver sequence tells the bigger story cleanly: enable the device, request MMIO resources, set DMA masks, allocate coherent DMA control memory, register interrupts, initialize the device, then enable DMA and processing engines. That is the skeleton underneath every NIC, storage controller, and accelerator data path.
Two DMA modes matter especially here:
- Coherent DMA for long-lived shared control structures such as descriptor rings.
- Streaming DMA for transient packet or buffer payloads that are mapped for one transfer direction (sketched after the probe example below).
```c
// illustrative only
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/dma-mapping.h>

struct demo_ring {
	void *desc_cpu;
	dma_addr_t desc_dma;
	size_t size;
};

struct demo_dev {
	struct pci_dev *pdev;
	void __iomem *bar0;
	struct demo_ring tx;
};

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct demo_dev *d;
	int ret;

	d = devm_kzalloc(&pdev->dev, sizeof(*d), GFP_KERNEL);
	if (!d)
		return -ENOMEM;

	/* Managed enable: devres unwinds resources on failure paths. */
	ret = pcim_enable_device(pdev);
	if (ret)
		return ret;

	/* Allow the device to master the bus, i.e. to issue DMA. */
	pci_set_master(pdev);

	/* Declare which addresses the device can reach via DMA. */
	ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
	if (ret)
		return ret;

	/* Claim BAR 0 and map its registers for MMIO access. */
	ret = pci_request_region(pdev, 0, "demo_accel");
	if (ret)
		return ret;

	d->bar0 = pci_iomap(pdev, 0, 0);
	if (!d->bar0)
		return -ENOMEM;

	/* Coherent DMA: a long-lived descriptor ring shared between
	 * CPU and device without explicit sync operations.
	 */
	d->tx.size = PAGE_SIZE;
	d->tx.desc_cpu = dma_alloc_coherent(&pdev->dev, d->tx.size,
					    &d->tx.desc_dma, GFP_KERNEL);
	if (!d->tx.desc_cpu)
		return -ENOMEM;

	/*
	 * Real drivers now program descriptor base addresses into MMIO
	 * registers, arm interrupts, and start the DMA engine.
	 */
	return 0;
}
```
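The probe above uses coherent DMA for the descriptor ring. Payload buffers typically take the streaming path instead: mapped for a single transfer, then unmapped. A minimal sketch using the standard DMA API (the map_tx_payload helper is hypothetical):

```c
// illustrative only
#include <linux/dma-mapping.h>

/* Map one transmit payload so the device can read it once. */
static dma_addr_t map_tx_payload(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle;

	/* Streaming mapping: one direction, one transfer lifetime. */
	handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, handle))
		return 0; /* 0 used as an error sentinel for brevity */

	/*
	 * Real code writes `handle` into a descriptor, and after the
	 * device signals completion calls:
	 * dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
	 */
	return handle;
}
```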
The point is not just that the CPU is involved. The point is that the CPU is mostly arranging movement and ownership so devices can execute the hot path efficiently. If too much of the hot path returns to the host, efficiency collapses.
Code-level examples
Userspace RDMA verbs
The Linux RDMA userspace path revolves around registering memory, creating queue pairs and completion queues, and posting work requests. After setup, the NIC fast path can operate with very low host CPU involvement.
```c
// illustrative only
#include <stdint.h>
#include <infiniband/verbs.h>

struct ibv_device **list = ibv_get_device_list(NULL);
struct ibv_context *ctx = ibv_open_device(list[0]);
struct ibv_pd *pd = ibv_alloc_pd(ctx);
struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

struct ibv_qp_init_attr qpia = {
	.send_cq = cq,
	.recv_cq = cq,
	.qp_type = IBV_QPT_RC,	/* reliable connected */
	.cap = {
		.max_send_wr = 128,
		.max_recv_wr = 128,
		.max_send_sge = 1,
		.max_recv_sge = 1,
	},
};
struct ibv_qp *qp = ibv_create_qp(pd, &qpia);

/* Register memory once; the NIC can then DMA it directly. */
char buf[4096];
struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf),
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_WRITE |
			       IBV_ACCESS_REMOTE_READ);

struct ibv_sge sge = {
	.addr = (uintptr_t)buf,
	.length = sizeof(buf),
	.lkey = mr->lkey,
};
struct ibv_send_wr wr = {
	.wr_id = 1,
	.sg_list = &sge,
	.num_sge = 1,
	.opcode = IBV_WR_SEND,
	.send_flags = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad = NULL;
ibv_post_send(qp, &wr, &bad);
```
Real code must transition the queue pair through RESET → INIT → RTR → RTS, connect to a peer, post receives, and poll completions carefully. But the example shows the mental model: register memory once, point the NIC at it, and let hardware move the bytes.
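As a taste of that completion handling, here is a minimal busy-polling sketch for the signaled send above; production code often blocks on a completion channel instead:

```c
// illustrative only
#include <infiniband/verbs.h>

/* Spin until the signaled send work request completes. */
static int wait_for_send(struct ibv_cq *cq)
{
	struct ibv_wc wc;
	int n;

	do {
		n = ibv_poll_cq(cq, 1, &wc); /* non-blocking; 0 = empty */
	} while (n == 0);

	if (n < 0 || wc.status != IBV_WC_SUCCESS)
		return -1; /* real code logs wc.status and wc.wr_id */
	return 0;
}
```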
AF_XDP zero-copy packet path
AF_XDP sockets let an XDP program redirect packets into userspace memory. When the driver and hardware support it, binding with XDP_ZEROCOPY avoids copying packet data out to userspace.
```c
// illustrative only
#include <linux/if_xdp.h>
#include <net/if.h>
#include <sys/socket.h>

int fd = socket(AF_XDP, SOCK_RAW, 0);

/*
 * UMEM setup omitted:
 * - create fill/completion rings
 * - register UMEM
 * - create RX/TX rings
 */

struct sockaddr_xdp sxdp = {
	.sxdp_family = AF_XDP,
	.sxdp_ifindex = if_nametoindex("eth0"),
	.sxdp_queue_id = 0,
	.sxdp_flags = XDP_ZEROCOPY,
};
bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

/*
 * An XDP program loaded on the interface uses bpf_redirect_map()
 * to steer matching ingress packets into this socket's queue.
 */
```
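Once the rings exist, receiving is just reading descriptors out of shared memory. A sketch using the xsk_* helpers from libxdp (rx and umem_area would come from the UMEM setup omitted above):

```c
// illustrative only
#include <xdp/xsk.h> /* libxdp helpers; older trees ship bpf/xsk.h */

/* Consume a batch of received frames without copying them. */
static void drain_rx(struct xsk_ring_cons *rx, void *umem_area)
{
	__u32 idx = 0;
	unsigned int i, rcvd;

	rcvd = xsk_ring_cons__peek(rx, 64, &idx);
	for (i = 0; i < rcvd; i++) {
		const struct xdp_desc *desc =
			xsk_ring_cons__rx_desc(rx, idx + i);
		void *pkt = xsk_umem__get_data(umem_area, desc->addr);

		/* Process desc->len bytes at pkt in place. */
		(void)pkt;
	}
	xsk_ring_cons__release(rx, rcvd);
}
```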
That makes AF_XDP attractive when an application wants extreme packet rates without carrying the full burden of a DPDK-style stack.
Socket send with MSG_ZEROCOPY
Linux also supports copy avoidance on ordinary socket send calls. The important caveat from the kernel documentation is that this is a hint, not a guarantee. Applications must drain completion notifications from the error queue before they can safely reuse buffers.
```c
// illustrative only
#include <stdio.h>
#include <sys/socket.h>

int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

ssize_t sent = send(fd, big_buf, big_len, MSG_ZEROCOPY);
if (sent < 0) {
	perror("send");
}

/*
 * Later:
 * poll() for POLLERR and drain MSG_ERRQUEUE with recvmsg()
 * to learn which sends are complete and safe to reuse.
 */
```
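The completion drain that the comment above alludes to looks roughly like this; a sketch following the kernel's MSG_ZEROCOPY documentation, with error handling trimmed:

```c
// illustrative only
#include <linux/errqueue.h>
#include <sys/socket.h>

/* Reap one zero-copy completion from the socket error queue. */
static int drain_zerocopy_completion(int fd)
{
	char control[128];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return -1; /* EAGAIN: nothing completed yet */

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *serr =
			(struct sock_extended_err *)CMSG_DATA(cm);

		if (serr->ee_errno != 0 ||
		    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
			continue;
		/*
		 * [ee_info, ee_data] is the range of completed send
		 * calls; SO_EE_CODE_ZEROCOPY_COPIED in ee_code means
		 * the kernel fell back to copying after all.
		 */
		return 0;
	}
	return -1;
}
```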
This is one of the clearest examples of the tradeoff around zero-copy. You avoid a hot-path copy, but you introduce page pinning, completion handling, and more complex buffer lifecycle rules.
io_uring zero-copy receive
The newer io_uring zero-copy receive path is especially interesting for AI services because it preserves the kernel TCP stack while still reducing kernel-to-userspace copy overhead. NICs need hardware support such as header/data split and proper flow steering, but when those conditions hold, the receive path becomes much leaner.
Where GPUDirect RDMA changes the picture
GPUDirect RDMA matters when a NIC and GPU can exchange data more directly over PCIe, cutting out some host bounce-buffer work. NVIDIA's documentation is careful about the constraints: topology matters, PCIe placement matters, and IOMMU configuration matters. The cleanest case is when the NIC and GPU share a favorable root-complex and switch topology.
Conventional path vs. direct NIC-to-GPU movement
The payoff is lower CPU pressure and better accelerator feed efficiency. The cost is operational complexity and stricter platform constraints.
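From the application side, the pattern can be as small as registering device memory with verbs. A sketch assuming a CUDA allocation, a verbs provider with GPUDirect RDMA support, and the nvidia-peermem kernel module loaded (pd is the protection domain from the earlier example):

```c
// illustrative only
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register GPU memory so the NIC can DMA it directly over PCIe. */
void *gpu_buf;
cudaMalloc(&gpu_buf, 1 << 20);

struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, 1 << 20,
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_WRITE);
/*
 * If registration succeeds, work requests can point sge.addr at
 * gpu_buf, and payload bytes never transit host memory.
 */
```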
What this means for AI infrastructure
There is a common misconception that AI performance is just a matter of buying more GPU FLOPS. That mental model only works when the rest of the system is quiet. Agentic systems are not quiet. They are networked control systems with storage, policy, orchestration, retries, and memory movement happening around each inference step.
The winning architecture, therefore, is not the one with the prettiest accelerator spec sheet. It is the one that keeps expensive host and accelerator resources focused on the work only they can do.
- Use CPUs for control logic, not avoidable packet grunt work.
- Use SmartNICs and DPUs to move repetitive infrastructure tasks closer to the wire.
- Use RDMA when remote memory access and low CPU involvement truly matter.
- Use zero-copy carefully where buffer sizes, driver support, and lifecycle complexity justify it.
- Think in terms of whole-path efficiency, not isolated accelerator utilization.
Sources
Primary references used for the technical details and API descriptions in this essay:
- Linux kernel RDMA userspace verbs documentation
- Linux PCI driver documentation
- Linux DMA API guide
- Linux AF_XDP documentation
- Linux MSG_ZEROCOPY documentation
- Linux io_uring zero-copy receive documentation
- NVIDIA BlueField DPU operation docs
- NVIDIA BlueField modes of operation
- NVIDIA GPUDirect RDMA documentation
- NVIDIA GPU Operator GPUDirect RDMA overview
- Red Hat OpenShift SmartNIC hardware offloading docs