The conversation around AI infrastructure still defaults to one question: how much GPU do you have? Agentic systems force a better one: how much coordination overhead can your infrastructure absorb before the model itself stops being the bottleneck? Once that question matters, the stack around the accelerator starts to dominate architectural decisions.
The plain-English version
Imagine a large restaurant.
- The GPU is the main kitchen line. It is incredible at high-volume, repetitive cooking.
- The CPU is the floor manager. It coordinates tables, substitutions, inventory, payment, and staff timing.
- The NIC is the front door.
- A SmartNIC is a front-door team with its own specialists, so the manager does not personally inspect every delivery and every guest.
- A DPU is a whole operations office built into the entrance. It can enforce policy, handle intake, route traffic, isolate problems, and keep the restaurant protected before work reaches the manager.
Traditional inference was usually one large order to the kitchen. Agentic AI is dozens of small decisions: planning, branching, checking tools, consulting memory, fetching data, spawning subtasks, retrying failures, and streaming results back. The kitchen still matters, but the restaurant can now stall because the hallway, front desk, and manager are overloaded.
Why CPUs become the first visible bottleneck
CPUs are the first subsystem people notice because they sit in the path of almost every coordination event. They terminate network connections, run application threads, host sidecars and proxies, batch work, maintain process state, perform auth, drive storage control paths, and wake the GPU runtime at exactly the right moment.
Too many copies
Packets land in host memory, get copied into kernel structures, copied into userspace buffers, and then moved again toward a GPU or another service.
Too many packets
Agent loops turn one model call into dozens of tiny RPCs, storage lookups, retries, and telemetry events, each with per-packet overhead.
Too much host policy
Virtual switching, overlay networking, service mesh work, TLS, and ACLs all consume premium host cycles.
Too much orchestration
Planning, state transitions, context assembly, and tool supervision create bursts of CPU-heavy control work around every inference.
That is why CPU utilization becomes an early warning sign. It is not just that the CPU is “important again.” It is that the host is accidentally doing too much infrastructure labor.
Classic inference vs. agentic request path
Agentic requests create far more control-plane activity, which amplifies CPU, network, and memory-path overhead around the model.
What SmartNICs and DPUs actually are
A traditional NIC is mainly a packet mover with DMA engines and queue machinery. A SmartNIC adds enough programmable acceleration to offload network-intensive functions such as flow steering, switching assistance, overlay handling, or firewall-like tasks. A DPU pushes farther: it becomes a first-class infrastructure processor with its own compute domain.
| Component | Core job | Why it matters in agentic AI |
|---|---|---|
| CPU | General-purpose control, scheduling, syscalls, service logic | Agent runtimes amplify orchestration, retries, security, and state management. |
| GPU | Dense matrix and vector math | Still the main model-compute engine, but easily starved by slow data delivery. |
| SmartNIC | Network data-plane offload | Reduces host packet overhead and frees CPU cycles for application logic. |
| DPU | Infrastructure processing with embedded cores and offloads | Moves switching, isolation, storage, and policy enforcement closer to the wire. |
Red Hat describes SmartNIC hardware offloading as diverting network tasks away from the CPU to a dedicated processor on a NIC, which reduces host CPU load and increases data-processing performance. NVIDIA describes BlueField DPU mode as an embedded Arm system that owns and controls NIC resources and the data path. Those are not cosmetic differences. They change where trust, policy, and packet work live in the system.
Three levels of infrastructure responsibility
The progression is not “faster NIC.” It is “more infrastructure work moved away from the host.”
Where RDMA and zero-copy fit
Once the problem is framed as byte movement rather than only model FLOPS, two ideas become central: avoid unnecessary copies and reduce host mediation on the fast path.
DMA
DMA, direct memory access, lets a device read or write host memory without the CPU copying every byte itself. Modern NICs and accelerators already depend on this. The CPU sets up rings and descriptors; the device moves the payload.
RDMA
RDMA extends the idea over the network. After setup and memory registration, the NIC can access remote memory with minimal CPU intervention. The Linux kernel documents this through the ib_uverbs userspace interface and the libibverbs userspace library from rdma-core.
Zero-copy
Zero-copy usually means avoiding extra hot-path copies between kernel space and userspace, or between one software layer and the next. It rarely means literally zero copies anywhere. The point is to keep the CPU from wasting bandwidth and cache on shuffling payloads around.
Why RDMA and zero-copy reduce host overhead
RDMA reduces CPU mediation. Zero-copy reduces duplicate movement. Offloads reduce the host’s packet and policy burden.
Linux exposes several relevant paths:
- AF_XDP for high-performance packet delivery to userspace, including zero-copy mode when the driver supports it.
- MSG_ZEROCOPY for copy avoidance on send-side socket operations.
- io_uring zero-copy receive, which preserves the kernel TCP stack while placing payload data directly into userspace memory.
- GPUDirect RDMA in supported topologies, where NICs and GPUs can exchange data more directly over PCIe.
How the data path is initialized in Linux
The Linux PCI driver sequence tells the bigger story cleanly: enable the device, request MMIO resources, set DMA masks, allocate coherent DMA control memory, register interrupts, initialize the device, then enable DMA and processing engines. That is the skeleton underneath every NIC, storage controller, and accelerator data path.
Two DMA modes matter especially here:
- Coherent DMA for long-lived shared control structures such as descriptor rings.
- Streaming DMA for transient packet or buffer payloads that are mapped for one transfer direction (sketched after the probe example below).
```c
// illustrative only
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/dma-mapping.h>

struct demo_ring {
	void *desc_cpu;
	dma_addr_t desc_dma;
	size_t size;
};

struct demo_dev {
	struct pci_dev *pdev;
	void __iomem *bar0;
	struct demo_ring tx;
};

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct demo_dev *d;
	int ret;

	d = devm_kzalloc(&pdev->dev, sizeof(*d), GFP_KERNEL);
	if (!d)
		return -ENOMEM;

	/* Managed enable: devres unwinds resources on failure paths. */
	ret = pcim_enable_device(pdev);
	if (ret)
		return ret;

	/* Allow the device to master the bus, i.e. to issue DMA. */
	pci_set_master(pdev);

	/* Declare which addresses the device can reach via DMA. */
	ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
	if (ret)
		return ret;

	/* Claim BAR 0 and map its registers for MMIO access. */
	ret = pci_request_region(pdev, 0, "demo_accel");
	if (ret)
		return ret;

	d->bar0 = pci_iomap(pdev, 0, 0);
	if (!d->bar0)
		return -ENOMEM;

	/* Coherent DMA: a long-lived descriptor ring shared between
	 * CPU and device without explicit sync operations.
	 */
	d->tx.size = PAGE_SIZE;
	d->tx.desc_cpu = dma_alloc_coherent(&pdev->dev, d->tx.size,
					    &d->tx.desc_dma, GFP_KERNEL);
	if (!d->tx.desc_cpu)
		return -ENOMEM;

	/*
	 * Real drivers now program descriptor base addresses into MMIO
	 * registers, arm interrupts, and start the DMA engine.
	 */
	return 0;
}
```
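The probe above uses coherent DMA for the descriptor ring. Payload buffers typically take the streaming path instead: mapped for a single transfer, then unmapped. A minimal sketch using the standard DMA API (the map_tx_payload helper is hypothetical):

```c
// illustrative only
#include <linux/dma-mapping.h>

/* Map one transmit payload so the device can read it once. */
static dma_addr_t map_tx_payload(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle;

	/* Streaming mapping: one direction, one transfer lifetime. */
	handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, handle))
		return 0; /* 0 used as an error sentinel for brevity */

	/*
	 * Real code writes `handle` into a descriptor, and after the
	 * device signals completion calls:
	 * dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
	 */
	return handle;
}
```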
The point is not just that the CPU is involved. The point is that the CPU is mostly arranging movement and ownership so devices can execute the hot path efficiently. If too much of the hot path returns to the host, efficiency collapses.
Code-level examples
Userspace RDMA verbs
The Linux RDMA userspace path revolves around registering memory, creating queue pairs and completion queues, and posting work requests. After setup, the NIC fast path can operate with very low host CPU involvement.
```c
// illustrative only
#include <stdint.h>
#include <infiniband/verbs.h>

struct ibv_device **list = ibv_get_device_list(NULL);
struct ibv_context *ctx = ibv_open_device(list[0]);
struct ibv_pd *pd = ibv_alloc_pd(ctx);
struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

struct ibv_qp_init_attr qpia = {
	.send_cq = cq,
	.recv_cq = cq,
	.qp_type = IBV_QPT_RC,	/* reliable connected */
	.cap = {
		.max_send_wr = 128,
		.max_recv_wr = 128,
		.max_send_sge = 1,
		.max_recv_sge = 1,
	},
};
struct ibv_qp *qp = ibv_create_qp(pd, &qpia);

/* Register memory once; the NIC can then DMA it directly. */
char buf[4096];
struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf),
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_WRITE |
			       IBV_ACCESS_REMOTE_READ);

struct ibv_sge sge = {
	.addr = (uintptr_t)buf,
	.length = sizeof(buf),
	.lkey = mr->lkey,
};
struct ibv_send_wr wr = {
	.wr_id = 1,
	.sg_list = &sge,
	.num_sge = 1,
	.opcode = IBV_WR_SEND,
	.send_flags = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad = NULL;
ibv_post_send(qp, &wr, &bad);
```
Real code must transition the queue pair through RESET → INIT → RTR → RTS, connect to a peer, post receives, and poll completions carefully. But the example shows the mental model: register memory once, point the NIC at it, and let hardware move the bytes.
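As a taste of that completion handling, here is a minimal busy-polling sketch for the signaled send above; production code often blocks on a completion channel instead:

```c
// illustrative only
#include <infiniband/verbs.h>

/* Spin until the signaled send work request completes. */
static int wait_for_send(struct ibv_cq *cq)
{
	struct ibv_wc wc;
	int n;

	do {
		n = ibv_poll_cq(cq, 1, &wc); /* non-blocking; 0 = empty */
	} while (n == 0);

	if (n < 0 || wc.status != IBV_WC_SUCCESS)
		return -1; /* real code logs wc.status and wc.wr_id */
	return 0;
}
```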
AF_XDP zero-copy packet path
AF_XDP sockets let an XDP program redirect packets into userspace memory. When the driver and hardware support it, binding with XDP_ZEROCOPY avoids copying packet data out to userspace.
```c
// illustrative only
#include <linux/if_xdp.h>
#include <net/if.h>
#include <sys/socket.h>

int fd = socket(AF_XDP, SOCK_RAW, 0);

/*
 * UMEM setup omitted:
 * - create fill/completion rings
 * - register UMEM
 * - create RX/TX rings
 */

struct sockaddr_xdp sxdp = {
	.sxdp_family = AF_XDP,
	.sxdp_ifindex = if_nametoindex("eth0"),
	.sxdp_queue_id = 0,
	.sxdp_flags = XDP_ZEROCOPY,
};
bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

/*
 * An XDP program loaded on the interface uses bpf_redirect_map()
 * to steer matching ingress packets into this socket's queue.
 */
```
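Once the rings exist, receiving is just reading descriptors out of shared memory. A sketch using the xsk_* helpers from libxdp (rx and umem_area would come from the UMEM setup omitted above):

```c
// illustrative only
#include <xdp/xsk.h> /* libxdp helpers; older trees ship bpf/xsk.h */

/* Consume a batch of received frames without copying them. */
static void drain_rx(struct xsk_ring_cons *rx, void *umem_area)
{
	__u32 idx = 0;
	unsigned int i, rcvd;

	rcvd = xsk_ring_cons__peek(rx, 64, &idx);
	for (i = 0; i < rcvd; i++) {
		const struct xdp_desc *desc =
			xsk_ring_cons__rx_desc(rx, idx + i);
		void *pkt = xsk_umem__get_data(umem_area, desc->addr);

		/* Process desc->len bytes at pkt in place. */
		(void)pkt;
	}
	xsk_ring_cons__release(rx, rcvd);
}
```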
That makes AF_XDP attractive when an application wants extreme packet rates without carrying the full burden of a DPDK-style stack.
Socket send with MSG_ZEROCOPY
Linux also supports copy avoidance on ordinary socket send calls. The important caveat from the kernel documentation is that this is a hint, not a guarantee. Applications must drain completion notifications from the error queue before they can safely reuse buffers.
```c
// illustrative only
#include <stdio.h>
#include <sys/socket.h>

int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

ssize_t sent = send(fd, big_buf, big_len, MSG_ZEROCOPY);
if (sent < 0) {
	perror("send");
}

/*
 * Later:
 * poll() for POLLERR and drain MSG_ERRQUEUE with recvmsg()
 * to learn which sends are complete and safe to reuse.
 */
```
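The completion drain that the comment above alludes to looks roughly like this; a sketch following the kernel's MSG_ZEROCOPY documentation, with error handling trimmed:

```c
// illustrative only
#include <linux/errqueue.h>
#include <sys/socket.h>

/* Reap one zero-copy completion from the socket error queue. */
static int drain_zerocopy_completion(int fd)
{
	char control[128];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return -1; /* EAGAIN: nothing completed yet */

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *serr =
			(struct sock_extended_err *)CMSG_DATA(cm);

		if (serr->ee_errno != 0 ||
		    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
			continue;
		/*
		 * [ee_info, ee_data] is the range of completed send
		 * calls; SO_EE_CODE_ZEROCOPY_COPIED in ee_code means
		 * the kernel fell back to copying after all.
		 */
		return 0;
	}
	return -1;
}
```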
This is one of the clearest examples of the tradeoff around zero-copy. You avoid a hot-path copy, but you introduce page pinning, completion handling, and more complex buffer lifecycle rules.
io_uring zero-copy receive
The newer io_uring zero-copy receive path is especially interesting for AI services because it preserves the kernel TCP stack while still reducing kernel-to-userspace copy overhead. NICs need hardware support such as header/data split and proper flow steering, but when those conditions hold, the receive path becomes much leaner.
Where GPUDirect RDMA changes the picture
GPUDirect RDMA matters when a NIC and GPU can exchange data more directly over PCIe, cutting out some host bounce-buffer work. NVIDIA's documentation is careful about the constraints: topology matters, PCIe placement matters, and IOMMU configuration matters. The cleanest case is when the NIC and GPU share a favorable root-complex and switch topology.
Conventional path vs. direct NIC-to-GPU movement
The payoff is lower CPU pressure and better accelerator feed efficiency. The cost is operational complexity and stricter platform constraints.
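From the application side, the pattern can be as small as registering device memory with verbs. A sketch assuming a CUDA allocation, a verbs provider with GPUDirect RDMA support, and the nvidia-peermem kernel module loaded (pd is the protection domain from the earlier example):

```c
// illustrative only
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register GPU memory so the NIC can DMA it directly over PCIe. */
void *gpu_buf;
cudaMalloc(&gpu_buf, 1 << 20);

struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, 1 << 20,
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_WRITE);
/*
 * If registration succeeds, work requests can point sge.addr at
 * gpu_buf, and payload bytes never transit host memory.
 */
```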
What this means for AI infrastructure
There is a common misconception that AI performance is just a matter of buying more GPU FLOPS. That mental model only works when the rest of the system is quiet. Agentic systems are not quiet. They are networked control systems with storage, policy, orchestration, retries, and memory movement happening around each inference step.
The winning architecture, therefore, is not the one with the prettiest accelerator spec sheet. It is the one that keeps expensive host and accelerator resources focused on the work only they can do.
- Use CPUs for control logic, not avoidable packet grunt work.
- Use SmartNICs and DPUs to move repetitive infrastructure tasks closer to the wire.
- Use RDMA when remote memory access and low CPU involvement truly matter.
- Use zero-copy carefully where buffer sizes, driver support, and lifecycle complexity justify it.
- Think in terms of whole-path efficiency, not isolated accelerator utilization.
Sources
Primary references used for the technical details and API descriptions in this essay:
- Linux kernel RDMA userspace verbs documentation
- Linux PCI driver documentation
- Linux DMA API guide
- Linux AF_XDP documentation
- Linux MSG_ZEROCOPY documentation
- Linux io_uring zero-copy receive documentation
- NVIDIA BlueField DPU operation docs
- NVIDIA BlueField modes of operation
- NVIDIA GPUDirect RDMA documentation
- NVIDIA GPU Operator GPUDirect RDMA overview
- Red Hat OpenShift SmartNIC hardware offloading docs