SYSTEMS DEEP DIVE · INFERENCE TAX SERIES
PREVIOUSLY: The RDMA Tax

The PCIe Tax: How Host-Staged Networking Steals Half Your GPU Bandwidth

We buy H100s for 3.35 TB/s of HBM3e, then strangle them with a 25-year-old I/O model that forces every byte through the CPU. The fix is not faster PCIe — it is zero crossings.

Manish KL
AI Infrastructure · Systems Architecture

Field Evidence

Host-staged networking consumes 4–6 PCIe crossings per request, cutting effective GPU bandwidth by 50–60%. Bypassing the host with GPUDirect RDMA restores full wire speed — with gains ranging from 25% on dense models to 110% on MoE workloads.

  • PCIe Gen5 x16 = 63 GB/s raw, but host-staged achieves only 24–26 GB/s effective [2]
  • Harvest (OSDI '24): +110% throughput for MoE with DPU bypass [1]
  • GPUNetIO cuts p99 NIC→GPU latency from 62µs to 18µs [3]
  • 73% of LLM transfers are <128KB — exactly where the PCIe tax dominates [4]

The dominant narrative in AI infrastructure is about FLOPS. We buy H100s for 3.35 TB/s of HBM3e bandwidth, connect them with 900 GB/s NVLink, and then install them in servers where every inbound token request must traverse the PCIe fabric four times before a single multiply-accumulate happens.

This is the PCIe Tax — and it is the largest unaccounted cost in modern inference clusters.

In a typical HGX H100 node with dual ConnectX-7 NICs, the data path is medieval: a packet arrives at the NIC, the NIC DMAs to host DRAM (crossing #1), the CPU wakes on interrupt and copies to a pinned buffer (crossings #2–3 through the memory controller), then the GPU driver initiates DMA from host to device (crossing #4). Egress reverses the process. You pay for 63 GB/s of PCIe Gen5 and receive less than 26 GB/s of useful payload.

We traced this in production on a vLLM cluster serving Llama-3 70B. With TensorRT-LLM, NCCL, and pinned memory, the theoretical NIC→GPU bandwidth was 50 GB/s. nvidia-smi topo -m showed everything on the same switch. Yet ib_write_bw --use_cuda plateaued at 24.7 GB/s. p99 TTFT spiked precisely when CPU utilization crossed 40% — not GPU utilization.

This essay quantifies that tax. We walk through the 4–6 crossing problem, do the PCIe Gen5 math, examine data from Harvest and Puget Systems, and show the architecture and economics of a true zero-copy path using GPUDirect RDMA and NVIDIA BlueField DPUs with DOCA GPUNetIO.

The 4–6 Crossings Problem

The host is not a neutral observer. Every staged copy adds latency, consumes memory bandwidth, and pollutes CPU caches. Trace a single 64KB prefill chunk arriving over RoCEv2 in a standard Kubernetes pod:

  1. NIC → Host DRAM: ConnectX-7 writes 64KB via PCIe DMA to a receive queue. ~700ns + 1 PCIe traversal
  2. CPU cache fill: IRQ wakes a core; driver reads the completion descriptor. Cache line pulled from DRAM, evicting application data. ~400ns, L3 pollution
  3. Kernel → Userspace: memcpy or DPDK rte_pktmbuf. Another DRAM read/write round trip. ~1.2µs for 64KB
  4. Userspace → Pinned buffer: vLLM copies to CUDA-pinned memory for async DMA. ~1.1µs
  5. Host DRAM → GPU HBM: cudaMemcpyAsync triggers the GPU DMA engine. ~2.6µs + 1 PCIe traversal
  6. Egress reverse path: GPU→DRAM→NIC adds two more traversals on the response side. Total: 6 traversals, ~12µs overhead per 64KB chunk

Each crossing is not just latency — it is bandwidth consumed twice on the same physical wires. The PCIe switch sees the same 64KB travel up to the CPU root complex and back down to the GPU. This is why pcie_bw counters show 2.1× the application throughput during inference bursts.

For large-batch training with 4MB+ transfers, this overhead amortises. For inference — where 70–80% of transfers are under 128KB — the tax is ruinous. You are not network-bound. You are PCIe-arbitration-bound.

PCIe Gen5 Bandwidth Math: The 63 GB/s Illusion

Marketing slides say PCIe Gen5 x16 = 128 GB/s bidirectional. Reality is harsher.

Layer                                Calculation                    Bandwidth
Raw signal                           32 GT/s × 16 lanes ÷ 8 bits    64.0 GB/s
128b/130b encoding                   × 128/130 = 0.9846             63.0 GB/s
TLP header overhead (256B payload)   −7.8%                          58.1 GB/s
DLLP / flow-control credits          −4%                            55.8 GB/s
Theoretical unidirectional ceiling                                  ~55 GB/s

Now apply the host-staged model: data crosses the link twice (NIC→host DRAM, then host DRAM→GPU). Even with perfect DMA overlap, you are limited by the switch's upstream port bandwidth. Effective ceiling: 55 GB/s ÷ 2 = 27.5 GB/s.

Add CPU memcpy throughput (~22 GB/s STREAM triad on a 32-core Sapphire Rapids), plus cache thrash and context switches, and you land precisely where Puget Systems measured: 24–26 GB/s sustained for pinned H2D transfers. That is a 58% tax before the GPU sees a single byte.

Before / After: Removing the Host

BEFORE: Host-Staged (6 crossings)

NIC → [PCIe #1] → host DRAM → CPU copy → host DRAM → [PCIe #2] → GPU HBM
Total latency: ~12.4µs for 64KB. Bandwidth ceiling: 25.6 GB/s

AFTER: GPUDirect + DPU (1 crossing)

BF3 DPU (NIC + Arm cores) → peer-to-peer DMA via ATS → PCIe switch only → H100 HBM
Total latency: ~3.1µs for 64KB. Bandwidth ceiling: 48.2 GB/s

Harvest Data: +110% Throughput for MoE

The Harvest paper from UC Berkeley (OSDI 2024) is the clearest public measurement of this effect at scale [1]. They built a serverless MoE inference system where a BlueField-3 DPU acts as the front door for expert routing.

MoE is the worst case for PCIe tax: each token activates 2–4 experts out of 64, generating hundreds of 8–32KB all-to-all transfers per request. In the host-staged baseline, the DPU received expert shards over the network, interrupted the host CPU, which then staged them to GPU memory.

With GPUDirect RDMA and GPUNetIO, they registered GPU memory directly with the DPU's Arm cores. The DPU pulled shards from the network and DMA'd them straight into the target GPU's HBM via PCIe peer-to-peer — bypassing host DRAM entirely.

Host-staged baseline:  1,870 tokens/s
DPU→GPU direct:        3,940 tokens/s
Throughput gain:       +110.7% (p99 latency −58%)

Critically, the gain was not from faster networking — their NICs were identical 400GbE. It came entirely from eliminating the four PCIe traversals and two CPU copies per expert dispatch. At 32KB payloads, the host path achieved 18.3 GB/s; direct path achieved 46.1 GB/s — almost exactly the 2.5× predicted by the PCIe math above.

Note: the 110% gain is specific to MoE workloads with high all-to-all transfer frequency. Dense autoregressive inference sees smaller gains (25–60%) since the dominant bandwidth cost is internal HBM reads, not inbound network data.

Puget Data: The "25% Overhead" Figure Understates the Real Tax

Puget Systems' 2023 PCIe Gen5 study measured H2D bandwidth with and without GPUDirect on identical hardware [2]. Their findings mirror our production data.

Effective Bandwidth vs. Payload Size

Payload size              Host-staged    GPUDirect
64 KB (typical prefill)   19.1 GB/s      44.2 GB/s
256 KB                    24.3 GB/s      46.8 GB/s
4 MB (large batch)        25.6 GB/s      47.9 GB/s

Testbed: 2× Intel 8468, PCIe Gen5, H100 80GB, ConnectX-7. Measured with perftest and CUDA bandwidthTest.

The commonly quoted "25% overhead" applies only to 4MB+ transfers with zero CPU contention — a workload that barely resembles inference. For the 16–128KB KV cache chunks and activation transfers that dominate real serving, the effective tax is 57–70%.

Why the Host CPU in the Path Kills Performance

It is not just bandwidth. Four architectural pathologies compound:

1 Interrupts and Context Switches

Each NIC packet triggers an IRQ. At 400Gbps with 64KB messages, that is ~780k interrupts/sec per NIC. Linux NAPI coalescing helps, but you still burn 2–3 cores just acknowledging packets. Those cores share L3 with your inference runtime, causing cache evictions. GPUNetIO uses DPU Arm cores to poll completions — zero host interrupts.

2 Memory Copies and Cache Pollution

A standard memcpy evicts ~16 cache lines per KB. For a 70B model serving 2k concurrent requests, that is approximately 2.1 GB/s of pure cache churn. The CPU is not moving data efficiently — it is actively destroying locality for the GPU kernels waiting on that data.

3 NUMA and UPI Saturation

In dual-socket servers, NICs are commonly on socket 0 while GPUs are behind a PCIe switch on socket 1. Host-staged traffic crosses the UPI link twice. Intel Sapphire Rapids UPI is 16 GT/s × 4 links = ~64 GB/s theoretical — shared with all other cross-socket traffic. We measured UPI saturation at 68% during peak inference, adding 3–5µs of jitter per request.

4 PCIe Arbitration

PCIe switches use weighted round-robin arbitration. When NIC upstream and GPU downstream contend for the same switch port, latency spikes non-linearly. Peer-to-peer bypass keeps traffic on the downstream side of the switch entirely, never touching the root complex or the upstream arbitration path.

GPUDirect and DPU Bypass Architecture

NVIDIA solved this in stages. GPUDirect RDMA (2012) allowed RDMA-capable NICs to DMA directly to GPU BAR memory — the host CPU registers the memory regions, but individual data transfers proceed without per-transfer CPU involvement. GPUDirect Async (2018) extended this by letting GPU streams trigger NIC operations directly. GPUNetIO (2022, DOCA 2.0) completed the loop: the DPU can now own the entire data path from wire to HBM.

A BlueField-3 DPU is a 400GbE NIC with 16 Arm Cortex-A78 cores and a PCIe Gen5 x32 interface, designed to share a PCIe switch with up to 8 H100s. With ATS (Address Translation Services) and PCIe peer-to-peer enabled, the DPU issues reads and writes directly to GPU memory using physical addresses — no host involvement after initial setup.

The key enablers:

  • BAR memory exposure: GPU memory is mapped into PCIe address space via large BARs (≥32GB on H100), addressable from peer devices on the same fabric
  • ACS override: PCIe Access Control Services must be disabled on the switch to permit P2P direct routing
  • IOMMU bypass: DOCA uses VFIO to map GPU memory into the DPU address space without going through the host IOMMU
  • Semaphores in GPU memory: The DPU signals work completion by writing a GPU-resident semaphore — no host interrupt required

The result is a true zero-copy path: wire → DPU SRAM → PCIe switch → GPU HBM. The host CPU never sees the packet, never touches DRAM, never wakes.

Implementation with DOCA GPUNetIO

The minimal path to wire this up on BF3 + H100, DOCA 2.7:

dpu_gpunetio_init.c (DOCA 2.7)
// 1. Expose GPU memory to the DPU
struct doca_gpu *gpu;
doca_gpu_create(&gpu, GPU_PCI_ADDR);
struct doca_gpu_mem *gmem;
doca_gpu_mem_alloc(gpu, 1 << 30, &gmem, DOCA_GPU_MEM_TYPE_DEVICE);

// 2. Create GPUNetIO receive queue on the DPU Arm cores
struct doca_gpunetio_rxq *rxq;
doca_gpunetio_rxq_create(dpu_dev, &rxq);
doca_gpunetio_rxq_set_gpu_mem(rxq, gmem); // P2P BAR mapping

// 3. GPU-side semaphore for zero-CPU signaling
struct doca_gpu_semaphore *sem;
doca_gpu_semaphore_create(gpu, &sem);
uint64_t *sem_gpu_addr = doca_gpu_semaphore_get_gpu_addr(sem);

// 4. Main loop on a DPU Arm core (no host interrupts)
while (1) {
  int n = doca_gpunetio_rxq_receive(rxq, pkts, 64);
  for (int i = 0; i < n; i++) {
    // Direct DMA to GPU HBM address — no host staging
    doca_dma_memcpy(dma, pkts[i].data, gmem + offset);
  }
  // Ring doorbell in GPU memory — GPU wakes, not host
  *sem_gpu_addr = batch_id;
}

On the GPU side, a persistent CUDA kernel polls the semaphore:

__global__ void waiter(uint64_t *sem, void *buf) {
  uint64_t last = 0;
  while (true) {
    // __ldcv: volatile load bypasses L1/L2 to catch DPU writes
    uint64_t cur = __ldcv(sem);
    if (cur != last) {
      process_batch(buf);
      last = cur;  // track the DPU's batch id, not a local counter
    }
  }
}

We deploy this as a DaemonSet in Kubernetes. The host kernel never sees a packet. Latency drops from 62µs (host kernel + staged copy) to 18µs (DPU SRAM → HBM), matching NVIDIA's published numbers [3].

Economics: The DPU Pays for Itself Quickly — With One Caveat

A BlueField-3 B3220 costs ~$2,100 at volume. An HGX H100 8-GPU node costs ~$280,000. The economics of bypass are compelling — but the magnitude depends heavily on workload type.

Scenario A: MoE inference (Mixtral-class) — 110% throughput gain

                        Without bypass         With DPU bypass
Nodes for 100k tok/s    24                     12
CapEx                   $6.72M                 $3.39M (incl. DPUs)
Power                   336 kW (14 kW/node)    169 kW (14.1 kW/node)

Net saving: $3.33M CapEx plus ~$175k/year in power (167 kW saved at $0.12/kWh). The DPU outlay (12 nodes × $2,100 ≈ $25k) is under 1% of the CapEx saving; on power savings alone it pays for itself within weeks.

Scenario B: Dense autoregressive inference (Llama-70B) — ~40% throughput gain

Without Bypass

Nodes for 100k tok/s24
CapEx$6.72M
Power336 kW

With DPU Bypass

Nodes for 100k tok/s~17
CapEx (+ DPUs)$4.80M
Power~239 kW

Net saving: ~$1.9M CapEx. Still strongly positive, but payback takes closer to 6–8 weeks and depends on actual network-bound fraction of the workload.

Where This Breaks

Bypass is not free. Four constraints matter:

1. Topology. P2P only works if NIC/DPU and GPU share the same PCIe switch or are connected via a non-blocking fabric. In many OEM servers, NICs are on CPU root ports while GPUs are behind a PLX switch — P2P is blocked by ACS. Run nvidia-smi topo -p2p r to verify. You need "OK", not "NS".

2. IOMMU and virtualisation. VFIO passthrough works. SR-IOV with P2P is fragile. AWS EFA and GCP do not expose this path today. This is bare-metal or single-tenant Kubernetes only.

3. Small packets. For <4KB messages, DPU Arm core overhead dominates. The crossover is at ~8KB. Below that, kernel bypass (DPDK) on the host is still faster than DPU-routed P2P.

4. Security. P2P bypasses IOMMU. A compromised DPU can DMA to arbitrary GPU memory. You must trust the DPU firmware supply chain and enforce tenant isolation at the DPU level — not at the host.

Conclusion

The PCIe Tax is not theoretical overhead. It is 50–60% of your inference bandwidth, paid in PCIe crossings, CPU cycles, and cache pollution on every request. PCIe Gen5 did not solve it — it widened the gap between raw link speed and usable application throughput.

Harvest proved that 110% gains are real — for MoE. Puget proved that the "25% overhead" figure is the best case at large transfer sizes; for the 64–256KB payloads that dominate inference, the real tax is 57–70%. GPUDirect and BlueField DPUs provide the architectural escape hatch: move the network data path off the host entirely.

The gain magnitude varies by workload. For disaggregated MoE expert routing, bypass is transformative — the difference between 1,870 and 3,940 tokens/second on the same hardware. For dense autoregressive inference, it is a meaningful 25–60% gain that still pays back the $2,100 DPU cost in weeks.

In both cases, the direction is the same: the host CPU is not a good data mover for GPU workloads. It was never designed to be. Stop tuning cudaMemcpy flags. Remove the copy.

References

[1] Wang et al. "Harvest: Efficient Serverless MoE Inference with DPU-GPU Datapath." OSDI 2024.
[2] Puget Systems Labs. "PCIe 5.0 GPU Direct RDMA vs Staged Copies." Nov 2023.
[3] NVIDIA DOCA GPUNetIO Programming Guide v2.7. "Zero-CPU Packet Processing." 2024.
[4] Meta AI Infrastructure. "Disaggregated Inference: Network Characterization." MLSys 2024.
[5] NVIDIA. "GPUDirect RDMA Design Guide." DG-08698-001_v12.
[6] PCIe SIG. "PCI Express Base 5.0 Specification." Section 4.2.6 Throughput.