800G Ethernet vs. InfiniBand: The AI Scale-Out Decision Nobody Documents Honestly
The InfiniBand vs. Ethernet debate has been running for a decade. Most of it has been vendor marketing. This essay cuts through to the actual physics, protocol, and operational tradeoffs — with specific numbers — to explain why the answer depends entirely on your traffic pattern, your congestion control requirements, and your operational team.
- InfiniBand NDR: 400 Gbps per port; XDR (roadmap): 800 Gbps per port
- 800G Ethernet (Ultra Ethernet / RoCEv2): 800 Gbps per port, now in production at hyperscalers
- InfiniBand latency advantage: ~0.6 µs MPI vs. ~1.2-1.8 µs for RoCEv2 at equivalent scale
- Meta's RoCEv2 cluster: over 24,000 GPUs running production training on Ethernet-based RDMA
- Congestion control is where 90% of the real performance difference lives — not raw bandwidth
- Why the debate is framed wrong
- InfiniBand architecture: what it actually provides
- RoCEv2: Ethernet's RDMA answer and its real constraints
- Congestion control: where the performance gap actually lives
- Ultra Ethernet Consortium: fixing RoCEv2's congestion problem
- AI training traffic patterns and which fabric handles them better
- Inference traffic: a different animal entirely
- Operational reality: who can run IB and who should use Ethernet
- The actual decision framework
- The convergence thesis and what it means for future clusters
1. Why the debate is framed wrong
The IB vs. Ethernet debate is usually framed as a bandwidth question: which technology provides more GB/s between GPUs? At 800G, both technologies are at parity on raw bandwidth per port. The question resolved itself at the physical layer years ago. Yet the debate persists because the bandwidth framing misses what actually matters.
The real question is: for the specific traffic patterns of AI training and inference, which fabric provides better tail latency, better congestion behavior under adversarial load, and lower operational cost at scale? The answer to each of these is different, and they point in different directions depending on your workload.
The framing this essay uses: Raw bandwidth is not the decision variable. Congestion control, transport semantics, and operational complexity are. Most teams making the IB vs. Ethernet decision are optimizing for bandwidth (which is at parity) instead of congestion behavior and operational fit (which differ substantially).
2. InfiniBand architecture: what it actually provides
InfiniBand is a complete network architecture — it includes the physical layer, data link layer, network layer, transport layer, and a rich set of services built on top. Understanding what each layer provides is necessary to understand where IB's advantages come from and which of them are genuinely hard to replicate.
Physical layer: InfiniBand NDR (Next Data Rate) provides 400 Gbps per port using 4× 100 Gbps lanes with PAM4 signaling. XDR (roadmap) doubles this to 800 Gbps. The physical layer is now at parity with 800G Ethernet — no intrinsic advantage either way.
Data link layer: InfiniBand uses a credit-based flow control scheme at the link level. Before sending data, a transmitter must have credits from the receiver indicating available buffer space. Credits are fine-grained (measured in 64-byte units). This makes InfiniBand links lossless by construction — packets are never dropped at the switch level because the switch only accepts a packet when it has confirmed the downstream link has buffer space. Losslessness is foundational to RDMA performance.
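The credit scheme can be sketched as a small model. Everything here except the 64-byte credit unit (stated in the paragraph above) is an illustrative assumption: the buffer size, packet sizes, and class structure are invented for the sketch, not taken from the IB specification.

```python
# Illustrative model of IB-style credit-based link-level flow control.
# The 64-byte credit unit follows the text; buffer and packet sizes
# are made-up example values.

CREDIT_UNIT = 64  # bytes per credit, per the IB link-level scheme

class Receiver:
    def __init__(self, buffer_bytes):
        self.credits = buffer_bytes // CREDIT_UNIT  # advertised free buffer space

    def accept(self, nbytes):
        self.credits -= -(-nbytes // CREDIT_UNIT)   # consume ceil(n/64) credits

    def drain(self, nbytes):
        self.credits += nbytes // CREDIT_UNIT       # buffer freed -> credits returned

class Sender:
    def can_send(self, rx, nbytes):
        # Lossless by construction: transmit only if the receiver has
        # already confirmed buffer space for the whole packet.
        return rx.credits * CREDIT_UNIT >= nbytes

rx = Receiver(buffer_bytes=8192)      # 128 credits
tx = Sender()
assert tx.can_send(rx, 4096)          # fits: send proceeds
rx.accept(4096)
assert not tx.can_send(rx, 8192)      # would overflow: sender stalls, nothing is dropped
rx.drain(4096)                        # receiver drains, credits flow back
assert tx.can_send(rx, 8192)
```

The point of the model is the asymmetry with PFC: the sender stalls *before* the buffer can overflow, rather than being paused after it already has.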
Network layer: InfiniBand has its own routable network layer with 16-bit Local Identifiers (LIDs) for subnet routing and 128-bit Global Identifiers (GIDs) for inter-subnet routing. The subnet manager — an IB-specific control plane component — computes routing tables and distributes them to all switches. This is fundamentally different from IP routing and requires IB-specific tooling and expertise.
Transport layer: IB provides multiple transport types — Reliable Connected (RC), Unreliable Connected (UC), and Unreliable Datagram (UD) — each with different performance and semantic tradeoffs. RDMA operations (Read, Write, Send, Atomic) are implemented in hardware on the Host Channel Adapter (HCA). The hardware transport means that RDMA operations complete without CPU involvement in the data path.
2.1 What makes IB genuinely hard to replicate
Three IB properties are difficult to replicate with Ethernet:
End-to-end credit-based flow control. IB's credit system extends from sender through every switch hop to receiver. At no point in the network can a packet be dropped due to buffer overflow — the network back-pressures the sender before buffers fill. Ethernet's flow control (PFC — Priority Flow Control) operates per hop and reacts only after buffers fill; the resulting pause frames propagate backward through the fabric, causing head-of-line blocking and, in the worst case, deadlock.
Subnet manager and adaptive routing. IB's centralized subnet manager can observe the global traffic matrix and update routing tables in real-time to avoid hot links. Ethernet's distributed ECMP routing makes locally-optimal decisions that can globally create congestion. IB's adaptive routing (NVIDIA's SHIELD, Cornelis's ARK) reroutes individual packets based on observed per-port queue depths.
Hardware-level congestion notification. IB switches inject congestion notification packets into the data stream when they detect queue buildup. Senders receive these notifications and reduce their injection rate. This is a hardware-level closed-loop control system with ~microsecond reaction time. Ethernet's DCQCN (Data Center Quantized Congestion Notification) is a software-assisted approximation with higher latency in the control loop.
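For contrast, here is a simplified sketch of what the DCQCN sender-side loop does when congestion notifications arrive. The multiplicative decrease and the EWMA congestion estimate follow DCQCN's published design in spirit, but the gain `G`, the timer behavior, and the collapsed recovery stages are simplifications for illustration, not the real algorithm.

```python
# Simplified sketch of DCQCN's sender-side rate control (the SW-assisted
# loop the text contrasts with IB's hardware loop). Real DCQCN has
# fast-recovery / additive-increase / hyper-increase stages; this keeps
# only the multiplicative decrease and the EWMA estimate alpha.

G = 1 / 16  # EWMA gain for the congestion estimate (illustrative value)

class DcqcnSender:
    def __init__(self, line_rate_gbps):
        self.rate = line_rate_gbps    # current sending rate
        self.target = line_rate_gbps  # rate to recover toward
        self.alpha = 1.0              # congestion estimate in [0, 1]

    def on_cnp(self):
        # Congestion Notification Packet received: cut the rate.
        self.target = self.rate
        self.alpha = (1 - G) * self.alpha + G
        self.rate *= 1 - self.alpha / 2

    def on_timer_no_cnp(self):
        # Quiet period: decay alpha, recover halfway toward the target.
        self.alpha *= 1 - G
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender(800.0)
s.on_cnp()              # one marked RTT cuts the rate to 400 Gbps
assert s.rate == 400.0
for _ in range(10):     # several quiet timer periods: rate recovers
    s.on_timer_no_cnp()
assert s.rate > 799.0
```

Note what is missing compared with IB: every step of this loop happens at the end host, an RTT after the queue actually built, which is exactly the delay the hardware credit loop avoids.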
3. RoCEv2: Ethernet's RDMA answer and its real constraints
RoCEv2 (RDMA over Converged Ethernet v2) is the standard for running RDMA verbs over IP/Ethernet. It provides the same RDMA programming model as InfiniBand — the same verbs (Send/Receive, RDMA Write/Read, Atomic), the same memory registration model, the same queue pair abstraction — but runs over IP/UDP rather than IB transport.
This means that software written for IB RDMA can run on RoCEv2 with minimal changes. NCCL supports both. The MPI implementations support both. The application layer is largely portable. The difference is in what happens when traffic gets congested.
RoCEv2 vs. IB transport comparison:

| Property | IB RC transport | RoCEv2 (UDP/IP) |
|---|---|---|
| Lossless guarantee | Yes (credit-based) | Requires PFC + ECN (lossy by default) |
| Congestion control | Hardware CC (SHIELD) | DCQCN (SW-assisted, higher latency) |
| Timeout/retry | Hardware retransmit | SW retransmit via ibverbs |
| Packet ordering | Guaranteed per QP | Guaranteed per flow (5-tuple) |
| Multipath | Adaptive routing | ECMP (per-flow, not per-packet) |
| Deadlock risk | Low (per-VL credits) | Yes under PFC with cyclic deps |
| MTU | 4096 bytes | Up to 9000 bytes (jumbo frames) |
| Address model | LID/GID | IP address |
| Subnet management | IB Subnet Manager | Standard IP routing |
| Ops complexity | High (IB-specific) | Moderate (IP tools work) |
The losslessness issue is the core. RDMA over a lossy network is not just slower — for many operations it breaks. An RDMA Write lands directly in memory the receiver has registered in advance, with no CPU on the data path to notice a gap; if a packet is lost in transit, the transport's hardware state machine must detect and recover it, or the operation fails. The RDMA transport must therefore guarantee delivery. IB does this in hardware through credits. RoCEv2 does it through a combination of PFC (to make the network lossless) and retransmission (as a backup when PFC fails).
3.1 The PFC problem: pause frames and deadlock
PFC (Priority Flow Control) is Ethernet's mechanism for creating a lossless network. When a switch port's receive buffer fills, it sends a PAUSE frame to the upstream sender on that specific traffic class, stopping that sender for a specified duration. This prevents the switch from dropping packets.
The problem is that PAUSE frames propagate. If switch A pauses sender B, and sender B was also receiving data from sender C, then B's switch port toward C fills and B pauses C. The pause propagates backward through the fabric. In a network with cyclic dependencies — which is easy to create accidentally with multipath routing — a permanent deadlock can form where every flow is paused waiting for a flow that is also paused.
InfiniBand avoids this through its credit system: no packet is ever sent unless the downstream buffer has confirmed capacity. PFC-based systems react after the fact — they pause when buffer fills rather than preventing the fill. The reaction latency means that PFC is always operating in a mode where buffers are partially full, which creates baseline latency degradation even when deadlock does not occur.
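A toy discrete-time model makes the cascade concrete: one slow egress port at the end of a chain of hops, with each hop asserting PAUSE upstream once its queue crosses a threshold. All parameters here (rates, threshold, hop count) are invented for illustration and do not correspond to real switch settings.

```python
# Toy model of PFC back-pressure propagating up a chain of switch hops.
# Queue units, threshold, and rates are illustrative, not real switch
# parameters; the linear chain models propagation, not cyclic deadlock.

XOFF = 80      # PAUSE threshold (queue units)
IN_RATE = 10   # units injected per tick at the source
DRAIN = 4      # units the congested egress drains per tick

def simulate(hops=4, ticks=120):
    q = [0] * hops              # queue depth per hop; hop 0 is nearest the sender
    paused_at = [None] * hops   # tick at which each hop first asserted PAUSE
    for t in range(ticks):
        q[-1] = max(0, q[-1] - DRAIN)      # last hop is the congestion point
        for h in range(hops - 1):
            if q[h + 1] < XOFF:            # downstream below threshold: forward
                q[h + 1] += q[h]
                q[h] = 0
            elif paused_at[h + 1] is None:
                paused_at[h + 1] = t       # downstream asserted PAUSE at this hop
        q[0] += IN_RATE                    # the sender keeps injecting at full rate
        if q[0] >= XOFF and paused_at[0] is None:
            paused_at[0] = t               # PAUSE has propagated back to the sender
    return paused_at

print(simulate())  # PAUSE appears first at the congestion point, then walks backward
```

Running it shows the pattern the paragraph above describes: the hop adjacent to the slow port pauses first, and the pause front then walks backward hop by hop until the sender itself is stalled, with every queue along the way held near its threshold.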
4. Congestion control: where the performance gap actually lives
At 800G with no congestion, IB and RoCEv2 perform comparably. The performance gap opens under congestion — and AI training generates a lot of congestion. AllReduce operations across thousands of GPUs create incast patterns: many senders simultaneously targeting a small number of aggregation points. This is exactly the traffic pattern where congestion control quality matters most.
Incast congestion scenario — AllReduce at 1,024 GPUs:

```
AllReduce ring-reduce phase:
  1,024 GPUs, each sending 1/1,024 of the gradient tensor to its next ring neighbor
  Simultaneously: multiple aggregation steps in progress
  → Multiple flows arriving at each switch port simultaneously
  → Switch port receive buffer fills in microseconds

IB response:
  Credit system: sender throttles before buffer fills
  Adaptive routing: hot ports detected, flows rerouted within microseconds
  Result: buffer occupancy stays low, latency increase ~10-20%

RoCEv2 / DCQCN response:
  ECN marks packets when queue exceeds threshold
  Marked packets trigger rate reduction at sender (via CNP packets)
  Round-trip time for notification: ~2-5 µs
  During that RTT: packets continue arriving at full rate
  → Queue builds, PFC PAUSE triggered
  → PAUSE propagates backward: affects ALL flows on that port, not just the congested one
  → Head-of-line blocking: non-congested flows paused alongside congested ones
  Result: p99 latency increase 3-5× during AllReduce phases
```
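The arithmetic behind the queue-buildup step is worth doing explicitly. The line rate and notification RTT come from the scenario above; the 8:1 incast degree is an assumed example value.

```python
# Why the 2-5 µs CNP round trip matters at 800G: everything that arrives
# before the senders react has to sit in the switch buffer. Rates and
# RTT are from the text; the 8:1 incast degree is an example assumption.

LINE_RATE = 800e9 / 8    # 800 Gbps per port, in bytes/s (100 GB/s)
INCAST = 8               # example: 8 senders converge on one egress port
CNP_RTT = 3e-6           # mid-range of the ~2-5 µs notification RTT

arrival = INCAST * LINE_RATE
drain = LINE_RATE        # the single egress port drains at line rate
excess = (arrival - drain) * CNP_RTT

print(f"Excess queued before senders react: {excess / 1e6:.1f} MB")  # 2.1 MB
```

A couple of megabytes queued per incast event is enough to cross PFC XOFF thresholds on shallow-buffer switch ASICs, which is exactly the PAUSE trigger the scenario describes.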
The 3-5× p99 latency increase during AllReduce is not a worst case — it is typical in production clusters under heavy training load. It translates directly into GPU utilization: GPUs waiting for an AllReduce to complete sit idle, and because every participant waits at the synchronization barrier, the tail latency of the collective sets the effective utilization of the entire cluster.
5. Ultra Ethernet Consortium: fixing RoCEv2's congestion problem
The Ultra Ethernet Consortium (UEC), formed in 2023 with participation from AMD, Broadcom, Cisco, Intel, Meta, and Microsoft, is developing a new transport standard designed to provide RDMA semantics over Ethernet with better congestion control than RoCEv2. The key technical improvements are:
Per-packet adaptive routing. Unlike ECMP (which assigns entire flows to paths), UEC specifies per-packet load balancing with out-of-order delivery tolerance. This allows traffic to use all available paths simultaneously rather than being hashed onto a subset.
Packet spraying. Packets from a single flow can be sent across multiple paths and reordered at the receiver. This eliminates the "unlucky hash" problem in ECMP where a flow consistently maps to a congested path.
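Packet spraying only works if the receiver can restore order. Here is a minimal sketch of the reorder-buffer idea, with invented path latencies and a simple sequence-number scheme; UEC's actual wire format and reassembly machinery are not modeled.

```python
# Sketch of receiver-side reordering for per-packet spraying: the
# sender round-robins packets across paths with different (invented)
# latencies, and a reorder buffer restores sequence order.

import heapq
import random

def spray_and_reorder(n_packets, n_paths, seed=0):
    rng = random.Random(seed)
    # Sender side: round-robin each packet onto a path; each path
    # contributes a base latency plus jitter (all illustrative).
    arrivals = []  # min-heap of (arrival_time, sequence_number)
    for seq in range(n_packets):
        path = seq % n_paths
        latency = 1.0 + 0.3 * path + rng.random() * 0.5
        heapq.heappush(arrivals, (latency + seq * 0.01, seq))
    # Receiver side: packets pop in arrival order (out of sequence);
    # the reorder buffer releases them to the application in order.
    reorder_buf, next_seq, delivered = {}, 0, []
    while arrivals:
        _, seq = heapq.heappop(arrivals)
        reorder_buf[seq] = True
        while next_seq in reorder_buf:
            delivered.append(next_seq)
            del reorder_buf[next_seq]
            next_seq += 1
    return delivered

assert spray_and_reorder(64, 4) == list(range(64))  # in-order delivery restored
```

The design tradeoff is visible even in the toy: spraying buys full path utilization at the cost of receiver state (the reorder buffer) proportional to the latency skew between paths.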
Improved congestion control with sub-RTT reaction. UEC specifies switch-assisted congestion signals that are injected into the data path with lower latency than DCQCN's ECN-based mechanism, enabling faster reaction without PFC propagation.
| Property | Standard RoCEv2 | Ultra Ethernet (UEC) | InfiniBand NDR |
|---|---|---|---|
| Lossless mechanism | PFC + ECN | UEC CC (no PFC needed) | Credit-based (no loss) |
| Multipath | ECMP per-flow | Per-packet spray | Adaptive per-packet (SHIELD) |
| Congestion reaction | ~RTT (µs) | Sub-RTT (switch-assisted) | Hardware (~0.1 µs) |
| Deadlock risk | Yes (PFC) | Low (no PFC) | Very low (credits) |
| Ecosystem | Multi-vendor, mature | Emerging (2025-2026) | NVIDIA-dominated |
| Ops complexity | Moderate | Moderate | High |
UEC is promising but has not been widely deployed in production AI clusters as of early 2026. The implementations are arriving in silicon from Broadcom (Thor-2) and Marvell (Octeon 10), but the ecosystem of drivers, NCCL backends, and operational tooling is still maturing. The teams making cluster fabric decisions today are choosing between proven RoCEv2 (with known congestion issues) and proven IB (with known operational complexity), not between either of those and a not-yet-validated UEC deployment.
6. AI training traffic patterns and which fabric handles them better
AI training generates four distinct traffic patterns that stress the network differently:
AllReduce (data parallel gradient sync): Many-to-many, high bandwidth, sustained. IB's adaptive routing and hardware congestion control handle the incast patterns better. RoCEv2 with well-tuned DCQCN is adequate but requires careful configuration and more conservative bandwidth utilization (typically 70-75% of line rate to keep PFC from triggering).
Pipeline parallel activation passing: Directed point-to-point flows between adjacent pipeline stages. Both IB and RoCEv2 handle this well — congestion is minimal because traffic patterns are regular and predictable. No meaningful advantage for either.
Expert dispatch (MoE): Many-to-few, bursty, with hot destinations. IB's adaptive routing reroutes flows away from hot expert GPUs more quickly. RoCEv2 ECMP creates persistent hot paths. This is a measurable IB advantage for large MoE training.
Checkpoint I/O: Periodic bulk transfers from GPUs to storage. Standard TCP/IP or RDMA to storage targets. Both IB and Ethernet work. Not a differentiator.
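To see why the conservative RoCEv2 utilization in the AllReduce bullet matters, the standard ring-AllReduce volume formula (each GPU moves 2(N-1)/N times the tensor size over the wire) can be priced out. The model size and the effective-bandwidth fractions are example numbers; per-step latency and overlap with compute are ignored.

```python
# Back-of-envelope: wire time for ring AllReduce at two effective
# bandwidths. The 70B-parameter model and the utilization fractions
# are example assumptions; the 2*(N-1)/N volume formula is standard.

def ring_allreduce_time(n_gpus, tensor_bytes, eff_bw_bytes_per_s):
    volume = 2 * (n_gpus - 1) / n_gpus * tensor_bytes  # per-GPU bytes on the wire
    return volume / eff_bw_bytes_per_s

S = 2 * 70e9       # example: 70B-parameter gradients in bf16 (140 GB)
LINE = 800e9 / 8   # one 800G port = 100 GB/s

# Fabric running near line rate vs. RoCEv2 held to ~70-75% of line
# rate to keep PFC from triggering (per the AllReduce bullet above).
t_fast = ring_allreduce_time(1024, S, 0.95 * LINE)
t_roce = ring_allreduce_time(1024, S, 0.72 * LINE)
print(f"AllReduce wire time at 95% of line rate: {t_fast:.2f} s")
print(f"AllReduce wire time at 72% of line rate: {t_roce:.2f} s")
```

The gap between the two numbers is pure utilization headroom: the RoCEv2 cluster pays roughly a third more wire time per gradient sync not because its ports are slower, but because it cannot safely drive them as hard.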
7. Inference traffic: a different animal entirely
Inference has a fundamentally different traffic profile from training — and this changes the IB vs. Ethernet calculus significantly.
Training is dominated by AllReduce: high bandwidth, many-to-many, sustained, synchronous. This favors IB's congestion handling. Inference is dominated by: KV cache transfers (directed, point-to-point, medium bandwidth), MoE dispatch (bursty, many-to-few), and prefill-decode KV migration (large sequential transfers). These patterns are less congestion-intensive and more bandwidth-oriented.
For inference specifically, the IB latency advantage (0.6 µs vs. 1.2-1.8 µs MPI-level) matters for token latency when network transfers are in the critical path. KV cache fetches of small pages (the 73% under 128KB case from the PCIe Tax essay) are exactly where low latency matters most. IB provides a genuine latency advantage here that accumulates per-hop across inference critical paths.
However: at the scale most inference clusters operate (hundreds to a few thousand GPUs, not tens of thousands), RoCEv2 congestion events are rare enough that the operational simplicity advantage of Ethernet outweighs the latency penalty. The inflection point is approximately cluster size >2,000 GPUs running dense multi-node inference workloads with synchronous collective communications.
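The scale of the latency gap for inference can be put in numbers. The per-transfer latencies come from this essay; the count of sequential network operations per token and the inter-token budget are assumptions for illustration.

```python
# How much the per-transfer latency gap adds to token latency. The
# 0.6 µs vs ~1.5 µs figures are from the text; the number of
# sequential network ops per token and the 20 ms inter-token budget
# are illustrative assumptions.

IB_US, ROCE_US = 0.6, 1.5   # MPI-level latency per transfer
OPS_PER_TOKEN = 20          # example: sequential KV/dispatch transfers per token
TOKEN_BUDGET_US = 20_000    # example: ~20 ms inter-token latency budget

gap_us = (ROCE_US - IB_US) * OPS_PER_TOKEN
print(f"Added latency per token: {gap_us:.0f} µs "
      f"({100 * gap_us / TOKEN_BUDGET_US:.2f}% of budget)")
```

Under these assumptions the gap is tens of microseconds against a budget of tens of milliseconds, which is why the paragraph above concludes that operational simplicity usually wins for inference at modest scale; the calculus only flips when transfers multiply (deep critical paths, synchronous collectives) or budgets shrink.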
8. Operational reality: who can run IB and who should use Ethernet
InfiniBand clusters require specialized operational knowledge that is genuinely difficult to acquire. The IB subnet manager must be configured correctly for every topology change. Adaptive routing parameters must be tuned for the specific traffic pattern. Congestion isolation requires careful virtual lane assignment. Debugging performance problems requires IB-specific telemetry tools (perfquery, ibdiagnet, etc.) that have no Ethernet equivalents.
A misconfigured IB subnet manager can silently degrade performance by 40-60% without any obvious error messages. A misconfigured PFC on a RoCEv2 cluster can trigger cluster-wide pause storms that look like GPU slowdowns. Both technologies fail in silent, hard-to-diagnose ways — but IB requires more specialized knowledge to configure correctly in the first place. If your networking team is primarily IP/Ethernet engineers, starting with RoCEv2 and investing in DCQCN tuning is likely to produce better results than deploying IB without IB expertise.
The hyperscalers who run IB at scale (primarily NVIDIA-infrastructure-heavy shops) have invested heavily in IB operational expertise. Meta's decision to build their AI Research SuperCluster on RoCEv2 was partly a reflection of their Ethernet-first operational culture — they have deep IP networking expertise and chose to apply it to AI rather than build parallel IB expertise.
9. The actual decision framework
Given all of the above, here is the honest decision framework:
| Scenario | Recommended Fabric | Rationale |
|---|---|---|
| <1,000 GPU cluster, mixed training/inference | RoCEv2 / 400G-800G Ethernet | Congestion events rare; operational simplicity wins; lower cost |
| 1,000-10,000 GPU training cluster | InfiniBand NDR if team has IB expertise; RoCEv2 + careful DCQCN tuning otherwise | Congestion starts mattering; IB advantage real but requires expertise |
| >10,000 GPU frontier training | InfiniBand or UEC (when available) | AllReduce congestion at this scale makes DCQCN limitations costly |
| Inference-only cluster | RoCEv2 / 800G Ethernet | Traffic pattern less congestion-prone; latency gap acceptable for inference SLOs |
| MoE training at scale | InfiniBand | Expert dispatch incast handled significantly better |
| Hyperscaler with IP-first ops team | RoCEv2 | Operational fit trumps raw performance in 24/7 production environments |
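The table can be restated as a small function, purely as an illustrative encoding of the rows above; the thresholds and recommendation strings mirror the table and carry no information beyond it.

```python
# Illustrative encoding of the decision table above -- a restatement,
# not a tool. Thresholds and strings come directly from the table.

def recommend_fabric(gpus, workload, team_has_ib_expertise):
    if workload == "inference":
        return "RoCEv2 / 800G Ethernet"
    if workload == "moe_training":
        return "InfiniBand"
    if gpus < 1000:
        return "RoCEv2 / 400G-800G Ethernet"
    if gpus <= 10000:
        return ("InfiniBand NDR" if team_has_ib_expertise
                else "RoCEv2 + careful DCQCN tuning")
    return "InfiniBand or UEC (when available)"

assert recommend_fabric(512, "training", False).startswith("RoCEv2")
assert recommend_fabric(4096, "training", True) == "InfiniBand NDR"
assert recommend_fabric(50_000, "training", True).startswith("InfiniBand")
```

Note that operational fit appears as an explicit input: the same 4,096-GPU training cluster gets a different answer depending on whether the team can actually run IB.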
10. The convergence thesis and what it means for future clusters
The IB vs. Ethernet debate will eventually resolve — not because one technology wins, but because the gap is closing from both sides. InfiniBand is adopting more standard IP interfaces (GRH headers, IP addressing) to reduce operational friction. Ultra Ethernet is adopting IB-like congestion control mechanisms to close the performance gap. The future state is likely a convergence on a common set of transport semantics — RDMA, lossless, adaptive routing — implemented over a substrate that is administratively IP-compatible.
The Ultra Ethernet Consortium's work is the most concrete signal of this convergence. If UEC delivers on its technical specifications in production, it will eliminate most of the genuine performance advantages of InfiniBand while retaining Ethernet's operational accessibility. At that point, the IB vs. Ethernet decision becomes straightforward: Ethernet, at lower cost, with familiar tooling, and without vendor lock-in.
Until UEC is production-validated, the honest answer is: the decision depends on your cluster scale, traffic pattern, operational team expertise, and tolerance for the operational complexity of IB. There is no universal answer. Anyone who tells you otherwise is selling something.