
800G Ethernet vs. InfiniBand: The AI Scale-Out Decision Nobody Documents Honestly

MANISH AI · April 2026 · 20 min read · Connectivity Series

The InfiniBand vs. Ethernet debate has been running for a decade. Most of it has been vendor marketing. This essay cuts through to the actual physics, protocol, and operational tradeoffs — with specific numbers — to explain why the answer depends entirely on your traffic pattern, your congestion control requirements, and your operational team.

Key Numbers
µs-level: IB latency advantage over RoCEv2 at scale
30–40%: cost premium for IB vs. 800G Ethernet at the same port count
3: congestion control mechanisms IB provides that Ethernet lacks natively
~0: native multicast support in RDMA; both IB and RoCE handle it differently
Contents
  1. Why the debate is framed wrong
  2. InfiniBand architecture: what it actually provides
  3. RoCEv2: Ethernet's RDMA answer and its real constraints
  4. Congestion control: where the performance gap actually lives
  5. Ultra Ethernet Consortium: fixing RoCEv2's congestion problem
  6. AI training traffic patterns and which fabric handles them better
  7. Inference traffic: a different animal entirely
  8. Operational reality: who can run IB and who should use Ethernet
  9. The actual decision framework
  10. The convergence thesis and what it means for future clusters

1. Why the debate is framed wrong

The IB vs. Ethernet debate is usually framed as a bandwidth question: which technology provides more GB/s between GPUs? At 800G, both technologies are at parity on raw bandwidth per port. The question resolved itself at the physical layer years ago. Yet the debate persists because the bandwidth framing misses what actually matters.

The real question is: for the specific traffic patterns of AI training and inference, which fabric provides better tail latency, better congestion behavior under adversarial load, and lower operational cost at scale? The answer to each of these is different, and they point in different directions depending on your workload.

The framing this essay uses: Raw bandwidth is not the decision variable. Congestion control, transport semantics, and operational complexity are. Most teams making the IB vs. Ethernet decision are optimizing for bandwidth (which is at parity) instead of congestion behavior and operational fit (which differ substantially).

2. InfiniBand architecture: what it actually provides

InfiniBand is a complete network architecture — it includes the physical layer, data link layer, network layer, transport layer, and a rich set of services built on top. Understanding what each layer provides is necessary to understand where IB's advantages come from and which of them are genuinely hard to replicate.

Physical layer: InfiniBand NDR (Next Data Rate) provides 400 Gbps per port using 4× 100 Gbps lanes with PAM4 signaling. XDR (on the roadmap) doubles this to 800 Gbps. The physical layer is now at parity with 800G Ethernet — no intrinsic advantage either way.

Data link layer: InfiniBand uses a credit-based flow control scheme at the link level. Before sending data, a transmitter must have credits from the receiver indicating available buffer space. Credits are fine-grained (measured in 64-byte units). This makes InfiniBand links lossless by construction — packets are never dropped at the switch level because the switch only accepts a packet when it has confirmed the downstream link has buffer space. Losslessness is foundational to RDMA performance.
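The credit discipline can be sketched in a few lines. This is a toy model, not IB's actual implementation: the class and method names are invented, and real credits are managed per virtual lane in HCA and switch hardware.

```python
from collections import deque

class CreditedLink:
    """Toy model of IB-style credit-based link-level flow control.

    The receiver advertises credits in 64-byte units; the sender may only
    transmit a packet when it holds enough credits to cover it, so the
    receive buffer can never overflow (losslessness by construction).
    """

    CREDIT_UNIT = 64  # bytes per credit, as in InfiniBand

    def __init__(self, rx_buffer_bytes: int):
        self.credits = rx_buffer_bytes // self.CREDIT_UNIT
        self.rx_queue = deque()

    def try_send(self, packet_bytes: int) -> bool:
        """Sender side: transmit only if available credits cover the packet."""
        needed = -(-packet_bytes // self.CREDIT_UNIT)  # ceiling division
        if needed > self.credits:
            return False  # back-pressure: the sender stalls, nothing is dropped
        self.credits -= needed
        self.rx_queue.append(packet_bytes)
        return True

    def drain_one(self) -> None:
        """Receiver side: consuming a packet returns its credits upstream."""
        if self.rx_queue:
            packet_bytes = self.rx_queue.popleft()
            self.credits += -(-packet_bytes // self.CREDIT_UNIT)

link = CreditedLink(rx_buffer_bytes=4096)          # 64 credits of 64 B
sent = sum(link.try_send(1024) for _ in range(8))  # each send costs 16 credits
print(sent)                 # → 4: the other 4 sends stall instead of overflowing
link.drain_one()
print(link.try_send(1024))  # → True: credits returned, the sender resumes
```

The key property is visible in the last two lines: a stalled sender resumes only when the receiver has demonstrably freed buffer space, which is why no drop-and-retransmit path is ever exercised.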

Network layer: InfiniBand has its own routable network layer with 16-bit Local Identifiers (LIDs) for subnet routing and 128-bit Global Identifiers (GIDs) for inter-subnet routing. The subnet manager — an IB-specific control plane component — computes routing tables and distributes them to all switches. This is fundamentally different from IP routing and requires IB-specific tooling and expertise.

Transport layer: IB provides multiple transport types — Reliable Connected (RC), Unreliable Connected (UC), and Unreliable Datagram (UD) — each with different performance and semantic tradeoffs. RDMA operations (Read, Write, Send, Atomic) are implemented in hardware on the Host Channel Adapter (HCA). The hardware transport means that RDMA operations complete without CPU involvement in the data path.

2.1 What makes IB genuinely hard to replicate

Three IB properties are difficult to replicate with Ethernet:

End-to-end credit-based flow control. IB's credit system extends from sender through every switch hop to receiver. At no point in the network can a packet be dropped due to buffer overflow — the network back-pressures the sender before buffers fill. Ethernet's flow control (PFC — Priority Flow Control) operates per-hop and creates persistent pause frames that propagate backward through the fabric, causing head-of-line blocking and potentially deadlock.

Subnet manager and adaptive routing. IB's centralized subnet manager can observe the global traffic matrix and update routing tables in real-time to avoid hot links. Ethernet's distributed ECMP routing makes locally-optimal decisions that can globally create congestion. IB's adaptive routing (NVIDIA's SHIELD, Cornelis's ARK) reroutes individual packets based on observed per-port queue depths.

Hardware-level congestion notification. IB switches inject congestion notification packets into the data stream when they detect queue buildup. Senders receive these notifications and reduce their injection rate. This is a hardware-level closed-loop control system with ~microsecond reaction time. Ethernet's DCQCN (Data Center Quantized Congestion Notification) is a software-assisted approximation with higher latency in the control loop.

3. RoCEv2: Ethernet's RDMA answer and its real constraints

RoCEv2 (RDMA over Converged Ethernet v2) is the standard for running RDMA verbs over IP/Ethernet. It provides the same RDMA programming model as InfiniBand — the same verbs (Send/Receive, RDMA Write/Read, Atomic), the same memory registration model, the same queue pair abstraction — but runs over IP/UDP rather than IB transport.

This means that software written for IB RDMA can run on RoCEv2 with minimal changes. NCCL supports both. The MPI implementations support both. The application layer is largely portable. The difference is in what happens when traffic gets congested.

RoCEv2 vs. IB transport comparison
                    IB RC transport        RoCEv2 (UDP/IP)
Lossless guarantee  Yes (credit-based)     Requires PFC + ECN (lossy by default)
Congestion control  Hardware CC (SHIELD)   DCQCN (SW-assisted, higher latency)
Timeout/retry       Hardware retransmit    SW retransmit via ibverbs
Packet ordering     Guaranteed per QP      Guaranteed per flow (5-tuple)
Multipath           Adaptive routing       ECMP (per-flow, not per-packet)
Deadlock risk       Low (per-VL credits)   Yes under PFC with cyclic deps
MTU                 4096 bytes             Up to 9000 bytes (jumbo frames)
Address model       LID/GID                IP address
Subnet management   IB Subnet Manager      Standard IP routing
Ops complexity      High (IB-specific)     Moderate (IP tools work)

The losslessness issue is the core. RDMA over a lossy network is not just slower — it is fundamentally broken for many operations. RDMA Write requires that the receiver side has already posted a buffer for the data to land in. If a packet is lost in transit, the write operation fails silently — the data ends up in the wrong buffer or triggers a protection fault. The RDMA transport must therefore guarantee delivery. IB does this in hardware through credits. RoCEv2 does it through a combination of PFC (to make the network lossless) and retransmission (as a backup when PFC fails).

3.1 The PFC problem: pause frames and deadlock

PFC (Priority Flow Control) is Ethernet's mechanism for creating a lossless network. When a switch port's receive buffer fills, it sends a PAUSE frame to the upstream sender on that specific traffic class, stopping that sender for a specified duration. This prevents the switch from dropping packets.

The problem is that PAUSE frames propagate. If switch A pauses sender B, and sender B was also receiving data from sender C, then B's switch port toward C fills and B pauses C. The pause propagates backward through the fabric. In a network with cyclic dependencies — which is easy to create accidentally with multipath routing — a permanent deadlock can form where every flow is paused waiting for a flow that is also paused.

InfiniBand avoids this through its credit system: no packet is ever sent unless the downstream buffer has confirmed capacity. PFC-based systems react after the fact — they pause when buffer fills rather than preventing the fill. The reaction latency means that PFC is always operating in a mode where buffers are partially full, which creates baseline latency degradation even when deadlock does not occur.

4. Congestion control: where the performance gap actually lives

At 800G with no congestion, IB and RoCEv2 perform comparably. The performance gap opens under congestion — and AI training generates a lot of congestion. AllReduce operations across thousands of GPUs create incast patterns: many senders simultaneously targeting a small number of aggregation points. This is exactly the traffic pattern where congestion control quality matters most.

Incast congestion scenario — AllReduce at 1,024 GPUs
AllReduce ring-reduce phase:
  1,024 GPUs, each sending 1/1,024 of gradient tensor to next ring neighbor
  Simultaneously: multiple aggregation steps in progress
  → Multiple flows arriving at each switch port simultaneously
  → Switch port receive buffer fills in microseconds

IB response:
  Credit system: sender throttles before buffer fills
  Adaptive routing: hot ports detected, flows rerouted within microseconds
  Result: buffer occupancy stays low, latency increase ~10-20%

RoCEv2 / DCQCN response:
  ECN marks packets when queue exceeds threshold
  Marked packets trigger rate reduction at sender (via RoCEv2 CNP packets)
  Round-trip time for notification: ~2-5 µs
  During 2-5 µs RTT: packets continue arriving at full rate
  → Queue builds, PFC PAUSE triggered
  → PAUSE propagates backward: affects ALL flows on that port, not just congested one
  → Head-of-line blocking: non-congested flows paused alongside congested flows
  Result: p99 latency increase 3-5× during AllReduce phases

The 3-5× p99 latency increase during AllReduce is not a worst-case — it is a typical case in production clusters under heavy training load. This translates directly to GPU utilization: GPUs waiting for AllReduce to complete are idle. The tail latency in collective operations determines the effective utilization of the weakest link in the synchronization barrier.
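A toy discrete-time model makes the mechanism concrete: the same incast load produces a much deeper queue purely because the congestion feedback arrives later. Every constant here is illustrative, not a measurement, and the function is invented for this sketch.

```python
def peak_queue(n_senders: int, feedback_delay: int, ticks: int = 100) -> int:
    """Peak queue depth (in packets) at a congested incast port.

    Each tick, senders inject `rate` packets in aggregate and the port
    drains 4. Once the queue crosses the threshold, a congestion signal is
    emitted, but senders only cut their rate `feedback_delay` ticks later:
    0 models IB-style immediate credit back-pressure, larger values model
    DCQCN's CNP/ECN round trip. Numbers are illustrative, not calibrated.
    """
    drain, threshold = 4, 32
    queue = peak = 0
    rate = n_senders          # aggregate injection per tick
    signal_at = None          # tick at which congestion was detected
    for t in range(ticks):
        queue = max(0, queue + rate - drain)
        peak = max(peak, queue)
        if signal_at is None and queue > threshold:
            signal_at = t
        if signal_at is not None and t >= signal_at + feedback_delay:
            rate = drain      # senders throttled down to the drain rate
    return peak

print(peak_queue(16, feedback_delay=0))   # → 36: peak stays near the threshold
print(peak_queue(16, feedback_delay=10))  # → 156: ~4× overshoot during the RTT
```

During the feedback delay, packets keep arriving at the full aggregate rate, so queue depth (and hence latency, and the likelihood of triggering PFC) grows linearly with the delay. That is the entire performance gap in miniature.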

5. Ultra Ethernet Consortium: fixing RoCEv2's congestion problem

The Ultra Ethernet Consortium (UEC), formed in 2023 with participation from AMD, Broadcom, Cisco, Intel, Meta, and Microsoft, is developing a new transport standard designed to provide RDMA semantics over Ethernet with better congestion control than RoCEv2. The key technical improvements are:

Per-packet adaptive routing. Unlike ECMP (which assigns entire flows to paths), UEC specifies per-packet load balancing with out-of-order delivery tolerance. This allows traffic to use all available paths simultaneously rather than being hashed onto a subset.

Packet spraying. Packets from a single flow can be sent across multiple paths and reordered at the receiver. This eliminates the "unlucky hash" problem in ECMP where a flow consistently maps to a congested path.

Improved congestion control with sub-RTT reaction. UEC specifies switch-assisted congestion signals that are injected into the data path with lower latency than DCQCN's ECN-based mechanism, enabling faster reaction without PFC propagation.
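The difference between per-flow ECMP and per-packet spraying shows up in a few lines. This is a hypothetical sketch (path count and flow strings are invented), though UDP port 4791 is RoCEv2's real destination port.

```python
import hashlib

def ecmp_path(flow_5tuple: str, n_paths: int) -> int:
    """ECMP: a deterministic hash pins every packet of a flow to one path."""
    digest = hashlib.sha256(flow_5tuple.encode()).digest()
    return digest[0] % n_paths

def max_path_load(assignments: list, n_paths: int) -> int:
    """Packets landing on the most heavily loaded path."""
    return max(assignments.count(p) for p in range(n_paths))

n_paths, pkts_per_flow = 4, 1000
flows = [f"10.0.0.{i}:4791->10.0.1.1:4791" for i in range(8)]

# ECMP: all packets of a flow share its hashed path; unlucky hashes can
# pile several flows onto one path while other paths sit idle.
ecmp = [ecmp_path(f, n_paths) for f in flows for _ in range(pkts_per_flow)]

# Packet spraying (UEC-style): round-robin every packet over all paths;
# the receiver must tolerate reordering, but load is perfectly balanced.
spray = [i % n_paths for i in range(len(flows) * pkts_per_flow)]

print(max_path_load(spray, n_paths))  # → 2000 (8,000 packets over 4 paths)
print(max_path_load(ecmp, n_paths))   # ≥ 2000 by pigeonhole; often higher
```

Spraying achieves the 8,000/4 = 2,000-packet lower bound by construction, while ECMP's worst path carries whole multiples of a flow. The price is the out-of-order delivery tolerance the UEC transport has to specify.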

Property             Standard RoCEv2       Ultra Ethernet (UEC)       InfiniBand NDR
Lossless mechanism   PFC + ECN             UEC CC (no PFC needed)     Credit-based (no loss)
Multipath            ECMP per-flow         Per-packet spray           Adaptive per-packet (SHIELD)
Congestion reaction  ~RTT (µs)             Sub-RTT (switch-assisted)  Hardware (~0.1 µs)
Deadlock risk        Yes (PFC)             Low (no PFC)               Very low (credits)
Ecosystem            Multi-vendor, mature  Emerging (2025-2026)       NVIDIA-dominated
Ops complexity       Moderate              Moderate                   High

UEC is promising but has not been widely deployed in production AI clusters as of early 2026. The implementations are arriving in silicon from Broadcom (Thor-2) and Marvell (Octeon 10), but the ecosystem of drivers, NCCL backends, and operational tooling is still maturing. The teams making cluster fabric decisions today are choosing between proven RoCEv2 (with known congestion issues) and proven IB (with known operational complexity), not between either of those and a not-yet-validated UEC deployment.

6. AI training traffic patterns and which fabric handles them better

AI training generates four distinct traffic patterns that stress the network differently:

AllReduce (data parallel gradient sync): Many-to-many, high bandwidth, sustained. IB's adaptive routing and hardware congestion control handle the incast patterns better. RoCEv2 with well-tuned DCQCN is adequate but requires careful configuration and more conservative bandwidth utilization (typically 70-75% of line rate to keep PFC from triggering).

Pipeline parallel activation passing: Directed point-to-point flows between adjacent pipeline stages. Both IB and RoCEv2 handle this well — congestion is minimal because traffic patterns are regular and predictable. No meaningful advantage for either.

Expert dispatch (MoE): Many-to-few, bursty, with hot destinations. IB's adaptive routing reroutes flows away from hot expert GPUs more quickly. RoCEv2 ECMP creates persistent hot paths. This is a measurable IB advantage for large MoE training.

Checkpoint I/O: Periodic bulk transfers from GPUs to storage. Standard TCP/IP or RDMA to storage targets. Both IB and Ethernet work. Not a differentiator.
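To see why AllReduce dominates the bandwidth budget among these patterns, a back-of-envelope ring AllReduce calculation helps. The model size, GPU count, and 72% effective rate below are illustrative assumptions (the 72% is drawn from the 70-75% DCQCN head-room figure mentioned above), not numbers from any specific cluster.

```python
def ring_allreduce_bytes_per_gpu(tensor_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends on the wire for one ring AllReduce.

    Ring AllReduce = reduce-scatter + all-gather, each phase moving
    (n-1)/n of the tensor through every GPU's NIC.
    """
    return 2 * (n_gpus - 1) / n_gpus * tensor_bytes

# Illustrative: gradients of a 70B-parameter model in bf16 (~140 GB),
# reduced across 1,024 GPUs in a single ring.
wire_bytes = ring_allreduce_bytes_per_gpu(140e9, 1024)
print(f"{wire_bytes / 1e9:.1f} GB per GPU per sync")    # → 279.7 GB

# At 800 Gbps (100 GB/s) line rate vs. a 72% effective rate under
# DCQCN head-room tuning, the wire time per gradient sync:
print(f"{wire_bytes / 100e9:.2f} s at line rate")       # → 2.80 s
print(f"{wire_bytes / 72e9:.2f} s at 72% utilization")  # → 3.89 s
```

Two things fall out of the arithmetic: each GPU moves roughly 2× the tensor size regardless of ring length, and the 25-30% bandwidth head-room RoCEv2 needs to keep PFC quiet translates directly into longer synchronization barriers, which is idle GPU time.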

7. Inference traffic: a different animal entirely

Inference has a fundamentally different traffic profile from training — and this changes the IB vs. Ethernet calculus significantly.

Training is dominated by AllReduce: high bandwidth, many-to-many, sustained, synchronous. This favors IB's congestion handling. Inference is dominated by: KV cache transfers (directed, point-to-point, medium bandwidth), MoE dispatch (bursty, many-to-few), and prefill-decode KV migration (large sequential transfers). These patterns are less congestion-intensive and more bandwidth-oriented.

For inference specifically, the IB latency advantage (0.6 µs vs. 1.2-1.8 µs MPI-level) matters for token latency when network transfers are in the critical path. KV cache fetches of small pages (the 73% under 128KB case from the PCIe Tax essay) are exactly where low latency matters most. IB provides a genuine latency advantage here that accumulates per-hop across inference critical paths.

However: at the scale most inference clusters operate (hundreds to a few thousand GPUs, not tens of thousands), RoCEv2 congestion events are rare enough that the operational simplicity advantage of Ethernet outweighs the latency penalty. The inflection point is approximately cluster size >2,000 GPUs running dense multi-node inference workloads with synchronous collective communications.

8. Operational reality: who can run IB and who should use Ethernet

InfiniBand clusters require specialized operational knowledge that is genuinely difficult to acquire. The IB subnet manager must be configured correctly for every topology change. Adaptive routing parameters must be tuned for the specific traffic pattern. Congestion isolation requires careful virtual lane assignment. Debugging performance problems requires IB-specific telemetry tools (perfquery, ibdiagnet, etc.) that have no Ethernet equivalents.

Operational Reality Check

A misconfigured IB subnet manager can silently degrade performance by 40-60% without any obvious error messages. A misconfigured PFC on a RoCEv2 cluster can trigger cluster-wide pause storms that look like GPU slowdowns. Both technologies fail in silent, hard-to-diagnose ways — but IB requires more specialized knowledge to configure correctly in the first place. If your networking team is primarily IP/Ethernet engineers, starting with RoCEv2 and investing in DCQCN tuning is likely to produce better results than deploying IB without IB expertise.

The hyperscalers who run IB at scale (primarily NVIDIA-infrastructure-heavy shops) have invested heavily in IB operational expertise. Meta's decision to build large-scale training clusters on RoCEv2 was partly a reflection of its Ethernet-first operational culture — the company has deep IP networking expertise and chose to apply it to AI rather than build parallel IB expertise.

9. The actual decision framework

Given all of the above, here is the honest decision framework:

Scenario → recommended fabric, and why:

<1,000 GPU cluster, mixed training/inference → RoCEv2 / 400G-800G Ethernet. Congestion events are rare; operational simplicity wins; lower cost.

1,000-10,000 GPU training cluster → InfiniBand NDR if the team has IB expertise; otherwise RoCEv2 with careful DCQCN tuning. Congestion starts to matter; the IB advantage is real but requires expertise.

>10,000 GPU frontier training → InfiniBand, or UEC when available. AllReduce congestion at this scale makes DCQCN's limitations costly.

Inference-only cluster → RoCEv2 / 800G Ethernet. The traffic pattern is less congestion-prone; the latency gap is acceptable for inference SLOs.

MoE training at scale → InfiniBand. Expert-dispatch incast is handled significantly better.

Hyperscaler with an IP-first ops team → RoCEv2. Operational fit trumps raw performance in 24/7 production environments.

10. The convergence thesis and what it means for future clusters

The IB vs. Ethernet debate will eventually resolve — not because one technology wins, but because the gap is closing from both sides. InfiniBand is adopting more standard IP interfaces (GRH headers, IP addressing) to reduce operational friction. Ultra Ethernet is adopting IB-like congestion control mechanisms to close the performance gap. The future state is likely a convergence on a common set of transport semantics — RDMA, lossless, adaptive routing — implemented over a substrate that is administratively IP-compatible.

The Ultra Ethernet Consortium's work is the most concrete signal of this convergence. If UEC delivers on its technical specifications in production, it will eliminate most of the genuine performance advantages of InfiniBand while retaining Ethernet's operational accessibility. At that point, the IB vs. Ethernet decision becomes straightforward: Ethernet, at lower cost, with familiar tooling, and without vendor lock-in.

Until UEC is production-validated, the honest answer is: the decision depends on your cluster scale, traffic pattern, operational team expertise, and tolerance for the operational complexity of IB. There is no universal answer. Anyone who tells you otherwise is selling something.