NVLink Switch Is Not NVLink: The Scale-Up Fabric Architecture Nobody Fully Explains
Everyone knows NVLink connects GPUs fast. Almost nobody understands what NVLink Switch actually is, how it differs architecturally from NVLink itself, why it changes the failure domain and the topology simultaneously, and what GB300 NVL72's all-to-all non-blocking fabric means at the physics level.
- 1.8 TB/s bidirectional per-GPU bandwidth across NVLink Switch fabric in GB300 NVL72
- 130 TB/s aggregate NVLink bandwidth in a fully loaded NVL72 rack (72 × 1.8 TB/s) — non-blocking
- 2 hops maximum between any two GPUs in a 72-GPU NVLink Switch domain
- 14× the bandwidth of PCIe Gen5 x16 for GPU-to-GPU transfers within the same domain
- NVLink Switch runs the NVLink protocol over dedicated switching silicon — it is not the same chip as the GPU's NVLink ports
- The NVLink naming confusion and why it matters
- What NVLink actually is at the port level
- The NVLink Switch chip: a dedicated switching ASIC
- How NVL72 builds a non-blocking all-to-all fabric from switch ASICs
- The bandwidth math: where 1.8 TB/s comes from
- NVLink Switch vs. InfiniBand for scale-up: not the same problem
- Failure domains and what non-blocking means under fault
- NCCL over NVLink Switch: how the software layer sees the fabric
- The scale-up / scale-out boundary: where NVLink ends and the problem begins
- What NVLink Switch teaches us about the future of AI interconnect
1. The NVLink naming confusion and why it matters
NVIDIA uses the word "NVLink" to describe three different things: the point-to-point serial link protocol between GPU chips, the NVLink Switch ASIC that forwards NVLink protocol frames between multiple GPUs, and the NVLink Switch System that comprises the full rack-scale fabric built from those ASICs. These are architecturally distinct components. Conflating them produces incorrect mental models of how the fabric actually behaves.
The confusion has real consequences. Engineers who think of NVLink as "the fast wire between two GPUs" misunderstand the latency profile of the switch domain (there are forwarding hops with non-zero latency). Engineers who think NVLink Switch is "just a bigger crossbar" misunderstand the failure mode taxonomy (switch ASIC failure vs. link failure vs. GPU failure produce different blast radii). Engineers who think the NVLink domain is "basically one big GPU" misunderstand the software model (NCCL still needs topology hints to generate optimal collective communication plans).
The framing this essay uses: NVLink is a protocol and physical link standard. NVLink Switch is a dedicated forwarding ASIC that implements that protocol in a switching context. NVLink Switch System (NVL72) is a topology built from those ASICs. Each level has distinct engineering tradeoffs and failure modes. Understanding all three is required to reason correctly about scale-up fabric design.
2. What NVLink actually is at the port level
NVLink at the physical link level is a high-speed differential serial link designed specifically for GPU-to-GPU and GPU-to-switch communication. It uses a point-to-point signaling scheme where each "link" is actually a bundle of lanes running at very high symbol rates. Each NVLink 4.0 link (Hopper generation) provides 50 GB/s of bidirectional bandwidth (25 GB/s in each direction), and each H100 GPU has 18 such links for a total of 900 GB/s bidirectional aggregate NVLink bandwidth per GPU.
NVLink differs from PCIe in several fundamental ways. First, it is a coherent interconnect — it implements a cache-coherent protocol that allows GPU memories to participate in a shared address space. A GPU can read from another GPU's HBM using load instructions without explicit DMA operations. Second, it uses a credit-based flow control mechanism operating at very fine granularity (32-byte flit level), which gives it near-line-rate utilization even for small transfers — unlike PCIe, which has high per-transfer overhead that dominates for small messages. Third, it operates at substantially lower latency: NVLink 4.0 link latency is approximately 0.5 µs port-to-port, versus 1.5-2.5 µs for PCIe Gen5 over comparable distances.
| Physical layer comparison | NVLink 4.0 | PCIe Gen5 x16 |
|---|---|---|
| Per-link bandwidth | 50 GB/s bidir (25 GB/s per direction) | ~63 GB/s unidir (128b/130b encoded) |
| Links per H100 | 18 | 1 slot (x16) |
| Total GPU BW | 900 GB/s bidir | ~55 GB/s effective unidir |
| Coherency | Yes (shared addr) | No (DMA only) |
| Small-msg overhead | ~32B flit granule | TLP header (20-32B) + per-xfer setup |
| Port-to-port latency | ~0.5 µs | ~1.5-2.5 µs |
| Flow control | Credit-based, 32B flit granularity | Credit-based, TLP granularity (coarse) |
| Protocol | Proprietary (NVIDIA) | PCIe standard (open) |
The coherence property is architecturally significant. It means that within an NVLink domain, a GPU can issue a load to an address that physically resides in another GPU's HBM and receive the data without the source GPU's CPU or software involvement. This is the mechanism that allows NVL72 to present the illusion of a large unified memory space to software running across all 72 GPUs — not a perfect illusion (remote HBM access is slower than local), but a real one that simplifies programming models for certain workloads.
3. The NVLink Switch chip: a dedicated switching ASIC
The NVLink Switch chip (codenamed "Laguna" in the Blackwell generation) is a dedicated switching ASIC — entirely separate from the GPU die. It implements the NVLink protocol in a switching context: it receives NVLink frames on input ports, inspects the destination address encoded in the frame header, and forwards the frame to the appropriate output port.
This is not a GPU. It has no compute, no HBM, no tensor cores. It is a packet-forwarding engine that happens to run the NVLink protocol. Its job is to provide full-bandwidth, low-latency forwarding between any pair of NVLink ports connected to it, simultaneously, without head-of-line blocking.
The Hopper-generation NVLink Switch chip exposes 64 NVLink 4.0 ports; the Blackwell-generation part raises this to 72 NVLink 5.0 ports. The switch fabric inside the chip is a crossbar that provides full bisection bandwidth — any input port can send to any output port at full line rate simultaneously. The key figure of merit is forwarding latency: the switch introduces approximately 90 nanoseconds versus a direct NVLink connection, which at 0.5 µs base link latency is a ~18% increase — acceptable for the bandwidth gain.
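The latency arithmetic can be written out as a tiny model. This is a sketch using the approximate figures quoted in this section (assumptions, not measurements):

```python
# Toy latency model for NVLink Switch paths, using the approximate
# figures quoted above (assumptions, not measured values).
BASE_LINK_LATENCY_NS = 500  # ~0.5 us direct port-to-port NVLink latency
SWITCH_FORWARD_NS = 90      # ~90 ns added per switch ASIC traversed

def path_latency_ns(switch_hops: int) -> int:
    """End-to-end latency for a path crossing `switch_hops` switch ASICs."""
    return BASE_LINK_LATENCY_NS + switch_hops * SWITCH_FORWARD_NS

# One switch hop: 590 ns, an ~18% increase over a direct link.
# Two hops (cross-group in NVL72): 680 ns.
```

The model makes the tradeoff concrete: even the worst-case two-hop path stays well under 1 µs, which is why the fabric is treated as a single latency class by most software.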
3.1 Why a dedicated ASIC and not a PCIe switch?
The decision to build a dedicated NVLink switching ASIC rather than using PCIe switches (which are cheaper and commodity) was driven by three requirements that PCIe switches cannot satisfy simultaneously:
Coherence maintenance at switch speed. NVLink's cache-coherence protocol requires the switch to participate in coherence transactions — it cannot simply forward packets blindly. A PCIe switch has no coherence logic and cannot be made to forward coherence probes correctly without CPU involvement.
No host CPU involvement. PCIe peer-to-peer traffic is constrained by the root complex: with ACS enabled, peer-to-peer requests are redirected through the host's root ports for access control, and the control plane runs through host software regardless. NVLink Switch routes entirely in the data plane without CPU involvement, which is why GPU-to-GPU transfers over NVLink have zero CPU overhead.
Fine-grained flow control. NVLink's 32-byte flit-level flow control must be maintained end-to-end across the switch. PCIe's coarser flow control model (TLP-granular) cannot be retrofitted into NVLink without breaking the small-message latency properties.
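The small-message argument above can be illustrated with a toy efficiency model. The header and granule sizes here are illustrative placeholders, not published protocol constants:

```python
import math

# Toy model of why transfer granularity matters for small messages.
# Header/granule sizes are illustrative placeholders, not published
# protocol constants.
def framed_efficiency(payload_bytes: int, header_bytes: int) -> float:
    """Payload fraction when each transfer carries a fixed per-transfer header."""
    return payload_bytes / (payload_bytes + header_bytes)

def flit_padded_bytes(payload_bytes: int, flit_bytes: int = 32) -> int:
    """Wire bytes occupied when a payload is rounded up to whole flits."""
    return math.ceil(payload_bytes / flit_bytes) * flit_bytes

# A 32-byte message behind a 24-byte TLP-style header is ~57% efficient,
# while the header cost is negligible for 4 KiB transfers; a flit-granular
# fabric pads the same 32-byte message to exactly one 32-byte flit.
```

The point of the sketch: fixed per-transfer overhead dominates small messages, so a fabric whose flow-control and framing unit is a 32-byte flit keeps small transfers near line rate where a TLP-framed protocol cannot.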
4. How NVL72 builds a non-blocking fabric from switch ASICs
GB300 NVL72 connects 72 Blackwell-generation GPUs using a two-tier switch topology built from NVLink Switch ASICs. Understanding why it is non-blocking requires understanding the specific wiring pattern.
NVL72 topology structure (GB300)
72 GPUs × 18 NVLink ports each = 1,296 total GPU NVLink ports
Simplified logical wiring (each GPU's 18 physical links are treated here as one aggregate connection to its leaf switch):
72 GPUs are organized into 9 groups of 8 GPUs
Each group of 8 GPUs connects to one "leaf" NVLink Switch
→ 9 leaf switches, each with 8 "GPU-facing" ports
Each leaf switch also connects to each of the other 8 leaf switches
via dedicated inter-switch links (the "spine")
→ 8 inter-switch ports per leaf switch
A GPU-to-GPU message within the same group: 1 hop (leaf switch only)
A GPU-to-GPU message across groups: 2 hops (source leaf → dest leaf)
Non-blocking guarantee:
Each leaf switch has 8 GPU-facing ports at X GB/s each
Each leaf switch has 8 inter-switch ports at X GB/s each
→ Inter-switch capacity = GPU-facing capacity (1:1, no oversubscription)
→ Any traffic matrix is achievable: TRUE non-blocking
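The wiring above can be encoded as a small executable model. This is a sketch of the simplified 9×8 logical topology used in this section, not the physical cabling:

```python
# Sketch of the simplified 9x8 logical NVL72 model described above:
# group assignment, hop counting, and the 1:1 non-blocking check.
GROUP_SIZE = 8
NUM_GPUS = 72

def leaf_of(gpu: int) -> int:
    """Index of the leaf switch serving this GPU."""
    return gpu // GROUP_SIZE

def hops(src: int, dst: int) -> int:
    """Switch hops between two GPUs: 1 within a leaf group, 2 across the spine."""
    return 1 if leaf_of(src) == leaf_of(dst) else 2

def is_non_blocking(gpu_facing_ports: int, inter_switch_ports: int) -> bool:
    """Non-blocking requires inter-switch capacity >= GPU-facing capacity (1:1)."""
    return inter_switch_ports >= gpu_facing_ports
```

With 8 GPU-facing and 8 inter-switch ports per leaf, `is_non_blocking(8, 8)` holds, GPUs 0 and 7 are one hop apart, and GPUs 0 and 8 are exactly two.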
The non-blocking guarantee is not marketing language — it is a mathematical property of the topology. When inter-switch bandwidth exactly equals GPU-facing bandwidth (1:1 ratio), any permutation traffic matrix — any combination of which GPU is sending to which other GPU — can be satisfied simultaneously at full line rate. This is the formal definition of non-blocking.
For comparison: most data-center Fat-tree deployments use 2:1 or 4:1 oversubscription at the aggregation layer, meaning half to one-quarter of the theoretical bisection bandwidth is actually available under adversarial traffic. NVL72's non-blocking guarantee means that all-reduce operations across all 72 GPUs proceed at the full line rate of the lowest-bandwidth GPU link — there is no "unlucky" traffic pattern that degrades collective performance.
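The oversubscription comparison is simple enough to state as a one-line worst-case formula (a sketch of the arithmetic in the paragraph above):

```python
# Worst-case bandwidth fraction under N:1 oversubscription at the
# aggregation layer, assuming adversarial all-cross-boundary traffic.
def worst_case_bisection_fraction(oversubscription_ratio: float) -> float:
    """Fraction of theoretical bisection bandwidth still available."""
    return 1.0 / oversubscription_ratio
```

A 2:1 fat-tree leaves half the bisection bandwidth under adversarial traffic, 4:1 leaves a quarter, and NVL72's 1:1 fabric leaves all of it.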
5. The bandwidth math: where 1.8 TB/s per GPU comes from
Per-GPU bandwidth derivation — GB300 NVL72
B200 GPU NVLink configuration:
NVLink 5.0 per link: ~100 GB/s bidirectional (vs 50 GB/s for NVLink 4.0)
Links per B200: 18
Total GPU NVLink BW: 18 × 100 GB/s = 1,800 GB/s = 1.8 TB/s bidirectional
Within same 8-GPU leaf group (1-hop):
Available bandwidth: limited by GPU NVLink port count and leaf switch capacity
Effective: ~1.8 TB/s bidirectional to any peer in the group
Cross-group (2-hop):
Bandwidth: limited by inter-switch link capacity
In non-blocking config: same 1.8 TB/s bidirectional to any GPU in fabric
Total fabric aggregate:
72 GPUs × 1.8 TB/s = 129.6 TB/s ≈ 130 TB/s (per-GPU bidirectional BW summed)
Counting each link once: 129.6 / 2 = 64.8 TB/s
Bisection bandwidth: ~32.4 TB/s (half of the once-counted aggregate)
Compare to: H100 SXM5 NVL8 (8-GPU NVSwitch domain)
NVLink 4.0: 18 × 50 GB/s = 900 GB/s per GPU
NVL72 is 2× per-GPU BW and 9× larger GPU domain
The 1.8 TB/s number needs context: HBM3 on H100 provides 3.35 TB/s of memory bandwidth, so NVLink bandwidth in NVL72 is roughly 54% of H100-class local HBM bandwidth. This ratio — NVLink bandwidth / HBM bandwidth — is the key parameter for distributed computation: if it is close to 1, you can communicate between GPUs nearly as fast as you can read from local memory, which means communication is not the binding constraint for well-designed parallel algorithms. Reaching 54% of HBM bandwidth over a switched fabric is a significant engineering achievement.
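The derivation in this section can be checked as plain arithmetic, using the figures as quoted above:

```python
# The section's bandwidth derivation as executable arithmetic.
NVLINK5_LINK_GBS = 100  # ~GB/s bidirectional per NVLink 5.0 link
LINKS_PER_GPU = 18
NUM_GPUS = 72

per_gpu_tbs = NVLINK5_LINK_GBS * LINKS_PER_GPU / 1000  # 1.8 TB/s per GPU
aggregate_tbs = per_gpu_tbs * NUM_GPUS                 # 129.6 TB/s ("130 TB/s")
counted_once_tbs = aggregate_tbs / 2                   # 64.8 TB/s, each link once
hbm_ratio = per_gpu_tbs / 3.35                         # ~0.54 vs H100 HBM3
```

The two headline numbers — 130 TB/s and 64.8 TB/s — are the same fabric counted two different ways: per-GPU bidirectional bandwidth summed at every endpoint, versus each link counted once.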
6. NVLink Switch vs. InfiniBand for scale-up: not the same problem
A common question is: why not just use InfiniBand 400G/800G for scale-up between GPUs? It is cheaper, available from multiple vendors, and well understood. The answer reveals important architectural constraints.
| Property | NVLink Switch (NVL72) | InfiniBand NDR/XDR |
|---|---|---|
| Cache coherence | Yes — GPU load/store to remote HBM | No — RDMA only, no coherence |
| CPU involvement | Zero for data plane | Zero for data plane (RDMA); no coherence to maintain |
| Bandwidth per GPU | 1.8 TB/s (NVL72) | 50-100 GB/s (1-2 ports × NDR 400 Gb/s) |
| Latency | ~0.6 µs end-to-end (1 hop) | ~1.5-2 µs MPI-level |
| Scale | 72 GPUs max per domain | Thousands of nodes |
| Protocol | Proprietary NVLink | Open InfiniBand / RoCEv2 |
| Primary use | Scale-up: tensor parallel within a job | Scale-out: data parallel across nodes |
InfiniBand and NVLink Switch solve fundamentally different problems. NVLink Switch provides the tight, coherent, very-high-bandwidth fabric needed for tensor parallelism — where a single transformer layer's computation is split across multiple GPUs and requires continuous high-bandwidth exchange of activation tensors between every layer. InfiniBand provides the broader, lower-bandwidth fabric needed for pipeline and data parallelism — where gradient updates or model shards need to be synchronized across hundreds or thousands of GPUs.
The scale ceiling of NVLink Switch (72 GPUs in current generation) is not a limitation — it is a design choice. Beyond 72 GPUs, the cross-switch link count grows as O(N²), making the all-to-all non-blocking fabric prohibitively expensive. At 72 GPUs, the fabric cost is justified by the use cases (a 405B parameter model fits in 72 B200 GPUs with tensor parallelism). At 720 GPUs, non-blocking all-to-all NVLink would require 100× more switch silicon — InfiniBand is the right tool at that scale.
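The quadratic scaling claim can be made concrete under the simplified 8-GPUs-per-leaf model this essay uses (a sketch, not the physical link budget):

```python
# Why full-mesh inter-switch wiring scales quadratically, under the
# simplified 8-GPUs-per-leaf model used in this essay.
def leaf_mesh_links(num_gpus: int, group_size: int = 8) -> int:
    """Leaf-pair links needed for a full mesh among leaf switches."""
    leaves = num_gpus // group_size
    return leaves * (leaves - 1) // 2
```

`leaf_mesh_links(72)` is 36 leaf-pair links for 9 leaves; `leaf_mesh_links(720)` is 4,005 for 90 leaves — on the order of 100× more inter-switch wiring for 10× the GPUs, which is the essay's point about where non-blocking all-to-all stops being affordable.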
7. Failure domains and what non-blocking means under fault
Non-blocking is a bandwidth guarantee under normal operation. What happens when components fail is a separate question that the "non-blocking" label does not answer.
In NVL72, there are three distinct failure modes with different blast radii:
Single NVLink port failure (GPU side). A GPU has 18 NVLink ports. If one port fails, that GPU loses 1/18 of its NVLink bandwidth. The fabric degrades gracefully — 17/18 of the bandwidth remains available, and the GPU itself stays usable. At the next NCCL initialization, topology discovery will see the reduced connectivity and generate a non-optimal but functional communication plan.
Single NVLink Switch chip failure. One leaf switch serves 8 GPUs. If that switch fails, those 8 GPUs lose all NVLink connectivity — they are isolated from the rest of the fabric. The job running across the NVL72 domain will stall or fail, depending on the checkpoint strategy. This is an 8-GPU blast radius: 11% of the domain.
Inter-switch link failure. A failure in the inter-switch links between leaf switches does not isolate any GPU but breaks the non-blocking guarantee. The fabric becomes partially oversubscribed for cross-group traffic. Jobs that require uniform all-to-all bandwidth will run slower; jobs with locality-aware collective communication plans may be unaffected.
NVL72 does not provide the same failure independence as a distributed InfiniBand cluster. In an IB cluster, a single switch failure isolates a leaf group but the rest of the fabric routes around it. In NVL72, the tight coupling of the non-blocking fabric means that a leaf switch failure forces a job restart across the entire domain. For long-running training jobs on NVL72, checkpoint frequency is determined by switch MTBF, not GPU MTBF.
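The "checkpoint frequency is determined by switch MTBF" observation can be quantified with Young's first-order approximation for the optimal checkpoint interval — a standard rule of thumb, not something from this essay, and the input numbers below are hypothetical:

```python
import math

# Young's first-order approximation of the optimal checkpoint interval:
# interval ~ sqrt(2 * checkpoint_cost * MTBF). A standard rule of thumb;
# all numeric inputs below are hypothetical.
def young_interval_hours(mtbf_hours: float, checkpoint_cost_hours: float) -> float:
    """Optimal checkpoint interval in hours under Young's approximation."""
    return math.sqrt(2 * checkpoint_cost_hours * mtbf_hours)

# Hypothetical: a domain-level MTBF of 1,000 hours (driven by the leaf
# switches, not the GPUs) and a 3-minute (0.05 h) checkpoint write
# suggest checkpointing roughly every 10 hours.
```

The operational takeaway matches the text: when a single leaf-switch failure restarts the whole domain, the MTBF that feeds this formula is the switch fleet's, not any individual GPU's.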
8. NCCL over NVLink Switch: how the software layer sees the fabric
NCCL (NVIDIA Collective Communications Library) is the primary software layer that uses NVLink Switch. Understanding how NCCL maps collective communication algorithms onto the NVLink topology explains why topology-awareness in the software matters even for a "non-blocking" fabric.
NCCL performs topology discovery at initialization using NVIDIA's topology detection APIs, which walk the system's device tree to identify which GPUs are connected through which switches. For NVL72, NCCL discovers the two-tier topology and uses it to generate ring or tree-based collective algorithms that respect the topology.
NCCL topology awareness — NVL72 all-reduce pattern
Naive all-reduce (topology-unaware):
Ring order: GPU0 → GPU1 → GPU2 → ... → GPU71 → GPU0
Problem: many steps cross leaf-switch boundaries (2-hop paths)
Effective BW: limited by inter-switch links utilized in arbitrary order
NCCL topology-aware all-reduce:
Phase 1: Intra-group reduce-scatter (within each 8-GPU leaf group)
All traffic is 1-hop (within leaf switch)
9 groups execute in parallel → 9× parallelism
Phase 2: Inter-group all-reduce (one representative per group)
Only 9 GPUs participate; traffic crosses leaf-switch boundaries
Small volume (1/8 of original) × inter-switch BW
Phase 3: Intra-group all-gather (broadcast result within each group)
Again all 1-hop, 9 groups parallel
Result: 7/8 of traffic never crosses leaf-switch boundary
Inter-switch links carry only 1/8 of total data
Effective BW: ~15% higher than naive ring for 72-GPU all-reduce
The 15% improvement from topology-aware collectives is not dramatic, but it compounds with scale. The broader lesson is that non-blocking hardware topology does not eliminate the value of topology-aware software — it changes the optimization target from "avoid congestion" to "maximize locality and parallelism."
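The three-phase traffic accounting above reduces to a short function. This is a simplified per-GPU split that ignores ring constant factors — a sketch, not NCCL's actual cost model:

```python
# Simplified per-GPU traffic split for the three-phase hierarchical
# all-reduce described above (ignores ring constant factors; a sketch,
# not NCCL's actual cost model).
def hierarchical_traffic_split(data_bytes: float, group_size: int = 8):
    """Return (intra_leaf_bytes, inter_leaf_bytes) per GPU."""
    intra = data_bytes * (group_size - 1) / group_size  # phases 1 and 3 stay on-leaf
    inter = data_bytes / group_size                     # only the reduced shard crosses
    return intra, inter
```

For any payload, 7/8 of the bytes stay behind the leaf switch and 1/8 crosses the spine — matching the accounting in the diagram.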
9. The scale-up / scale-out boundary: where NVLink ends and the problem begins
NVL72 is a scale-up fabric: it allows a single model instance to use up to 72 GPUs as if they were a single tightly coupled compute unit. Beyond 72 GPUs — for data parallelism at scale, pipeline parallelism across many stages, or multi-job multi-tenant clusters — the fabric transitions to InfiniBand or Ethernet. This boundary is where the most interesting engineering problems live.
The transition point between scale-up (NVLink) and scale-out (IB/Ethernet) is not just a bandwidth discontinuity. It is a coherence boundary. Within the NVLink domain, GPU memories participate in a shared coherent address space. Across the IB boundary, GPUs are isolated memory islands that communicate through explicit RDMA operations. Software must be aware of this boundary: operations that cross it require explicit buffers, explicit synchronization, and explicit data copies. Operations that stay within it can use simpler load/store semantics.
For production inference clusters running disaggregated prefill-decode serving, this boundary creates a design choice: should prefill and decode pools for a single model be within the same NVL72 domain (tight coupling, high bandwidth, shared failure domain) or in different NVL72 domains connected by IB (loose coupling, lower bandwidth, independent failure)? The answer depends on the model size relative to 72 GPUs and the operator's tolerance for correlated failure.
10. What NVLink Switch teaches us about the future of AI interconnect
NVLink Switch is a bet on a specific architectural philosophy: the most important communication for AI workloads happens within a tight group of GPUs running a single model, and that communication should have a dedicated, proprietary, high-bandwidth, coherent fabric that is architecturally distinct from the general-purpose cluster network.
The counterargument — that open standards (InfiniBand, Ultra Ethernet) will converge on the same bandwidth over time and replace NVLink — misses the coherence dimension. InfiniBand provides high bandwidth but not cache coherence. Coherence requires the interconnect to participate in the GPU's memory protocol, which means it cannot be a commodity component designed independently of the GPU. As long as cache coherence provides programming model advantages for tensor-parallel AI workloads, NVLink Switch's proprietary nature is a feature, not a lock-in risk.
The trajectory is clear: scale-up fabrics will get tighter (more GPUs per non-blocking domain, higher bandwidth per GPU), and scale-out fabrics will get broader (more nodes, higher total bandwidth, lower cost per bit). The boundary between them — and how software manages the coherence discontinuity at that boundary — is where the next generation of AI infrastructure systems problems will be solved.