Two philosophies of interconnect
The GPU datacenter and the TPU pod solve the same problem — connecting thousands of accelerators with enough bandwidth for all-reduce, all-gather, and point-to-point transfers — but with fundamentally different networking primitives.
| Property | GPU Cluster (NVIDIA) | TPU Pod (Google) |
|---|---|---|
| Intra-node | NVLink 4.0 (900 GB/s per GPU on H100) / NVLink 5.0 (1.8 TB/s on Blackwell) | ICI copper links within a "cube" of 64 chips |
| Inter-node | InfiniBand NDR/XDR or RoCE via electrical switches | Optical Circuit Switching (OCS) via MEMS mirrors |
| Topology | Fat-tree / Clos (multi-stage switch hierarchy) | 3D torus (direct topology, nearest-neighbor links) |
| Switching | Per-packet electrical switching (buffer, lookup, forward) | Direct fiber path; no switching delay |
| Reconfigurability | Software routing (VLANs, adaptive routing) | Physical topology reconfiguration at job start |
| Power per switch | ~500–1000W per InfiniBand switch ASIC | ~25–50W per OCS unit (MEMS actuators) |
What Optical Circuit Switching actually does
A conventional electrical switch receives a packet, buffers it, reads the header, looks up a routing table, and forwards it to the correct output port. A high-radix switch does this billions of times per second, consuming significant power and adding buffering and forwarding latency at every hop.
An OCS unit is radically simpler. It contains an array of MEMS mirrors: tiny mechanical mirrors that tilt to redirect light from an incoming fiber onto a chosen output fiber. There is no packet processing, no buffering, no routing lookup. Once configured, the optical path is a direct, dedicated fiber connection between two endpoints, as if you had run a physical patch cable between them. Latency is simply the light's propagation time through the fiber, with zero switching overhead.
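The contrast can be made concrete with a toy model. Below, an OCS is reduced to a port permutation (an illustrative sketch, not Google's implementation): "configuring" the switch records which input fiber maps to which output fiber, and after that every bit follows the fixed light path with no per-packet work.

```python
class OpticalCircuitSwitch:
    """Toy model of an OCS: a reconfigurable one-to-one port mapping."""

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.mirror_map: dict[int, int] = {}  # input port -> output port

    def configure(self, circuits: dict[int, int]) -> None:
        # "Tilt the mirrors": install a one-to-one input->output map.
        if len(set(circuits.values())) != len(circuits):
            raise ValueError("two inputs cannot share an output fiber")
        self.mirror_map = dict(circuits)

    def path(self, in_port: int) -> int:
        # No header parsing, no routing table, no buffering: the output
        # is fixed by the mirror position until the next reconfiguration.
        return self.mirror_map[in_port]

ocs = OpticalCircuitSwitch(num_ports=4)
ocs.configure({0: 2, 1: 3, 2: 0, 3: 1})
print(ocs.path(0))  # -> 2
```

An electrical switch, by contrast, would have to execute the equivalent of a lookup for every packet; here the lookup happens once, at configuration time.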
The 3D torus: a direct topology
GPU clusters use indirect topologies — fat-trees and Clos networks where traffic passes through intermediate switch layers. These are general-purpose: any node can reach any other node with bounded hop count, and the topology does not need to match the computation.
TPU pods use a direct topology: a 3D torus where each chip has physical links to its six nearest neighbors (±x, ±y, ±z). There are no intermediate switches — data travels through the chips themselves. On a v5p, 8,960 chips form a 16×20×28 3D torus.
The 3D torus has a critical property: its communication patterns map naturally onto the parallelism dimensions of large model training. Data parallelism maps to one axis, tensor parallelism to another, pipeline parallelism to the third. The all-reduce operation — the dominant collective in training — maps to ring reductions along each torus dimension, utilizing the wrap-around links for optimal bandwidth.
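The wrap-around structure is easy to state in code. This sketch (dimensions from the v5p example above; the coordinate scheme is an assumption for illustration) shows that every chip, even one on a "face" of the grid, has exactly six neighbors because each axis closes into a ring:

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of a chip in a 3D torus."""
    x, y, z = coord
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),  # +/- x ring
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),  # +/- y ring
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),  # +/- z ring
    ]

# A corner chip in a 16x20x28 torus still has six neighbors: the modular
# wrap-around links connect it back to the far faces, closing the rings
# that ring all-reduce runs over.
print(torus_neighbors((0, 0, 0), (16, 20, 28)))
```

Those closed rings along each axis are exactly what ring all-reduce needs: each step sends to one neighbor and receives from the other, so every link on the ring is busy simultaneously.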
Reconfigurability: reshaping the network at job boundaries
The key innovation of OCS is that the topology is physically reconfigurable. At job start time, the OCS units tilt their MEMS mirrors to create the fiber paths needed for that job's torus dimensions. A 4,096-chip pod can be partitioned into four 1,024-chip sub-torus slices, or two 2,048-chip slices, or used as a single 4,096-chip torus — all by reconfiguring the OCS fiber paths without changing any cabling.
This is fundamentally different from GPU cluster scheduling, where the network topology is fixed and jobs are mapped onto it via software routing. In a TPU pod, the physical network reshapes itself to match the computation.
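The slicing arithmetic behind this can be sketched in a few lines. The scheduler logic below is hypothetical (not Google's actual placement code): it only captures the constraint that a requested slice shape must divide the pod shape along every axis, after which the OCS wires each slice's wrap-around links independently.

```python
def num_slices(pod_dims, slice_dims):
    """Count how many equal sub-torus slices a pod can be carved into."""
    if any(p % s for p, s in zip(pod_dims, slice_dims)):
        raise ValueError("slice shape must divide pod shape on every axis")
    total = 1
    for p, s in zip(pod_dims, slice_dims):
        total *= p // s
    return total

pod = (16, 16, 16)                   # a 4,096-chip pod (illustrative shape)
print(num_slices(pod, (16, 8, 8)))   # four 1,024-chip slices -> 4
print(num_slices(pod, (16, 16, 8)))  # two 2,048-chip slices  -> 2
print(num_slices(pod, pod))          # one full-pod torus     -> 1
```

The same physical fibers serve all three configurations; only the mirror positions change between jobs.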
Where OCS + torus wins
- Power efficiency: A single OCS unit consumes ~25–50W — 95% less than an electrical switch ASIC at ~500–1000W. At pod scale (hundreds of inter-rack connections), this saves megawatts of power and the corresponding cooling infrastructure.
- Latency: Direct fiber paths with no store-and-forward delay. All-reduce latency is bounded by the fiber propagation time and the number of torus hops, not by switch buffering.
- Bandwidth efficiency: No oversubscription ratios. In a fat-tree, the higher switch layers are often oversubscribed (e.g., 2:1 or 4:1). In a 3D torus, every link between adjacent chips operates at full ICI bandwidth — there is no shared switch layer to congest.
- Collective performance: Ring all-reduce on torus dimensions achieves near-optimal bandwidth utilization. A full v5p pod can sustain roughly 4 petabits per second of aggregate bisection bandwidth, with no shared switch fabric in the path.
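The no-oversubscription claim can be made concrete with a link count. The sketch below (pure geometry; it deliberately omits per-link bandwidth, since spec figures vary by generation) counts the links severed when a torus is bisected: cutting perpendicular to the longest axis severs one plane of direct links plus one plane of wrap-around links, and every one of those links runs at full ICI rate.

```python
def torus_bisection_links(dims):
    """Count links cut when bisecting a 3D torus across its longest axis."""
    longest = max(dims)
    cross_section = 1
    for d in dims:
        cross_section *= d
    cross_section //= longest          # area of the cut plane, in chips
    # Factor of 2: the direct plane of links plus the wrap-around plane.
    return 2 * cross_section

print(torus_bisection_links((16, 20, 28)))  # v5p shape -> 640 links cut
```

In a fat-tree, the analogous figure is set by the (often oversubscribed) top switch layer rather than by the geometry of the chips themselves.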
Where the GPU approach is genuinely better
- Any-to-any communication: Fat-tree topologies provide bounded latency between any two nodes. In a 3D torus, communication between distant chips (high diameter) requires traversing multiple intermediate chips, adding latency. For workloads with irregular communication patterns, the fat-tree is more predictable.
- Multi-tenancy: OCS topology is set per job. A fat-tree can serve hundreds of independent jobs simultaneously without topology reconfiguration. TPU pods must be sliced at job boundaries, limiting scheduling flexibility.
- Failure resilience: In a fat-tree, a failed switch can be routed around via alternative paths. In a 3D torus, a failed chip breaks the ring along that dimension — the wrap-around link is severed. Google addresses this by overdimensioning pods and using spare chips, but the failure semantics are harsher.
- Vendor ecosystem: InfiniBand and RoCE have broad vendor support (Mellanox/NVIDIA, Arista, Broadcom). OCS is largely Google-internal technology — no third-party ecosystem, no competitive market.
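The latency asymmetry in the first bullet above follows directly from torus geometry. This sketch computes shortest-path hop count between two chips: per axis, traffic takes the shorter of the direct path and the wrap-around path, and the worst case (the torus diameter) grows with the pod's dimensions rather than staying bounded as in a fat-tree.

```python
def torus_hops(a, b, dims):
    """Shortest-path hop count between chips a and b in a 3D torus."""
    hops = 0
    for ai, bi, d in zip(a, b, dims):
        delta = abs(ai - bi)
        hops += min(delta, d - delta)  # direct path vs wrap-around path
    return hops

dims = (16, 20, 28)  # v5p torus shape
print(torus_hops((0, 0, 0), (15, 0, 0), dims))   # wrap-around neighbor: 1 hop
print(torus_hops((0, 0, 0), (8, 10, 14), dims))  # diameter: 8+10+14 = 32 hops
```

Thirty-two chip-to-chip hops for the worst-case pair is fine for nearest-neighbor collectives, but for irregular any-to-any traffic it is far less predictable than a fat-tree's fixed switch-hop count.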
The scaling trajectory
| TPU Gen | ICI Topology | Pod Scale | ICI BW/chip | Key Advancement |
|---|---|---|---|---|
| v4 | 3D torus | 4,096 chips | ~300 GB/s | First OCS-wired 3D torus |
| v5p | 3D torus (16×20×28) | 8,960 chips | 1.2 TB/s | 4× ICI bandwidth, 4 Pb/s aggregate |
| v6e (Trillium) | 2D torus | 256 chips | 800 Gbps | Efficiency-optimized, smaller pods |
| v7 (Ironwood) | 3D torus | 9,216 chips | TBD | 800G OSFP optics, 1.6T upgrade path |
The trajectory is clear: larger torus dimensions, higher per-link bandwidth, and a shift toward 800G and eventually 1.6T optical transceivers on the wrap-around links. The OCS layer scales with the number of inter-rack connections, not the total chip count — making it economically favorable as pods grow.
GPU clusters route packets through switches to achieve generality. TPU pods route light through mirrors to achieve efficiency. The tradeoff is the same one that runs through the entire TPU design: specialize for the workloads that matter, and accept reduced generality as the cost.