TPU Architecture · Interconnects · Optical Networking

Optical Circuit Switching and the 3D Torus


GPU clusters route packets through electrical switches. TPU pods route light through MEMS mirrors. The physical network reshapes itself to match the computation's communication pattern.

By Manish KL · ~16 min read · Technical Essay
Abstract

NVIDIA's GPU clusters are built on a layered network hierarchy: NVLink within nodes, NVSwitch across nodes in a domain, InfiniBand (or RoCE) across domains — all using electrical packet switching with store-and-forward semantics. Google's TPU pods use something architecturally different: a 3D torus topology wired with Optical Circuit Switching (OCS). OCS uses MEMS (micro-electro-mechanical systems) mirrors to physically redirect optical fibers — no packet switching, no routing tables, no store-and-forward latency. The topology is reconfigured at job start time, so a 4,096-chip pod can be partitioned into arbitrary sub-torus slices matched to each training job's parallelism dimensions. This essay examines why OCS + torus is a fundamentally different networking philosophy, where it wins, and where the GPU approach is genuinely better.

- 4 Pb/s · aggregate bisection bandwidth across a v5p pod switching fabric
- 95% · power reduction vs. electrical packet switching per OCS unit
- 8,960 · chips in a single TPU v5p pod (3D torus: 16×20×28)
- 1.2 TB/s · bidirectional ICI bandwidth per TPU v5p chip

Two philosophies of interconnect

The GPU datacenter and the TPU pod solve the same problem — connecting thousands of accelerators with enough bandwidth for all-reduce, all-gather, and point-to-point transfers — but with fundamentally different networking primitives.

| Property | GPU Cluster (NVIDIA) | TPU Pod (Google) |
| --- | --- | --- |
| Intra-node | NVLink 4.0/5.0 (900 GB/s bidirectional per GPU) | ICI copper links within a "cube" of 64 chips |
| Inter-node | InfiniBand NDR/XDR or RoCE via electrical switches | Optical Circuit Switching (OCS) via MEMS mirrors |
| Topology | Fat-tree / Clos (multi-stage switch hierarchy) | 3D torus (direct topology, nearest-neighbor links) |
| Switching | Store-and-forward packet switching | Direct fiber path, no switching delay |
| Reconfigurability | Software routing (VLANs, adaptive routing) | Physical topology reconfiguration at job start |
| Power per switch | ~500–1000 W per InfiniBand switch ASIC | ~25–50 W per OCS unit (MEMS actuators) |

What Optical Circuit Switching actually does

A conventional electrical switch receives a packet, buffers it, reads the header, looks up a routing table, and forwards it to the correct output port. This happens billions of times per second, consuming significant power and adding store-and-forward latency at each hop.
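The per-hop cost compounds along the path. As a rough sketch, with all latency and propagation figures taken from the illustrative ranges quoted in this essay rather than measurements:

```python
# Rough latency model: electrical packet switching vs. a direct optical path.
# The per-hop and per-meter figures are illustrative assumptions.

def packet_switched_latency_us(hops: int, per_hop_us: float = 2.0) -> float:
    """Each electrical switch adds buffering + lookup + forwarding delay."""
    return hops * per_hop_us

def optical_path_latency_us(fiber_m: float, ns_per_m: float = 5.0) -> float:
    """A circuit-switched path pays only light propagation through the fiber."""
    return fiber_m * ns_per_m / 1000.0

# A 3-hop fat-tree path vs. 50 m of patched fiber:
print(packet_switched_latency_us(3))   # 6.0 (microseconds)
print(optical_path_latency_us(50.0))   # 0.25 (microseconds)
```

The crossover is stark: even a short switched path costs microseconds in buffering, while tens of meters of dedicated fiber cost a fraction of one.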

An OCS unit is radically simpler. It contains an array of MEMS mirrors — tiny mechanical mirrors that tilt to redirect light from an incoming optical fiber to a specific output fiber. There is no packet processing, no buffering, no routing lookup. Once configured, the optical path is a direct, dedicated fiber connection between two endpoints — as if you ran a physical patch cable between them. The latency is the speed of light through the fiber, with zero switching overhead.
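One minimal way to think about an OCS unit is as a programmable one-to-one port mapping. The class, method names, and port count below are hypothetical, purely to illustrate the circuit-switching model — real units expose vendor-specific control planes:

```python
# Toy model of an OCS unit: a MEMS mirror array amounts to a programmable
# partial permutation from input fibers to output fibers. No state is
# consulted per packet -- light simply follows the configured path.

class OpticalCircuitSwitch:
    def __init__(self, ports: int):
        self.ports = ports
        self.cross_connect: dict[int, int] = {}  # input port -> output port

    def configure(self, mapping: dict[int, int]) -> None:
        """Tilt the mirrors: install a full or partial permutation.
        Each output fiber may be targeted by at most one input."""
        assert len(set(mapping.values())) == len(mapping), "output fiber reused"
        self.cross_connect = dict(mapping)

    def path(self, in_port: int):
        """Where does light entering in_port exit? None if dark."""
        return self.cross_connect.get(in_port)

ocs = OpticalCircuitSwitch(ports=128)   # port count is an arbitrary example
ocs.configure({0: 5, 1: 7, 2: 3})
print(ocs.path(0))   # 5
```

The key property the model captures: `configure` is the only expensive operation (mirrors physically move); `path` is just the light propagating.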

[Figure 1 diagram: electrical packet switching (GPU A → switch L1 → switch L2 → GPU B, with buffer → lookup → forward at every hop; ~1–5 μs latency per hop, ~500–1000 W per switch ASIC, any-to-any routing) vs. optical circuit switching (TPU A → MEMS mirror array → TPU B, fiber in → mirror tilt → fiber out; speed-of-light latency ~5 ns/m, ~25–50 W for MEMS actuators, dedicated path reconfigured per job).]
Figure 1. Packet switching buffers and routes at every hop. OCS provides a direct optical path with zero switching overhead — but can only be reconfigured at job boundaries, not per-packet.

The 3D torus: a direct topology

GPU clusters use indirect topologies — fat-trees and Clos networks where traffic passes through intermediate switch layers. These are general-purpose: any node can reach any other node with bounded hop count, and the topology does not need to match the computation.

TPU pods use a direct topology: a 3D torus where each chip has physical links to its six nearest neighbors (±x, ±y, ±z). There are no intermediate switches — data travels through the chips themselves. On a v5p, 8,960 chips form a 16×20×28 3D torus.
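The wrap-around addressing can be sketched in a few lines; the dimensions below match the v5p torus described above:

```python
# Nearest-neighbor addressing in a 3D torus with wrap-around links.
# Dimensions match the v5p pod described above (16 x 20 x 28 = 8,960 chips).

DIMS = (16, 20, 28)

def neighbors(x: int, y: int, z: int, dims=DIMS):
    """Return the six (+/-x, +/-y, +/-z) neighbors, wrapping at the edges."""
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

# A "corner" chip still has six neighbors -- the wrap-around links close
# every ring, which is exactly what a torus adds over a plain 3D mesh.
print(neighbors(0, 0, 0))
# [(1, 0, 0), (15, 0, 0), (0, 1, 0), (0, 19, 0), (0, 0, 1), (0, 0, 27)]
```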

The 3D torus has a critical property: its communication patterns map naturally onto the parallelism dimensions of large model training. Data parallelism maps to one axis, tensor parallelism to another, pipeline parallelism to the third. The all-reduce operation — the dominant collective in training — maps to ring reductions along each torus dimension, utilizing the wrap-around links for optimal bandwidth.

The topology-workload alignment: In a 3D torus, all-reduce along one axis involves only the chips in that dimension's ring — typically 16–28 chips. The data moves in a ring pattern that utilizes 100% of the link bandwidth in each direction. In a fat-tree, all-reduce must traverse the switch hierarchy, competing with other traffic at each layer. The torus approach is less general but more efficient for the specific collectives that dominate training.
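The "100% of the link bandwidth" claim follows from the standard ring all-reduce decomposition (reduce-scatter followed by all-gather): each chip sends 2(p−1)/p × N bytes over its ring link, approaching 2N regardless of ring length. A small sketch with illustrative sizes:

```python
# Per-chip traffic for an N-byte ring all-reduce over p chips:
# reduce-scatter and all-gather each take p-1 steps of N/p bytes.
# Sizes below are illustrative, not tied to real ICI framing.

def ring_all_reduce_bytes_per_chip(p: int, n_bytes: float) -> float:
    """Bytes each chip sends over its ring link for an N-byte all-reduce."""
    return 2 * (p - 1) * n_bytes / p

for p in (16, 20, 28):   # the three v5p torus axis lengths
    gib = ring_all_reduce_bytes_per_chip(p, 4 * 2**30) / 2**30
    print(f"ring of {p}: {gib:.2f} GiB moved per chip for a 4 GiB all-reduce")
```

Every link in the ring carries data at every step, which is why the collective saturates the axis bandwidth rather than funneling through shared switch layers.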

Reconfigurability: reshaping the network at job boundaries

The key innovation of OCS is that the topology is physically reconfigurable. At job start time, the OCS units tilt their MEMS mirrors to create the fiber paths needed for that job's torus dimensions. A 4,096-chip pod can be partitioned into four 1,024-chip sub-torus slices, or two 2,048-chip slices, or used as a single 4,096-chip torus — all by reconfiguring the OCS fiber paths without changing any cabling.
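The partitioning arithmetic can be sketched as tiling the pod by per-axis factors. The 16×16×16 arrangement of a 4,096-chip pod is an assumption for illustration, and `sub_torus_slices` is a hypothetical helper — real slice shapes follow Google's scheduler constraints:

```python
# Sketch of job-start reconfiguration: carve a pod into sub-torus slices
# by choosing slice dimensions that divide each pod axis evenly.

from itertools import product

POD = (16, 16, 16)  # assumed arrangement of a 4,096-chip pod

def sub_torus_slices(slice_dims):
    """Yield the origin coordinate of each slice when the pod is tiled
    by slices of the given dimensions (each axis must divide evenly)."""
    assert all(P % s == 0 for P, s in zip(POD, slice_dims))
    steps = [range(0, P, s) for P, s in zip(POD, slice_dims)]
    return list(product(*steps))

# Four 1,024-chip slices: keep two axes whole, quarter the third.
origins = sub_torus_slices((16, 16, 4))
print(len(origins))   # 4
```

The OCS layer realizes each tiling by re-closing the wrap-around links at the slice boundaries, so every slice is itself a full torus rather than an open mesh.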

This is fundamentally different from GPU cluster scheduling, where the network topology is fixed and jobs are mapped onto it via software routing. In a TPU pod, the physical network reshapes itself to match the computation.

Where OCS + torus wins

- Power: an OCS unit draws ~25–50 W against ~500–1000 W for an electrical switch ASIC — roughly the 95% per-unit reduction cited above.
- Latency: a configured circuit is a direct fiber path; there is no per-hop buffering, routing lookup, or store-and-forward delay.
- Collectives: ring all-reduce along each torus axis uses the wrap-around links at full link bandwidth, without competing for shared switch layers.
- Partitioning: a pod can be carved into sub-torus slices shaped to each job's parallelism dimensions, with no recabling.

Where the GPU approach is genuinely better

- Generality: a fat-tree/Clos fabric gives any-to-any reachability with bounded hop count; the workload's communication pattern need not match the topology.
- Mid-job flexibility: software routing (VLANs, adaptive routing) can adapt packet by packet, while OCS paths change only at job boundaries.
- Irregular traffic: workloads whose collectives do not map cleanly onto torus axes fit better on an indirect, general-purpose topology.

The scaling trajectory

| TPU Gen | ICI Topology | Pod Scale | ICI BW/chip | Key Advancement |
| --- | --- | --- | --- | --- |
| v4 | 3D torus | 4,096 chips | ~300 GB/s | First OCS-wired 3D torus |
| v5p | 3D torus (16×20×28) | 8,960 chips | 1.2 TB/s | 4× ICI bandwidth, 4 Pb/s aggregate |
| v6e (Trillium) | 2D torus | 256 chips | 800 Gbps | Efficiency-optimized, smaller pods |
| v7 (Ironwood) | 3D torus | 9,216 chips | TBD | 800G OSFP optics, 1.6T upgrade path |

The trajectory is clear: larger torus dimensions, higher per-link bandwidth, and a shift toward 800G and eventually 1.6T optical transceivers on the wrap-around links. The OCS layer scales with the number of inter-rack connections, not the total chip count — making it economically favorable as pods grow.
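A back-of-envelope sketch of that scaling claim, under the simplifying assumption that the OCS fabric switches only the wrap-around (face) links of the torus — the number of rings along each axis is the product of the other two dimensions, so the port count grows with the pod's surface rather than its volume:

```python
# Wrap-around link count for an X x Y x Z torus: one wrap link closes each
# ring, and the rings along axis i number (product of the other two dims).
# This counts only face links -- an assumption, not the full OCS wiring.

def wraparound_links(dims):
    X, Y, Z = dims
    return {"x": Y * Z, "y": X * Z, "z": X * Y}

links = wraparound_links((16, 20, 28))   # the v5p torus
print(links)                  # {'x': 560, 'y': 448, 'z': 320}
print(sum(links.values()))    # 1328 -- far fewer than 8,960 chips
```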

GPU clusters route packets through switches to achieve generality. TPU pods route light through mirrors to achieve efficiency. The tradeoff is the same one that runs through the entire TPU design: specialize for the workloads that matter, and accept reduced generality as the cost.