TPU Architecture · Interconnects · Optical Networking

Optical Circuit Switching and the 3D Torus


GPU clusters route packets through electrical switches. TPU pods route light through MEMS mirrors. The physical network reshapes itself to match the computation's communication pattern.

By Manish KL · ~16 min read · Technical Essay
Abstract

NVIDIA's GPU clusters are built on a layered network hierarchy: NVLink within nodes, NVSwitch across nodes in a domain, InfiniBand (or RoCE) across domains — all using electrical packet switching with store-and-forward semantics. Google's TPU pods use something architecturally different: a 3D torus topology wired with Optical Circuit Switching (OCS). OCS uses MEMS (micro-electro-mechanical systems) mirrors to physically redirect optical fibers — no packet switching, no routing tables, no store-and-forward latency. The topology is reconfigured at job start time, so a 4,096-chip pod can be partitioned into arbitrary sub-torus slices matched to each training job's parallelism dimensions. This essay examines why OCS + torus is a fundamentally different networking philosophy, where it wins, and where the GPU approach is genuinely better.

- 4 Pb/s · aggregate bisection bandwidth across a v5p pod switching fabric
- 95% · power reduction vs. electrical packet switching per OCS unit
- 8,960 · chips in a single TPU v5p pod (3D torus: 16×20×28)
- 1.2 TB/s · bidirectional ICI bandwidth per TPU v5p chip

Two philosophies of interconnect

The GPU datacenter and the TPU pod solve the same problem — connecting thousands of accelerators with enough bandwidth for all-reduce, all-gather, and point-to-point transfers — but with fundamentally different networking primitives.

| Property | GPU Cluster (NVIDIA) | TPU Pod (Google) |
| --- | --- | --- |
| Intra-node | NVLink 4.0/5.0 (900 GB/s bidirectional per GPU) | ICI copper links within a "cube" of 64 chips |
| Inter-node | InfiniBand NDR/XDR or RoCE via electrical switches | Optical Circuit Switching (OCS) via MEMS mirrors |
| Topology | Fat-tree / Clos (multi-stage switch hierarchy) | 3D torus (direct topology, nearest-neighbor links) |
| Switching | Store-and-forward packet switching | Direct fiber path, no switching delay |
| Reconfigurability | Software routing (VLANs, adaptive routing) | Physical topology reconfiguration at job start |
| Power per switch | ~500–1000 W per InfiniBand switch ASIC | ~25–50 W per OCS unit (MEMS actuators) |

What Optical Circuit Switching actually does

A conventional electrical switch receives a packet, buffers it, reads the header, looks up a routing table, and forwards it to the correct output port. This happens billions of times per second, consuming significant power and adding store-and-forward latency at each hop.
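The per-hop cost compounds along the path. As a rough sketch, with all latency and propagation figures taken from the illustrative ranges quoted in this essay rather than measurements:

```python
# Rough latency model: electrical packet switching vs. a direct optical path.
# The per-hop and per-meter figures are illustrative assumptions.

def packet_switched_latency_us(hops: int, per_hop_us: float = 2.0) -> float:
    """Each electrical switch adds buffering + lookup + forwarding delay."""
    return hops * per_hop_us

def optical_path_latency_us(fiber_m: float, ns_per_m: float = 5.0) -> float:
    """A circuit-switched path pays only light propagation through the fiber."""
    return fiber_m * ns_per_m / 1000.0

# A 3-hop fat-tree path vs. 50 m of patched fiber:
print(packet_switched_latency_us(3))   # 6.0 (microseconds)
print(optical_path_latency_us(50.0))   # 0.25 (microseconds)
```

The crossover is stark: even a short switched path costs microseconds in buffering, while tens of meters of dedicated fiber cost a fraction of one.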

An OCS unit is radically simpler. It contains an array of MEMS mirrors — tiny mechanical mirrors that tilt to redirect light from an incoming optical fiber to a specific output fiber. There is no packet processing, no buffering, no routing lookup. Once configured, the optical path is a direct, dedicated fiber connection between two endpoints — as if you ran a physical patch cable between them. The latency is the speed of light through the fiber, with zero switching overhead.
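One minimal way to think about an OCS unit is as a programmable one-to-one port mapping. The class, method names, and port count below are hypothetical, purely to illustrate the circuit-switching model — real units expose vendor-specific control planes:

```python
# Toy model of an OCS unit: a MEMS mirror array amounts to a programmable
# partial permutation from input fibers to output fibers. No state is
# consulted per packet -- light simply follows the configured path.

class OpticalCircuitSwitch:
    def __init__(self, ports: int):
        self.ports = ports
        self.cross_connect: dict[int, int] = {}  # input port -> output port

    def configure(self, mapping: dict[int, int]) -> None:
        """Tilt the mirrors: install a full or partial permutation.
        Each output fiber may be targeted by at most one input."""
        assert len(set(mapping.values())) == len(mapping), "output fiber reused"
        self.cross_connect = dict(mapping)

    def path(self, in_port: int):
        """Where does light entering in_port exit? None if dark."""
        return self.cross_connect.get(in_port)

ocs = OpticalCircuitSwitch(ports=128)   # port count is an arbitrary example
ocs.configure({0: 5, 1: 7, 2: 3})
print(ocs.path(0))   # 5
```

The key property the model captures: `configure` is the only expensive operation (mirrors physically move); `path` is just the light propagating.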

[Figure 1 diagram: electrical packet switching (GPU A → switch L1 → switch L2 → GPU B, with buffer → lookup → forward at every hop; ~1–5 μs latency per hop, ~500–1000 W per switch ASIC, any-to-any routing) vs. optical circuit switching (TPU A → MEMS mirror array → TPU B, fiber in → mirror tilt → fiber out; speed-of-light latency ~5 ns/m, ~25–50 W for MEMS actuators, dedicated path reconfigured per job).]
Figure 1. Packet switching buffers and routes at every hop. OCS provides a direct optical path with zero switching overhead — but can only be reconfigured at job boundaries, not per-packet.

The 3D torus: a direct topology

GPU clusters use indirect topologies — fat-trees and Clos networks where traffic passes through intermediate switch layers. These are general-purpose: any node can reach any other node with bounded hop count, and the topology does not need to match the computation.

TPU pods use a direct topology: a 3D torus where each chip has physical links to its six nearest neighbors (±x, ±y, ±z). There are no intermediate switches — data travels through the chips themselves. On a v5p, 8,960 chips form a 16×20×28 3D torus.
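The wrap-around addressing can be sketched in a few lines; the dimensions below match the v5p torus described above:

```python
# Nearest-neighbor addressing in a 3D torus with wrap-around links.
# Dimensions match the v5p pod described above (16 x 20 x 28 = 8,960 chips).

DIMS = (16, 20, 28)

def neighbors(x: int, y: int, z: int, dims=DIMS):
    """Return the six (+/-x, +/-y, +/-z) neighbors, wrapping at the edges."""
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

# A "corner" chip still has six neighbors -- the wrap-around links close
# every ring, which is exactly what a torus adds over a plain 3D mesh.
print(neighbors(0, 0, 0))
# [(1, 0, 0), (15, 0, 0), (0, 1, 0), (0, 19, 0), (0, 0, 1), (0, 0, 27)]
```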

The 3D torus has a critical property: its communication patterns map naturally onto the parallelism dimensions of large model training. Data parallelism maps to one axis, tensor parallelism to another, pipeline parallelism to the third. The all-reduce operation — the dominant collective in training — maps to ring reductions along each torus dimension, utilizing the wrap-around links for optimal bandwidth.

The topology-workload alignment: In a 3D torus, all-reduce along one axis involves only the chips in that dimension's ring — typically 16–28 chips. The data moves in a ring pattern that utilizes 100% of the link bandwidth in each direction. In a fat-tree, all-reduce must traverse the switch hierarchy, competing with other traffic at each layer. The torus approach is less general but more efficient for the specific collectives that dominate training.
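The "100% of the link bandwidth" claim follows from the standard ring all-reduce decomposition (reduce-scatter followed by all-gather): each chip sends 2(p−1)/p × N bytes over its ring link, approaching 2N regardless of ring length. A small sketch with illustrative sizes:

```python
# Per-chip traffic for an N-byte ring all-reduce over p chips:
# reduce-scatter and all-gather each take p-1 steps of N/p bytes.
# Sizes below are illustrative, not tied to real ICI framing.

def ring_all_reduce_bytes_per_chip(p: int, n_bytes: float) -> float:
    """Bytes each chip sends over its ring link for an N-byte all-reduce."""
    return 2 * (p - 1) * n_bytes / p

for p in (16, 20, 28):   # the three v5p torus axis lengths
    gib = ring_all_reduce_bytes_per_chip(p, 4 * 2**30) / 2**30
    print(f"ring of {p}: {gib:.2f} GiB moved per chip for a 4 GiB all-reduce")
```

Every link in the ring carries data at every step, which is why the collective saturates the axis bandwidth rather than funneling through shared switch layers.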

Reconfigurability: reshaping the network at job boundaries

The key innovation of OCS is that the topology is physically reconfigurable. At job start time, the OCS units tilt their MEMS mirrors to create the fiber paths needed for that job's torus dimensions. A 4,096-chip pod can be partitioned into four 1,024-chip sub-torus slices, or two 2,048-chip slices, or used as a single 4,096-chip torus — all by reconfiguring the OCS fiber paths without changing any cabling.
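The partitioning arithmetic can be sketched as tiling the pod by per-axis factors. The 16×16×16 arrangement of a 4,096-chip pod is an assumption for illustration, and `sub_torus_slices` is a hypothetical helper — real slice shapes follow Google's scheduler constraints:

```python
# Sketch of job-start reconfiguration: carve a pod into sub-torus slices
# by choosing slice dimensions that divide each pod axis evenly.

from itertools import product

POD = (16, 16, 16)  # assumed arrangement of a 4,096-chip pod

def sub_torus_slices(slice_dims):
    """Yield the origin coordinate of each slice when the pod is tiled
    by slices of the given dimensions (each axis must divide evenly)."""
    assert all(P % s == 0 for P, s in zip(POD, slice_dims))
    steps = [range(0, P, s) for P, s in zip(POD, slice_dims)]
    return list(product(*steps))

# Four 1,024-chip slices: keep two axes whole, quarter the third.
origins = sub_torus_slices((16, 16, 4))
print(len(origins))   # 4
```

The OCS layer realizes each tiling by re-closing the wrap-around links at the slice boundaries, so every slice is itself a full torus rather than an open mesh.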

This is fundamentally different from GPU cluster scheduling, where the network topology is fixed and jobs are mapped onto it via software routing. In a TPU pod, the physical network reshapes itself to match the computation.

Where OCS + torus wins

- Power: an OCS unit draws ~25–50 W against ~500–1000 W for an electrical switch ASIC — roughly the 95% per-unit reduction cited above.
- Latency: a configured circuit is a direct fiber path; there is no per-hop buffering, routing lookup, or store-and-forward delay.
- Collectives: ring all-reduce along each torus axis uses the wrap-around links at full link bandwidth, without competing for shared switch layers.
- Partitioning: a pod can be carved into sub-torus slices shaped to each job's parallelism dimensions, with no recabling.

Where the GPU approach is genuinely better

- Generality: a fat-tree/Clos fabric gives any-to-any reachability with bounded hop count; the workload's communication pattern need not match the topology.
- Mid-job flexibility: software routing (VLANs, adaptive routing) can adapt packet by packet, while OCS paths change only at job boundaries.
- Irregular traffic: workloads whose collectives do not map cleanly onto torus axes fit better on an indirect, general-purpose topology.

The scaling trajectory

| TPU Gen | ICI Topology | Pod Scale | ICI BW/chip | Key Advancement |
| --- | --- | --- | --- | --- |
| v4 | 3D torus | 4,096 chips | ~300 GB/s | First OCS-wired 3D torus |
| v5p | 3D torus (16×20×28) | 8,960 chips | 1.2 TB/s | 4× ICI bandwidth, 4 Pb/s aggregate |
| v6e (Trillium) | 2D torus | 256 chips | 800 Gbps | Efficiency-optimized, smaller pods |
| v7 (Ironwood) | 3D torus | 9,216 chips | TBD | 800G OSFP optics, 1.6T upgrade path |

The trajectory is clear: larger torus dimensions, higher per-link bandwidth, and a shift toward 800G and eventually 1.6T optical transceivers on the wrap-around links. The OCS layer scales with the number of inter-rack connections, not the total chip count — making it economically favorable as pods grow.
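A back-of-envelope sketch of that scaling claim, under the simplifying assumption that the OCS fabric switches only the wrap-around (face) links of the torus — the number of rings along each axis is the product of the other two dimensions, so the port count grows with the pod's surface rather than its volume:

```python
# Wrap-around link count for an X x Y x Z torus: one wrap link closes each
# ring, and the rings along axis i number (product of the other two dims).
# This counts only face links -- an assumption, not the full OCS wiring.

def wraparound_links(dims):
    X, Y, Z = dims
    return {"x": Y * Z, "y": X * Z, "z": X * Y}

links = wraparound_links((16, 20, 28))   # the v5p torus
print(links)                  # {'x': 560, 'y': 448, 'z': 320}
print(sum(links.values()))    # 1328 -- far fewer than 8,960 chips
```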

GPU clusters route packets through switches to achieve generality. TPU pods route light through mirrors to achieve efficiency. The tradeoff is the same one that runs through the entire TPU design: specialize for the workloads that matter, and accept reduced generality as the cost.