Fat nodes vs. thin nodes
The GPU and TPU ecosystems represent two fundamentally different strategies for distributing memory across a training cluster.
The economics behind thin nodes
HBM is the most expensive component on a modern accelerator. On an H100, HBM accounts for roughly 30–40% of the chip's bill-of-materials cost. On a B200 with 192 GB, the HBM cost is even higher. Every gigabyte of HBM adds ~$10–25 to the chip cost.
Google's strategy is a cost trade: instead of spending $2,000–4,000 on HBM per chip, spend less on HBM and invest in interconnect bandwidth. The ICI links that connect TPU chips within a pod are custom silicon — but they are cheaper to scale than HBM, because interconnect bandwidth scales with the number of links (wires and fibers), while HBM capacity scales with chip stacking (which has yield and thermal limits).
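The trade can be made concrete with back-of-envelope arithmetic using the ~$10–25/GB figure above. A minimal sketch (the 80 GB H100 capacity is a widely cited spec rather than a number stated in this section):

```python
# Illustrative HBM cost arithmetic using the ~$10-25/GB figure quoted above.
# Capacities: H100 (80 GB, widely cited spec), B200 (192 GB), TPU v5p (95 GB).
COST_PER_GB = (10, 25)  # rough low/high estimate, USD per GB of HBM

def hbm_cost_range(capacity_gb):
    """Return (low, high) estimated HBM cost for one chip, in USD."""
    return tuple(capacity_gb * c for c in COST_PER_GB)

for name, gb in [("H100", 80), ("B200", 192), ("TPU v5p", 95)]:
    low, high = hbm_cost_range(gb)
    print(f"{name}: {gb} GB -> ${low:,}-${high:,}")
```

At 192 GB, the B200's HBM alone lands in the $1,900–4,800 range, consistent with the "$2,000–4,000 per chip" figure the thin-node strategy avoids spending.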
When the interconnect is fast enough, remote memory is local memory
The key assumption behind the thin-node strategy is that ICI bandwidth is fast enough to make cross-chip memory access indistinguishable from local HBM access for the workloads that matter.
On a v5p chip, the ICI provides 1.2 TB/s of bidirectional bandwidth, roughly 43% of the chip's HBM bandwidth (2.8 TB/s). For all-reduce operations, which dominate training communication, this ratio is sufficient: the gradient reduction can proceed at ICI speed without stalling the training loop, because the reduction overlaps with computation still in flight (the tail of the backward pass and the start of the next forward pass).
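The 43% ratio and the all-reduce cost can both be checked with a standard ring all-reduce model, in which each chip moves 2·(N−1)/N of the gradient buffer over its links. A sketch using the v5p numbers above (the 10 GB gradient size and 256-chip slice are arbitrary illustrative choices, not from the text):

```python
# Back-of-envelope check of the ICI-to-HBM bandwidth ratio, plus the
# classic ring all-reduce cost model: each chip moves 2*(N-1)/N of the
# buffer over its link bandwidth.
ICI_BW = 1.2e12   # bytes/s, v5p bidirectional ICI (from the text)
HBM_BW = 2.8e12   # bytes/s, v5p HBM (from the text)

ratio = ICI_BW / HBM_BW  # ~0.43, the ~43% quoted above

def ring_allreduce_seconds(grad_bytes, n_chips, link_bw):
    """Time to all-reduce grad_bytes across n_chips in a ring."""
    return 2 * (n_chips - 1) / n_chips * grad_bytes / link_bw

# Example: 10 GB of gradients reduced over a 256-chip slice -> ~17 ms,
# short enough to hide behind the adjacent compute of a large step.
t = ring_allreduce_seconds(10e9, 256, ICI_BW)
print(f"ratio = {ratio:.2f}, all-reduce = {t * 1e3:.1f} ms")
```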
For weight shards accessed via tensor parallelism, the ICI latency (~1–2 μs for nearest-neighbor access) is low enough that the MXU can be kept busy while waiting for remote weight tiles — especially with the software-managed memory system's ability to prefetch tiles via DMA.
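The latency-hiding condition is simple: a remote tile fetch is free as long as it finishes before the MXU is done computing on the previous tile. A sketch of that arithmetic, assuming double-buffered prefetch (the tile size and compute time below are made-up illustrative values):

```python
# Latency-hiding arithmetic for remote weight tiles fetched over ICI.
# A prefetched tile is "hidden" if it arrives before the compute on the
# current tile finishes, i.e. fetch_time <= compute_time_per_tile.
ICI_BW = 1.2e12        # bytes/s (from the text)
ICI_LATENCY = 1.5e-6   # s, midpoint of the ~1-2 us quoted above

def fetch_time(tile_bytes, bw=ICI_BW, latency=ICI_LATENCY):
    """Latency plus transfer time for one DMA'd tile."""
    return latency + tile_bytes / bw

def is_hidden(tile_bytes, compute_time_per_tile):
    """True if double-buffered prefetch fully hides the remote fetch."""
    return fetch_time(tile_bytes) <= compute_time_per_tile

# A 4 MB tile takes ~1.5 us latency + ~3.3 us transfer, about 4.8 us total;
# it is hidden whenever the matmul on the previous tile takes at least that.
```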
What this enables: the pod as a single logical accelerator
The practical consequence of the thin-node + fast-interconnect design is that a TPU pod behaves more like a single, very large accelerator than like a cluster of independent machines. The aggregate memory of a v5p pod (851 TB across 8,960 chips) is treated by the XLA compiler as a single distributed memory space. The compiler shards weights, activations, and optimizer state across the pod's chips using GSPMD (Generalized SPMD) partitioning, and generates the ICI communication patterns needed to stitch the computation together.
This is fundamentally different from the GPU model, where parallelism is orchestrated explicitly (communication via NCCL, sharding via frameworks such as DeepSpeed or Megatron-LM) and the programmer must choose and manage the parallelism strategy. On a TPU pod, the compiler handles the distribution: the programmer writes single-device code, and XLA partitions it across the pod.
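The shape of the idea can be shown with a toy model: the user's code sees one logical array, while a partitioner splits it across chips and inserts the communication. This is a pure-Python illustration of the GSPMD concept, not XLA's actual API or machinery:

```python
# Toy model of the GSPMD idea: single-device code plus a partitioner that
# shards arrays across chips and inserts the combining communication.
# Pure-Python illustration only; XLA's real partitioner works on HLO graphs.

def shard(array, n_chips):
    """Split a flat weight list evenly across chips (last shard may be short)."""
    per = -(-len(array) // n_chips)  # ceiling division
    return [array[i * per:(i + 1) * per] for i in range(n_chips)]

def sharded_dot(x, weight_shards):
    """Each 'chip' computes a partial dot product on its local shard;
    summing the partials stands in for the ICI all-reduce."""
    partials, offset = [], 0
    for w in weight_shards:
        partials.append(sum(xi * wi for xi, wi in zip(x[offset:offset + len(w)], w)))
        offset += len(w)
    return sum(partials)  # the "all-reduce"

w = list(range(8))          # single-device view of the weights
shards = shard(w, 4)        # compiler-style partition across 4 "chips"
assert sharded_dot([1.0] * 8, shards) == sum(w)
```

The point of the toy is the division of labor: `sharded_dot` never appears in user code; the user writes an ordinary dot product, and the partitioner rewrites it into this sharded form.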
Where fat nodes genuinely win
The thin-node strategy has real limitations, and the GPU fat-node approach is genuinely better in several scenarios:
- Inference with per-request state: Inference workloads create per-request KV caches that are not easily shareable across chips. More HBM per chip means more concurrent KV caches, which directly increases inference throughput. This is why NVIDIA's B200 pushed to 192 GB — inference serving benefits enormously from local memory capacity.
- Small-batch fine-tuning: Fine-tuning tasks that fit on a small number of GPUs don't benefit from pod-scale distribution. A single B200 with 192 GB can hold a 70B model's bf16 weights (~140 GB) locally and fine-tune it with memory-efficient methods such as LoRA, with no model parallelism at all: simpler, faster, and more cost-effective than distributing across multiple TPU chips.
- Heterogeneous workloads: GPU clusters can run many independent jobs simultaneously, each using a small number of GPUs. TPU pods are designed for single large jobs that occupy the entire pod or a large slice. Multi-tenant, heterogeneous serving is harder on the thin-node model.
- Ecosystem and tooling: The entire deep learning ecosystem (PyTorch, CUDA, NCCL) is built around the fat-node model. Distributed training frameworks assume large per-GPU memory and relatively slow inter-GPU communication. TPU's thin-node model requires different software (JAX, XLA, GSPMD), which narrows the developer pool.
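The inference argument in the first bullet reduces to capacity arithmetic: concurrent requests are bounded by the HBM left over after the weights. A sketch using the standard KV-cache size estimate (the 80-layer, 8-KV-head, head-dim-128 shape is a hypothetical 70B-class configuration, not a published spec):

```python
# Capacity arithmetic behind the fat-node inference argument.
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
# The model shape here is a hypothetical 70B-class configuration.

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one generated token occupies across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_concurrent_requests(hbm_gb, model_gb, ctx_len, kv_per_token):
    """How many full-context requests fit in the HBM left after the weights."""
    free = (hbm_gb - model_gb) * 1e9
    return int(free // (ctx_len * kv_per_token))

per_tok = kv_cache_bytes_per_token(80, 8, 128)  # 327,680 bytes per token
model_gb = 70e9 * 2 / 1e9                       # 140 GB of bf16 weights

# 192 GB (B200-class) leaves ~52 GB for KV caches; an 80 GB chip cannot
# even hold the bf16 weights locally, never mind any per-request state.
print(max_concurrent_requests(192, model_gb, 4096, per_tok))
```

Every extra gigabyte of local HBM converts directly into additional concurrent requests, which is the throughput effect the bullet describes.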
The convergence question
Both strategies are evolving toward each other. NVIDIA is increasing interconnect bandwidth (NVLink 5.0 provides 1.8 TB/s bidirectional) while maintaining large per-GPU HBM. Google is increasing per-chip HBM (v5p has 95 GB, up from v4's 32 GB) while maintaining fast ICI. The question is whether they converge on a single design point or whether the optimal point remains workload-dependent.
The answer is likely workload-dependent. Training frontier models — where the model is too large for any single chip and must be distributed across hundreds or thousands of chips — favors the thin-node approach: more chips, more aggregate memory, faster interconnect. Inference serving — where per-request memory capacity determines concurrency — favors the fat-node approach: more HBM per chip, less reliance on cross-chip communication for latency-sensitive token generation.
| Workload | Favors Fat Nodes (GPU) | Favors Thin Nodes (TPU) |
|---|---|---|
| Frontier model training | | ✓ — model is distributed regardless; more chips at lower cost wins |
| Inference serving (high concurrency) | ✓ — per-GPU KV-cache capacity determines throughput | |
| Fine-tuning (small-scale) | ✓ — fits on 1–4 GPUs without distribution | |
| Large-batch research training | | ✓ — pod-scale compiler-managed distribution is simpler |
| Multi-tenant serving | ✓ — each GPU serves independently | |
Less HBM is not inferior hardware. It is a different design bet: that aggregate memory across a fast interconnect is cheaper and more scalable than per-chip memory behind a slow one. Whether that bet pays off depends on the workload — and for the workloads Google cares about most, it does.