TPU Architecture · Memory Design · Systems Economics

Why TPUs Have Less HBM Than GPUs

Published · 6 min read

And why that's a design choice, not a limitation. NVIDIA bets on fat nodes with maximum per-GPU memory. Google bets on thin nodes with maximum inter-chip bandwidth. Both are rational — they optimize for different cost functions.

By Manish KL · Technical Essay
Abstract

An H100 has 80 GB of HBM. A B200 has 192 GB. A TPU v6e (Trillium) has 32 GB — less than half an H100. This looks like Google shipping inferior hardware. It is not. It is a deliberate architectural bet: TPU pods are designed as distributed memory systems from the ground up, where model parallelism across thousands of chips is the default. A single chip does not need to hold the model — the pod does. The ICI interconnect is fast enough (1.2 TB/s on v5p) that cross-chip memory access approaches local HBM speed. Google trades per-chip HBM for more chips at lower per-chip cost, using the torus interconnect to make the aggregate memory feel like a single, very large pool. This is the opposite of NVIDIA's strategy — and it creates a different cost/performance frontier.

32 GB
HBM per TPU v6e chip (Trillium)
192 GB
HBM per B200 GPU — 6× as much as Trillium
851 TB
aggregate HBM in an 8,960-chip v5p pod (95 GB × 8,960)
1.2 TB/s
ICI bandwidth per v5p chip — fast enough to make remote memory viable

Fat nodes vs. thin nodes

The GPU and TPU ecosystems represent two fundamentally different strategies for distributing memory across a training cluster:

[Figure: Fat Nodes vs. Thin Nodes — two memory strategies. NVIDIA fat nodes: 8 × B200 per node (192 GB HBM, 4.8 TB/s bandwidth each), 8 × 192 GB = 1.5 TB per node; strategy: fit more model per GPU, reduce parallelism; interconnect: NVLink (intra-node) + InfiniBand (inter-node) — few fat nodes, less communication. Google thin nodes: 8,960 chips × 95 GB = 851 TB per pod; strategy: distribute everything, use the interconnect as memory; interconnect: ICI 3D torus (1.2 TB/s/chip) — many thin nodes, fast communication.]
Figure 1. NVIDIA maximizes per-GPU HBM to reduce the need for cross-GPU communication. Google minimizes per-chip HBM and maximizes inter-chip bandwidth, treating the pod's aggregate memory as a single distributed pool.

The economics behind thin nodes

HBM is the most expensive component on a modern accelerator. On an H100, HBM accounts for roughly 30–40% of the chip's bill-of-materials cost. On a B200 with 192 GB, the HBM cost is even higher. Every gigabyte of HBM adds ~$10–25 to the chip cost.

Google's strategy is a cost trade: instead of spending $2,000–4,000 on HBM per chip, spend less on HBM and invest in interconnect bandwidth. The ICI links that connect TPU chips within a pod are custom silicon — but they are cheaper to scale than HBM, because interconnect bandwidth scales with the number of links (wires and fibers), while HBM capacity scales with chip stacking (which has yield and thermal limits).
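The trade is easy to quantify with the article's own numbers. A minimal sketch, assuming the ~$10–25/GB figure above (chip names and capacities are from the article; the per-GB range is the article's estimate, not vendor pricing):

```python
# Back-of-envelope HBM bill-of-materials per accelerator, using the article's
# assumed range of ~$10-25 per GB of HBM. Illustrative only, not vendor pricing.
COST_PER_GB = (10, 25)  # USD per GB, low/high assumption

hbm_gb = {"H100": 80, "B200": 192, "TPU v5p": 95, "TPU v6e": 32}

hbm_cost = {
    name: (gb * COST_PER_GB[0], gb * COST_PER_GB[1]) for name, gb in hbm_gb.items()
}
for name, (lo, hi) in hbm_cost.items():
    print(f"{name:8s} {hbm_gb[name]:3d} GB -> ${lo:,}-${hi:,} of HBM")
```

The B200 row lands in the $2,000–4,000+ range the paragraph mentions, while a thin v6e chip carries well under $1,000 of HBM under the same assumption.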

The cost equation: For a 405B model (810 GB of weights at fp16), NVIDIA needs 5 B200 GPUs (5 × 192 GB = 960 GB) or 11 H100 GPUs (11 × 80 GB = 880 GB). Google needs ~9 v5p chips (9 × 95 GB = 855 GB) or ~26 v6e chips (26 × 32 GB = 832 GB). The number of chips is similar — but the per-chip cost is different because HBM is the dominant cost driver.
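The chip counts above are just a ceiling division of the weight footprint by per-chip HBM. A sketch that reproduces them (weights only; activations, optimizer state, and KV cache are deliberately ignored, so these are minimums):

```python
import math

# Minimum chips needed to hold 810 GB of fp16 weights (405B params x 2 bytes),
# ignoring activations, optimizer state, and KV cache.
WEIGHTS_GB = 405 * 2  # 810 GB

hbm_per_chip = {"B200": 192, "H100": 80, "TPU v5p": 95, "TPU v6e": 32}
chips_needed = {name: math.ceil(WEIGHTS_GB / gb) for name, gb in hbm_per_chip.items()}

for name, n in chips_needed.items():
    print(f"{name:8s}: {n:2d} chips -> {n * hbm_per_chip[name]} GB aggregate HBM")
```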

When the interconnect is fast enough, remote memory is local memory

The key assumption behind the thin-node strategy is that ICI bandwidth is fast enough to make cross-chip memory access indistinguishable from local HBM access for the workloads that matter.

On a v5p chip, the ICI provides 1.2 TB/s of bidirectional bandwidth — roughly 43% of the chip's HBM bandwidth (2.8 TB/s). For all-reduce operations (which dominate training communication), this ratio is sufficient: the gradient reduction step can proceed at ICI speed without bottlenecking the training loop, because the next forward pass overlaps with the communication.
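One way to sanity-check the "remote memory is nearly local" claim is to compare how long it takes to stream one chip's worth of data over each path. A rough sketch using the article's v5p numbers (bandwidths only; real access patterns also involve latency and link contention, which this ignores):

```python
# Time to stream one v5p chip's 95 GB of data via local HBM vs. via ICI.
# Bandwidth figures are the article's; this is a ratio check, not a benchmark.
GB = 1e9
hbm_bw = 2.8e12   # v5p HBM bandwidth, bytes/s
ici_bw = 1.2e12   # v5p ICI bidirectional bandwidth, bytes/s
shard = 95 * GB

t_hbm = shard / hbm_bw
t_ici = shard / ici_bw
print(f"HBM: {t_hbm*1e3:.0f} ms, ICI: {t_ici*1e3:.0f} ms, ratio {t_ici/t_hbm:.2f}x")
```

Remote access is only ~2.3× slower in raw bandwidth terms — close enough that overlap with compute can hide the difference, which is the whole bet.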

For weight shards accessed via tensor parallelism, the ICI latency (~1–2 μs for nearest-neighbor access) is low enough that the MXU can be kept busy while waiting for remote weight tiles — especially with the software-managed memory system's ability to prefetch tiles via DMA.
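The prefetch pattern described here is ordinary double buffering: start the DMA for tile i+1, then compute on tile i. A toy sketch in plain Python (the `fetch` and `compute` callables are invented stand-ins; on a real TPU this scheduling is done by the compiler with asynchronous DMA, not by user code):

```python
# Toy double-buffering loop: request the next remote weight tile, then compute
# on the current one, so the fetch would overlap the compute on real hardware.
def run(tiles, fetch, compute):
    buf = fetch(tiles[0])  # warm up: fetch the first tile
    results = []
    for i in range(len(tiles)):
        # Start the "DMA" for the next tile before computing on the current one.
        nxt = fetch(tiles[i + 1]) if i + 1 < len(tiles) else None
        results.append(compute(buf))
        buf = nxt
    return results

# Example with trivial stand-ins for fetch and compute:
print(run([1, 2, 3], fetch=lambda t: t * 10, compute=lambda b: b + 1))
```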

What this enables: the pod as a single logical accelerator

The practical consequence of the thin-node + fast-interconnect design is that a TPU pod behaves more like a single, very large accelerator than like a cluster of independent machines. The aggregate memory of a v5p pod (851 TB across 8,960 chips) is treated by the XLA compiler as a single distributed memory space. The compiler shards weights, activations, and optimizer state across the pod's chips using GSPMD (Generalized SPMD) partitioning, and generates the ICI communication patterns needed to stitch the computation together.
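The sharding arithmetic behind this is simple in principle: map each tensor dimension onto a mesh axis and give every chip one tile. A sketch with made-up sizes (the mesh shape and weight matrix are hypothetical, chosen only so the 8,960-chip pod factors into a 2D mesh and the dimensions divide evenly; GSPMD also handles padding when they don't, which this ignores):

```python
# GSPMD-style sharding arithmetic: partition a weight matrix over a 2D chip
# mesh. Mesh shape and matrix size are hypothetical, chosen to divide evenly.
mesh = (64, 140)          # 64 * 140 = 8,960 chips as a logical 2D mesh
W = (131072, 143360)      # a made-up fp16 weight matrix, divisible by the mesh

assert W[0] % mesh[0] == 0 and W[1] % mesh[1] == 0
shard = (W[0] // mesh[0], W[1] // mesh[1])
shard_bytes = shard[0] * shard[1] * 2   # fp16 = 2 bytes per element
print(f"per-chip shard: {shard}, {shard_bytes / 1e6:.1f} MB")
```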

This is fundamentally different from the GPU model, where parallelism is orchestrated explicitly in the training stack (NCCL collectives, DeepSpeed, or Megatron-LM) and the programmer must choose and manage the parallelism strategy by hand. On a TPU pod, the compiler handles the distribution — the programmer writes single-device code, and XLA partitions it across the pod.

Where fat nodes genuinely win

The thin-node strategy has real limitations, and the GPU fat-node approach is genuinely better in several scenarios: high-concurrency inference serving, where per-GPU KV-cache capacity determines throughput; small-scale fine-tuning, which fits on 1–4 GPUs without any distribution; and multi-tenant serving, where each GPU serves a model independently and pod-scale interconnect goes unused.

The convergence question

Both strategies are evolving toward each other. NVIDIA is increasing interconnect bandwidth (NVLink 5.0 provides 1.8 TB/s bidirectional) while maintaining large per-GPU HBM. Google is increasing per-chip HBM (v5p has 95 GB, up from v4's 32 GB) while maintaining fast ICI. The question is whether they converge on a single design point or whether the optimal point remains workload-dependent.

The answer is likely workload-dependent. Training frontier models — where the model is too large for any single chip and must be distributed across hundreds or thousands of chips — favors the thin-node approach: more chips, more aggregate memory, faster interconnect. Inference serving — where per-request memory capacity determines concurrency — favors the fat-node approach: more HBM per chip, less reliance on cross-chip communication for latency-sensitive token generation.

Workload → favored design
Frontier model training → thin nodes (TPU): model is distributed regardless; more chips at lower cost wins
Inference serving (high concurrency) → fat nodes (GPU): per-GPU KV-cache capacity determines throughput
Fine-tuning (small-scale) → fat nodes (GPU): fits on 1–4 GPUs without distribution
Large-batch research training → thin nodes (TPU): pod-scale compiler-managed distribution is simpler
Multi-tenant serving → fat nodes (GPU): each GPU serves independently

Less HBM is not inferior hardware. It is a different design bet: that aggregate memory across a fast interconnect is cheaper and more scalable than per-chip memory behind a slow one. Whether that bet pays off depends on the workload — and for the workloads Google cares about most, it does.