TPU Architecture · Memory Design · Systems Economics

Why TPUs Have Less HBM Than GPUs

Published · 6 min read

And why that's a design choice, not a limitation. NVIDIA bets on fat nodes with maximum per-GPU memory. Google bets on thin nodes with maximum inter-chip bandwidth. Both are rational — they optimize for different cost functions.

By Manish KL · Technical Essay
Abstract

An H100 has 80 GB of HBM. A B200 has 192 GB. A TPU v6e (Trillium) has 32 GB — less than half an H100. This looks like Google shipping inferior hardware. It is not. It is a deliberate architectural bet: TPU pods are designed as distributed memory systems from the ground up, where model parallelism across thousands of chips is the default. A single chip does not need to hold the model — the pod does. The ICI interconnect is fast enough (1.2 TB/s on v5p) that cross-chip memory access approaches local HBM speed. Google trades per-chip HBM for more chips at lower per-chip cost, using the torus interconnect to make the aggregate memory feel like a single, very large pool. This is the opposite of NVIDIA's strategy — and it creates a different cost/performance frontier.

32 GB
HBM per TPU v6e chip (Trillium)
192 GB
HBM per B200 GPU — 6× as much as Trillium
851 TB
aggregate HBM in an 8,960-chip v5p pod (95 GB × 8,960)
1.2 TB/s
ICI bandwidth per v5p chip — fast enough to make remote memory viable

Fat nodes vs. thin nodes

The GPU and TPU ecosystems represent two fundamentally different strategies for distributing memory across a training cluster:

[Figure: Fat Nodes vs. Thin Nodes — two memory strategies. NVIDIA fat nodes: 8 × B200 per node (192 GB HBM, 4.8 TB/s bandwidth each), 8 × 192 GB = 1.5 TB per node; strategy: fit more model per GPU, reduce parallelism; interconnect: NVLink (intra-node) + InfiniBand (inter-node) — few fat nodes, less communication. Google thin nodes: 8,960 chips × 95 GB = 851 TB per pod; strategy: distribute everything, use the interconnect as memory; interconnect: ICI 3D torus (1.2 TB/s/chip) — many thin nodes, fast communication.]
Figure 1. NVIDIA maximizes per-GPU HBM to reduce the need for cross-GPU communication. Google minimizes per-chip HBM and maximizes inter-chip bandwidth, treating the pod's aggregate memory as a single distributed pool.

The economics behind thin nodes

HBM is the most expensive component on a modern accelerator. On an H100, HBM accounts for roughly 30–40% of the chip's bill-of-materials cost. On a B200 with 192 GB, the HBM cost is even higher. Every gigabyte of HBM adds ~$10–25 to the chip cost.

Google's strategy is a cost trade: instead of spending $2,000–4,000 on HBM per chip, spend less on HBM and invest in interconnect bandwidth. The ICI links that connect TPU chips within a pod are custom silicon — but they are cheaper to scale than HBM, because interconnect bandwidth scales with the number of links (wires and fibers), while HBM capacity scales with chip stacking (which has yield and thermal limits).
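The trade is easy to quantify with the article's own numbers. A minimal sketch, assuming the ~$10–25/GB figure above (chip names and capacities are from the article; the per-GB range is the article's estimate, not vendor pricing):

```python
# Back-of-envelope HBM bill-of-materials per accelerator, using the article's
# assumed range of ~$10-25 per GB of HBM. Illustrative only, not vendor pricing.
COST_PER_GB = (10, 25)  # USD per GB, low/high assumption

hbm_gb = {"H100": 80, "B200": 192, "TPU v5p": 95, "TPU v6e": 32}

hbm_cost = {
    name: (gb * COST_PER_GB[0], gb * COST_PER_GB[1]) for name, gb in hbm_gb.items()
}
for name, (lo, hi) in hbm_cost.items():
    print(f"{name:8s} {hbm_gb[name]:3d} GB -> ${lo:,}-${hi:,} of HBM")
```

The B200 row lands in the $2,000–4,000+ range the paragraph mentions, while a thin v6e chip carries well under $1,000 of HBM under the same assumption.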

The cost equation: For a 405B model (810 GB of weights at fp16), NVIDIA needs 5 B200 GPUs (5 × 192 GB = 960 GB) or 11 H100 GPUs (11 × 80 GB = 880 GB). Google needs ~9 v5p chips (9 × 95 GB = 855 GB) or ~26 v6e chips (26 × 32 GB = 832 GB). The number of chips is similar — but the per-chip cost is different because HBM is the dominant cost driver.
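The chip counts above are just a ceiling division of the weight footprint by per-chip HBM. A sketch that reproduces them (weights only; activations, optimizer state, and KV cache are deliberately ignored, so these are minimums):

```python
import math

# Minimum chips needed to hold 810 GB of fp16 weights (405B params x 2 bytes),
# ignoring activations, optimizer state, and KV cache.
WEIGHTS_GB = 405 * 2  # 810 GB

hbm_per_chip = {"B200": 192, "H100": 80, "TPU v5p": 95, "TPU v6e": 32}
chips_needed = {name: math.ceil(WEIGHTS_GB / gb) for name, gb in hbm_per_chip.items()}

for name, n in chips_needed.items():
    print(f"{name:8s}: {n:2d} chips -> {n * hbm_per_chip[name]} GB aggregate HBM")
```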

When the interconnect is fast enough, remote memory is local memory

The key assumption behind the thin-node strategy is that ICI bandwidth is fast enough to make cross-chip memory access indistinguishable from local HBM access for the workloads that matter.

On a v5p chip, the ICI provides 1.2 TB/s of bidirectional bandwidth — roughly 43% of the chip's HBM bandwidth (2.8 TB/s). For all-reduce operations (which dominate training communication), this ratio is sufficient: the gradient reduction step can proceed at ICI speed without bottlenecking the training loop, because the next forward pass overlaps with the communication.
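One way to sanity-check the "remote memory is nearly local" claim is to compare how long it takes to stream one chip's worth of data over each path. A rough sketch using the article's v5p numbers (bandwidths only; real access patterns also involve latency and link contention, which this ignores):

```python
# Time to stream one v5p chip's 95 GB of data via local HBM vs. via ICI.
# Bandwidth figures are the article's; this is a ratio check, not a benchmark.
GB = 1e9
hbm_bw = 2.8e12   # v5p HBM bandwidth, bytes/s
ici_bw = 1.2e12   # v5p ICI bidirectional bandwidth, bytes/s
shard = 95 * GB

t_hbm = shard / hbm_bw
t_ici = shard / ici_bw
print(f"HBM: {t_hbm*1e3:.0f} ms, ICI: {t_ici*1e3:.0f} ms, ratio {t_ici/t_hbm:.2f}x")
```

Remote access is only ~2.3× slower in raw bandwidth terms — close enough that overlap with compute can hide the difference, which is the whole bet.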

For weight shards accessed via tensor parallelism, the ICI latency (~1–2 μs for nearest-neighbor access) is low enough that the MXU can be kept busy while waiting for remote weight tiles — especially with the software-managed memory system's ability to prefetch tiles via DMA.
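The prefetch pattern described here is ordinary double buffering: start the DMA for tile i+1, then compute on tile i. A toy sketch in plain Python (the `fetch` and `compute` callables are invented stand-ins; on a real TPU this scheduling is done by the compiler with asynchronous DMA, not by user code):

```python
# Toy double-buffering loop: request the next remote weight tile, then compute
# on the current one, so the fetch would overlap the compute on real hardware.
def run(tiles, fetch, compute):
    buf = fetch(tiles[0])  # warm up: fetch the first tile
    results = []
    for i in range(len(tiles)):
        # Start the "DMA" for the next tile before computing on the current one.
        nxt = fetch(tiles[i + 1]) if i + 1 < len(tiles) else None
        results.append(compute(buf))
        buf = nxt
    return results

# Example with trivial stand-ins for fetch and compute:
print(run([1, 2, 3], fetch=lambda t: t * 10, compute=lambda b: b + 1))
```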

What this enables: the pod as a single logical accelerator

The practical consequence of the thin-node + fast-interconnect design is that a TPU pod behaves more like a single, very large accelerator than like a cluster of independent machines. The aggregate memory of a v5p pod (851 TB across 8,960 chips) is treated by the XLA compiler as a single distributed memory space. The compiler shards weights, activations, and optimizer state across the pod's chips using GSPMD (Generalized SPMD) partitioning, and generates the ICI communication patterns needed to stitch the computation together.
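The sharding arithmetic behind this is simple in principle: map each tensor dimension onto a mesh axis and give every chip one tile. A sketch with made-up sizes (the mesh shape and weight matrix are hypothetical, chosen only so the 8,960-chip pod factors into a 2D mesh and the dimensions divide evenly; GSPMD also handles padding when they don't, which this ignores):

```python
# GSPMD-style sharding arithmetic: partition a weight matrix over a 2D chip
# mesh. Mesh shape and matrix size are hypothetical, chosen to divide evenly.
mesh = (64, 140)          # 64 * 140 = 8,960 chips as a logical 2D mesh
W = (131072, 143360)      # a made-up fp16 weight matrix, divisible by the mesh

assert W[0] % mesh[0] == 0 and W[1] % mesh[1] == 0
shard = (W[0] // mesh[0], W[1] // mesh[1])
shard_bytes = shard[0] * shard[1] * 2   # fp16 = 2 bytes per element
print(f"per-chip shard: {shard}, {shard_bytes / 1e6:.1f} MB")
```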

This is fundamentally different from the GPU model, where parallelism is orchestrated explicitly in the training stack (NCCL collectives, DeepSpeed, or Megatron-LM) and the programmer must choose and manage the parallelism strategy by hand. On a TPU pod, the compiler handles the distribution — the programmer writes single-device code, and XLA partitions it across the pod.

Where fat nodes genuinely win

The thin-node strategy has real limitations, and the GPU fat-node approach is genuinely better in several scenarios: high-concurrency inference serving, where per-GPU KV-cache capacity determines throughput; small-scale fine-tuning, which fits on 1–4 GPUs without any distribution; and multi-tenant serving, where each GPU serves a model independently and pod-scale interconnect goes unused.

The convergence question

Both strategies are evolving toward each other. NVIDIA is increasing interconnect bandwidth (NVLink 5.0 provides 1.8 TB/s bidirectional) while maintaining large per-GPU HBM. Google is increasing per-chip HBM (v5p has 95 GB, up from v4's 32 GB) while maintaining fast ICI. The question is whether they converge on a single design point or whether the optimal point remains workload-dependent.

The answer is likely workload-dependent. Training frontier models — where the model is too large for any single chip and must be distributed across hundreds or thousands of chips — favors the thin-node approach: more chips, more aggregate memory, faster interconnect. Inference serving — where per-request memory capacity determines concurrency — favors the fat-node approach: more HBM per chip, less reliance on cross-chip communication for latency-sensitive token generation.

Workload → favored design
Frontier model training → thin nodes (TPU): model is distributed regardless; more chips at lower cost wins
Inference serving (high concurrency) → fat nodes (GPU): per-GPU KV-cache capacity determines throughput
Fine-tuning (small-scale) → fat nodes (GPU): fits on 1–4 GPUs without distribution
Large-batch research training → thin nodes (TPU): pod-scale compiler-managed distribution is simpler
Multi-tenant serving → fat nodes (GPU): each GPU serves independently

Less HBM is not inferior hardware. It is a different design bet: that aggregate memory across a fast interconnect is cheaper and more scalable than per-chip memory behind a slow one. Whether that bet pays off depends on the workload — and for the workloads Google cares about most, it does.