Two philosophies of memory management
GPU and TPU architects face the same fundamental problem: the gap between compute speed and memory speed. Both need to keep arithmetic units fed with data. But they solve this problem in opposite ways.
What VMEM and CMEM actually are
On a GPU, "shared memory" is a programmer-visible scratchpad, while the L1 and L2 caches are hardware-managed and transparent. On a TPU, everything is a scratchpad: there is no transparent cache, and every byte's location is determined by the compiler.
- VMEM (Vector Memory): A fast SRAM scratchpad local to each TensorCore. It holds the "hot" data that the MXU (systolic array) and vector unit consume. The XLA compiler issues explicit DMA commands to transfer data between HBM and VMEM before each computation begins. If the data isn't in VMEM when the MXU needs it, the systolic array stalls — there is no fallback "cache miss" path.
- CMEM (Common Memory): A shared SRAM pool accessible to all TensorCores on the chip. It serves as a producer-consumer buffer: one TensorCore writes partial results to CMEM, and another reads them, without round-tripping through HBM. This eliminates the inter-core communication overhead that GPUs handle via L2 cache or global memory.
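The contract implied by the two bullets above can be made concrete with a small model. The sketch below is purely illustrative (the `Scratchpad` class, its method names, and the byte sizes are all invented for this example, not a real TPU API): compute may only read data that an explicit DMA has already placed in the scratchpad, and a missing tile is a scheduling bug rather than a cache miss the hardware can absorb.

```python
# Hypothetical model of software-managed scratchpad semantics.
# Not a real TPU API — names and sizes are invented for illustration.

class Scratchpad:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = {}  # tile_id -> size in bytes

    def dma_in(self, tile_id, size):
        """Explicit HBM -> VMEM transfer, issued by the compiler's schedule."""
        if sum(self.resident.values()) + size > self.capacity:
            raise MemoryError(f"VMEM overflow placing {tile_id}")
        self.resident[tile_id] = size

    def dma_out(self, tile_id):
        """Explicit eviction back to HBM (or simply a discard)."""
        del self.resident[tile_id]

    def read(self, tile_id):
        # No fallback path: a tile that was never DMA'd in is a
        # compile-time scheduling bug, not a runtime cache miss.
        if tile_id not in self.resident:
            raise RuntimeError(f"{tile_id} not resident in VMEM")
        return tile_id

vmem = Scratchpad(capacity_bytes=128)
vmem.dma_in("W_tile_0", 64)
vmem.read("W_tile_0")   # fine: explicitly placed
# vmem.read("W_tile_1") would raise — there is no miss-and-fill path.
```

The asymmetry with a hardware cache is the whole point: a GPU cache would silently fetch the missing line; here the schedule itself must be correct.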
Why this works beautifully for matrix multiplication
Dense matrix multiplication — the operation that dominates transformer training and inference — has perfectly predictable access patterns. The compiler knows, at compile time, exactly which tiles of weight matrices and activation matrices will be needed, in what order, and at what time. This makes it an ideal target for software-managed memory:
- Tile scheduling: The compiler partitions the matrix into tiles that fit in VMEM, then generates a DMA schedule that prefetches the next tile while the current one is being multiplied in the MXU. The systolic array never stalls because the next tile is always ready.
- Zero cache-miss overhead: There is no cache hierarchy to miss in. Data is either in VMEM (and available at SRAM speed) or in HBM (and will be DMA'd explicitly). The latency is deterministic — there is no variance from cache contention, eviction, or thrashing.
- No wasted capacity: Hardware caches inevitably hold stale data (cache lines that were loaded but won't be used again). A software-managed scratchpad eliminates this waste — every byte in VMEM was explicitly placed there because the compiler determined it would be needed.
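The tile-scheduling point above is classic double buffering, and it can be sketched as a tiny "compiler" that emits a static event program. This is an illustrative sketch, not XLA's actual scheduling algorithm, and the `dma_start`/`dma_wait`/`matmul` event names are invented for the example:

```python
# Illustrative compile-time tile scheduler (not XLA's real algorithm).
# With the tile sequence fully known ahead of time, the scheduler emits
# a fixed program that prefetches tile i+1 while the MXU consumes tile i.

def double_buffered_schedule(num_tiles):
    """Emit a static (event, tile) program for double-buffered matmul tiles."""
    program = [("dma_start", 0)]                  # warm-up: fetch first tile
    for i in range(num_tiles):
        program.append(("dma_wait", i))           # tile i must be resident
        if i + 1 < num_tiles:
            program.append(("dma_start", i + 1))  # prefetch overlaps compute
        program.append(("matmul", i))             # MXU consumes tile i
    return program

for event in double_buffered_schedule(3):
    print(event)
```

Note that the entire sequence is fixed before the program runs: every `matmul` is preceded by the `dma_wait` for its tile, and each prefetch is issued before the previous tile's compute, so transfer and arithmetic overlap.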
Why this breaks for irregular workloads
The software-managed model has a critical assumption: access patterns must be knowable at compile time. When they are not — when the access pattern depends on runtime data — the compiler cannot generate an optimal DMA schedule. It must fall back to conservative strategies that dramatically reduce efficiency.
Embedding lookups: the anti-pattern
Recommendation models use massive embedding tables — often hundreds of gigabytes — where each training example accesses a sparse, data-dependent subset of the table. The indices are unknown until the input arrives. The compiler cannot prefetch what it cannot predict. On a GPU, the hardware cache absorbs some of this unpredictability — frequently accessed embeddings naturally stay in L2. On a TPU, with no hardware cache, every embedding access is a full-latency HBM read.
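The "L2 absorbs the hot set" claim can be checked with a toy simulation. The sketch below uses made-up sizes and a synthetic skewed trace (80% of lookups drawn from a small hot set, a rough stand-in for real recommendation traffic) to show that an LRU cache — the hardware-cache analogue — hits often on exactly the accesses a scratchpad cannot prefetch:

```python
# Toy simulation of skewed embedding lookups through an LRU cache.
# Sizes and the 80/20 skew are invented for illustration.

from collections import OrderedDict
import random

def lru_hit_rate(accesses, cache_rows):
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_rows:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(accesses)

random.seed(0)
TABLE_ROWS, CACHE_ROWS = 100_000, 4_096
hot = [random.randrange(1_000) for _ in range(80_000)]        # 80% hot set
cold = [random.randrange(TABLE_ROWS) for _ in range(20_000)]  # 20% long tail
trace = hot + cold
random.shuffle(trace)

print(f"LRU hit rate: {lru_hit_rate(trace, CACHE_ROWS):.2f}")
```

Because the 1,000-row hot set fits comfortably in the 4,096-row cache, the hit rate lands well above half. The scratchpad equivalent of this trace is a full-latency HBM read on every lookup, since none of the indices exist at compile time.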
This is exactly why Google built SparseCore: a separate tiled dataflow processor with its own scratchpad (spMEM), designed specifically to handle the gather/scatter/reduce operations that the MXU + VMEM path cannot efficiently serve. SparseCore is, architecturally, an admission that software-managed memory has a boundary — and irregular workloads live on the other side of it.
The compiler is the memory controller
On a GPU, the memory controller is a hardware unit that decides, at runtime, which cache line to evict and which to keep. On a TPU, XLA is the memory controller. It makes the same decisions — what data to place where, when to prefetch, when to evict — but at compile time, statically, before the program runs.
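One concrete consequence of deciding at compile time: with the full access trace in hand, the "compiler" can run Belady's optimal replacement policy (evict the block reused furthest in the future), which a runtime hardware cache can only approximate with heuristics like LRU. The sketch below is illustrative, with a hand-picked toy trace:

```python
# Offline-optimal (Belady/MIN) eviction vs. runtime LRU on a known trace.
# Illustrative sketch; the trace and capacity are invented for the example.

from collections import OrderedDict

def belady_misses(trace, capacity):
    resident, misses = set(), 0
    for i, key in enumerate(trace):
        if key in resident:
            continue
        misses += 1
        if len(resident) == capacity:
            def next_use(k):
                try:
                    return trace.index(k, i + 1)
                except ValueError:
                    return float("inf")     # never reused: ideal victim
            # Evict the block whose next use is furthest in the future.
            resident.remove(max(resident, key=next_use))
        resident.add(key)
    return misses

def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses

trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
print("optimal:", belady_misses(trace, capacity=3))  # 5 misses
print("LRU:    ", lru_misses(trace, capacity=3))     # 6 misses
```

On this toy trace the offline-optimal schedule takes 5 misses to LRU's 6: knowing the future lets the planner keep `C` out and `D` in at exactly the right moments. That information advantage is what XLA exploits, and what a hardware cache fundamentally lacks.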
This has profound implications for what workloads TPUs can serve efficiently:
| Workload Property | GPU (Hardware Cache) | TPU (Software Scratchpad) |
|---|---|---|
| Static, predictable access | Good — cache learns the pattern | Excellent — compiler schedules perfectly |
| Dynamic shapes / control flow | Good — cache adapts at runtime | Poor — compiler must pad/speculate |
| Sparse, data-dependent access | Moderate — L2 absorbs hot set | Poor — requires SparseCore bypass |
| Latency determinism | Low — cache misses cause variance | High — no cache misses possible |
| Energy efficiency | Moderate — cache tag checks cost power | High — no tag arrays, no coherence |
The energy argument
Hardware caches are not free. Tag arrays, comparators, eviction logic, and coherence protocols consume silicon area and power. On a modern GPU, the L2 cache alone can consume 5–10% of the chip's power budget. By eliminating hardware caches entirely, the TPU reclaims that power budget for compute and memory bandwidth — a direct efficiency advantage for workloads that don't need the cache's flexibility.
This is a significant part of the 67% energy-efficiency improvement of Trillium (v6e) over v5e. The gain comes not just from better transistors but from a fundamentally leaner memory interface that does not pay the overhead of hardware cache management.
Where this leads
The software-managed memory model is a bet on compiler technology. As XLA improves — better tiling strategies, better prefetch scheduling, better handling of dynamic shapes — the range of workloads that TPUs handle efficiently expands. But the fundamental constraint remains: the compiler operates on information available at compile time. Runtime-dependent access patterns will always require either hardware caching (the GPU approach) or purpose-built irregular-access processors (the SparseCore approach).
The interesting question is not which approach is "better" — it is which approach is better suited to the workloads that matter most. For dense transformer training and inference, the TPU's software-managed model is demonstrably more efficient. For general-purpose ML serving with dynamic batching, variable-length inputs, and diverse model architectures, the GPU's hardware-managed model remains more robust.
GPUs hide the memory system behind hardware caches. TPUs expose it to the compiler. The tradeoff is determinism vs. generality — and it explains more about the TPU's strengths and limitations than any FLOPS number ever will.