Two philosophies of memory management
GPU and TPU architects face the same fundamental problem: the gap between compute speed and memory speed. Both need to keep arithmetic units fed with data. But they solve this problem in opposite ways.
What VMEM and CMEM actually are
On a GPU, "shared memory" is a programmer-visible scratchpad, while the L1 and L2 caches are hardware-managed and transparent. On a TPU, everything is a scratchpad: there is no transparent cache, and every byte's location is determined by the compiler.
- VMEM (Vector Memory): A fast SRAM scratchpad local to each TensorCore. It holds the "hot" data that the MXU (systolic array) and vector unit consume. The XLA compiler issues explicit DMA commands to transfer data between HBM and VMEM before each computation begins. If the data isn't in VMEM when the MXU needs it, the systolic array stalls — there is no fallback "cache miss" path.
- CMEM (Common Memory): A shared SRAM pool accessible to all TensorCores on the chip. It serves as a producer-consumer buffer: one TensorCore writes partial results to CMEM, and another reads them, without round-tripping through HBM. This eliminates the inter-core communication overhead that GPUs handle via L2 cache or global memory.
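The contract implied by the two bullets above can be made concrete with a small model. The sketch below is purely illustrative (the `Scratchpad` class, its method names, and the byte sizes are all invented for this example, not a real TPU API): compute may only read data that an explicit DMA has already placed in the scratchpad, and a missing tile is a scheduling bug rather than a cache miss the hardware can absorb.

```python
# Hypothetical model of software-managed scratchpad semantics.
# Not a real TPU API — names and sizes are invented for illustration.

class Scratchpad:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = {}  # tile_id -> size in bytes

    def dma_in(self, tile_id, size):
        """Explicit HBM -> VMEM transfer, issued by the compiler's schedule."""
        if sum(self.resident.values()) + size > self.capacity:
            raise MemoryError(f"VMEM overflow placing {tile_id}")
        self.resident[tile_id] = size

    def dma_out(self, tile_id):
        """Explicit eviction back to HBM (or simply a discard)."""
        del self.resident[tile_id]

    def read(self, tile_id):
        # No fallback path: a tile that was never DMA'd in is a
        # compile-time scheduling bug, not a runtime cache miss.
        if tile_id not in self.resident:
            raise RuntimeError(f"{tile_id} not resident in VMEM")
        return tile_id

vmem = Scratchpad(capacity_bytes=128)
vmem.dma_in("W_tile_0", 64)
vmem.read("W_tile_0")   # fine: explicitly placed
# vmem.read("W_tile_1") would raise — there is no miss-and-fill path.
```

The asymmetry with a hardware cache is the whole point: a GPU cache would silently fetch the missing line; here the schedule itself must be correct.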
Why this works beautifully for matrix multiplication
Dense matrix multiplication — the operation that dominates transformer training and inference — has perfectly predictable access patterns. The compiler knows, at compile time, exactly which tiles of weight matrices and activation matrices will be needed, in what order, and at what time. This makes it an ideal target for software-managed memory:
- Tile scheduling: The compiler partitions the matrix into tiles that fit in VMEM, then generates a DMA schedule that prefetches the next tile while the current one is being multiplied in the MXU. The systolic array never stalls because the next tile is always ready.
- Zero cache-miss overhead: There is no cache hierarchy to miss in. Data is either in VMEM (and available at SRAM speed) or in HBM (and will be DMA'd explicitly). The latency is deterministic — there is no variance from cache contention, eviction, or thrashing.
- No wasted capacity: Hardware caches inevitably hold stale data (cache lines that were loaded but won't be used again). A software-managed scratchpad eliminates this waste — every byte in VMEM was explicitly placed there because the compiler determined it would be needed.
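The tile-scheduling point above is classic double buffering, and it can be sketched as a tiny "compiler" that emits a static event program. This is an illustrative sketch, not XLA's actual scheduling algorithm, and the `dma_start`/`dma_wait`/`matmul` event names are invented for the example:

```python
# Illustrative compile-time tile scheduler (not XLA's real algorithm).
# With the tile sequence fully known ahead of time, the scheduler emits
# a fixed program that prefetches tile i+1 while the MXU consumes tile i.

def double_buffered_schedule(num_tiles):
    """Emit a static (event, tile) program for double-buffered matmul tiles."""
    program = [("dma_start", 0)]                  # warm-up: fetch first tile
    for i in range(num_tiles):
        program.append(("dma_wait", i))           # tile i must be resident
        if i + 1 < num_tiles:
            program.append(("dma_start", i + 1))  # prefetch overlaps compute
        program.append(("matmul", i))             # MXU consumes tile i
    return program

for event in double_buffered_schedule(3):
    print(event)
```

Note that the entire sequence is fixed before the program runs: every `matmul` is preceded by the `dma_wait` for its tile, and each prefetch is issued before the previous tile's compute, so transfer and arithmetic overlap.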
Why this breaks for irregular workloads
The software-managed model has a critical assumption: access patterns must be knowable at compile time. When they are not — when the access pattern depends on runtime data — the compiler cannot generate an optimal DMA schedule. It must fall back to conservative strategies that dramatically reduce efficiency.
Embedding lookups: the anti-pattern
Recommendation models use massive embedding tables — often hundreds of gigabytes — where each training example accesses a sparse, data-dependent subset of the table. The indices are unknown until the input arrives. The compiler cannot prefetch what it cannot predict. On a GPU, the hardware cache absorbs some of this unpredictability — frequently accessed embeddings naturally stay in L2. On a TPU, with no hardware cache, every embedding access is a full-latency HBM read.
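The "L2 absorbs the hot set" claim can be checked with a toy simulation. The sketch below uses made-up sizes and a synthetic skewed trace (80% of lookups drawn from a small hot set, a rough stand-in for real recommendation traffic) to show that an LRU cache — the hardware-cache analogue — hits often on exactly the accesses a scratchpad cannot prefetch:

```python
# Toy simulation of skewed embedding lookups through an LRU cache.
# Sizes and the 80/20 skew are invented for illustration.

from collections import OrderedDict
import random

def lru_hit_rate(accesses, cache_rows):
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_rows:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(accesses)

random.seed(0)
TABLE_ROWS, CACHE_ROWS = 100_000, 4_096
hot = [random.randrange(1_000) for _ in range(80_000)]        # 80% hot set
cold = [random.randrange(TABLE_ROWS) for _ in range(20_000)]  # 20% long tail
trace = hot + cold
random.shuffle(trace)

print(f"LRU hit rate: {lru_hit_rate(trace, CACHE_ROWS):.2f}")
```

Because the 1,000-row hot set fits comfortably in the 4,096-row cache, the hit rate lands well above half. The scratchpad equivalent of this trace is a full-latency HBM read on every lookup, since none of the indices exist at compile time.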
This is exactly why Google built SparseCore: a separate tiled dataflow processor with its own scratchpad (spMEM), designed specifically to handle the gather/scatter/reduce operations that the MXU + VMEM path cannot efficiently serve. SparseCore is, architecturally, an admission that software-managed memory has a boundary — and irregular workloads live on the other side of it.
The compiler is the memory controller
On a GPU, the memory controller is a hardware unit that decides, at runtime, which cache line to evict and which to keep. On a TPU, XLA is the memory controller. It makes the same decisions — what data to place where, when to prefetch, when to evict — but at compile time, statically, before the program runs.
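One concrete consequence of deciding at compile time: with the full access trace in hand, the "compiler" can run Belady's optimal replacement policy (evict the block reused furthest in the future), which a runtime hardware cache can only approximate with heuristics like LRU. The sketch below is illustrative, with a hand-picked toy trace:

```python
# Offline-optimal (Belady/MIN) eviction vs. runtime LRU on a known trace.
# Illustrative sketch; the trace and capacity are invented for the example.

from collections import OrderedDict

def belady_misses(trace, capacity):
    resident, misses = set(), 0
    for i, key in enumerate(trace):
        if key in resident:
            continue
        misses += 1
        if len(resident) == capacity:
            def next_use(k):
                try:
                    return trace.index(k, i + 1)
                except ValueError:
                    return float("inf")     # never reused: ideal victim
            # Evict the block whose next use is furthest in the future.
            resident.remove(max(resident, key=next_use))
        resident.add(key)
    return misses

def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses

trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
print("optimal:", belady_misses(trace, capacity=3))  # 5 misses
print("LRU:    ", lru_misses(trace, capacity=3))     # 6 misses
```

On this toy trace the offline-optimal schedule takes 5 misses to LRU's 6: knowing the future lets the planner keep `C` out and `D` in at exactly the right moments. That information advantage is what XLA exploits, and what a hardware cache fundamentally lacks.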
This has profound implications for what workloads TPUs can serve efficiently:
| Workload Property | GPU (Hardware Cache) | TPU (Software Scratchpad) |
|---|---|---|
| Static, predictable access | Good — cache learns the pattern | Excellent — compiler schedules perfectly |
| Dynamic shapes / control flow | Good — cache adapts at runtime | Poor — compiler must pad/speculate |
| Sparse, data-dependent access | Moderate — L2 absorbs hot set | Poor — requires SparseCore bypass |
| Latency determinism | Low — cache misses cause variance | High — no cache misses possible |
| Energy efficiency | Moderate — cache tag checks cost power | High — no tag arrays, no coherence |
The energy argument
Hardware caches are not free. Tag arrays, comparators, eviction logic, and coherence protocols consume silicon area and power. On a modern GPU, the L2 cache alone can consume 5–10% of the chip's power budget. By eliminating hardware caches entirely, the TPU reclaims that power budget for compute and memory bandwidth — a direct efficiency advantage for workloads that don't need the cache's flexibility.
This is a significant part of the 67% energy-efficiency improvement of Trillium (v6e) over v5e. The gain comes not just from better transistors but from a fundamentally leaner memory interface that does not pay the overhead of hardware cache management.
Where this leads
The software-managed memory model is a bet on compiler technology. As XLA improves — better tiling strategies, better prefetch scheduling, better handling of dynamic shapes — the range of workloads that TPUs handle efficiently expands. But the fundamental constraint remains: the compiler operates on information available at compile time. Runtime-dependent access patterns will always require either hardware caching (the GPU approach) or purpose-built irregular-access processors (the SparseCore approach).
The interesting question is not which approach is "better" — it is which approach is better suited to the workloads that matter most. For dense transformer training and inference, the TPU's software-managed model is demonstrably more efficient. For general-purpose ML serving with dynamic batching, variable-length inputs, and diverse model architectures, the GPU's hardware-managed model remains more robust.
GPUs hide the memory system behind hardware caches. TPUs expose it to the compiler. The tradeoff is determinism vs. generality — and it explains more about the TPU's strengths and limitations than any FLOPS number ever will.