TPU Architecture · Memory Systems · Compiler Design

Software-Managed Memory Is the TPU's Real Advantage

Published · 7 min read

No hardware caches. No L1/L2. The TPU exposes raw scratchpad SRAM to the XLA compiler — and that changes everything about how the memory system behaves under load.

By Manish K · Technical Essay
Abstract

The defining architectural difference between TPUs and GPUs is not the systolic array — it is the memory management model. GPUs rely on hardware-managed L1/L2 caches: the programmer writes kernels, the hardware decides what gets cached. TPUs eliminate hardware caches entirely, exposing scratchpad SRAM (VMEM, CMEM) that the XLA compiler must explicitly schedule via DMA at compile time. This is closer to embedded-systems programming than GPU programming. When the workload is predictable (dense matrix multiplication, transformer attention), the compiler produces near-optimal memory schedules with deterministic latency and zero cache-miss penalties. When the workload is unpredictable (irregular embeddings, dynamic shapes, sparse access), the compiler cannot schedule what it cannot predict — and Google had to build an entirely separate processor (SparseCore) to handle it. This essay examines why software-managed memory is both the TPU's greatest strength and its most constraining abstraction.

- 0 hardware cache levels on a TPU (no L1/L2)
- VMEM: per-TensorCore scratchpad SRAM for hot data
- CMEM: shared on-chip SRAM for cross-core data exchange
- XLA: compiler responsible for all data placement decisions

Two philosophies of memory management

GPU and TPU architects face the same fundamental problem: the gap between compute speed and memory speed. Both need to keep arithmetic units fed with data. But they solve this problem in opposite ways.

[Figure 1 diagram. GPU (hardware-managed; the runtime decides data placement): Registers (per-thread, fastest) → Shared Memory/L1 (per-SM, configurable) → L2 (chip-wide, hardware-managed) → HBM. TPU (software-managed; the compiler decides placement at compile time): VMEM (per-TensorCore scratchpad SRAM) → CMEM (shared on-chip SRAM, cross-core) → HBM, with no L1 or L2.]
Figure 1. GPU memory is hardware-managed with multiple cache levels. TPU memory is software-managed with explicit scratchpad SRAM and no hardware caches. The tradeoff: determinism and efficiency vs. flexibility and programmer friendliness.

What VMEM and CMEM actually are

On a GPU, "shared memory" is a programmer-visible scratchpad, but L1/L2 caches are hardware-managed and transparent. On a TPU, everything is a scratchpad. There is no transparent cache — every byte's location is determined by the compiler.

The embedded-systems parallel: TPU memory management is architecturally closer to a DMA-driven DSP or microcontroller (think TI C6000 or ARM Cortex-M with tightly-coupled memory) than to a GPU. The compiler generates a schedule of DMA transfers that runs concurrently with computation — prefetching the next tile into VMEM while the current tile is being processed. This double-buffering pattern is the same technique used in real-time audio processing and radar signal processing.
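The double-buffering pattern can be sketched as a small simulation. This is an illustrative model, not XLA's actual emitted schedule: `dma_fetch` and `process` are hypothetical stand-ins for an async DMA copy and the per-tile compute, and the two alternating buffers play the role of VMEM.

```python
def process(tile):
    """Stand-in for the compute performed on the current VMEM tile."""
    return sum(tile)

def dma_fetch(hbm, i):
    """Stand-in for a DMA copy of tile i from HBM into VMEM."""
    return list(hbm[i])

def double_buffered_sum(hbm_tiles):
    """Process tiles while the next one is 'in flight'.

    Two VMEM buffers alternate roles: one holds the tile being
    computed on, the other receives the next DMA transfer.
    """
    if not hbm_tiles:
        return 0
    buffers = [dma_fetch(hbm_tiles, 0), None]  # prefetch tile 0
    total = 0
    for i in range(len(hbm_tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(hbm_tiles):
            # On real hardware this DMA overlaps with the compute below.
            buffers[nxt] = dma_fetch(hbm_tiles, i + 1)
        total += process(buffers[cur])
    return total

print(double_buffered_sum([[1, 2], [3, 4], [5, 6]]))  # 21
```

In Python the fetch and compute run sequentially; on the TPU the whole point is that the DMA engine and the compute unit run the two halves of each iteration concurrently.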

Why this works beautifully for matrix multiplication

Dense matrix multiplication — the operation that dominates transformer training and inference — has perfectly predictable access patterns. The compiler knows, at compile time, exactly which tiles of weight matrices and activation matrices will be needed, in what order, and at what time. This makes it an ideal target for software-managed memory.
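To make "knowable at compile time" concrete: for a blocked matmul with fixed shapes, the entire tile access sequence can be enumerated before any data exists. The function below is a hypothetical sketch of that idea (the tile sizes and naming are illustrative, not XLA's internal representation).

```python
def matmul_tile_schedule(M, N, K, tm, tn, tk):
    """Enumerate, purely from static shapes, every (A-tile, B-tile,
    C-tile) triple a blocked C = A @ B will touch, in order.

    Each entry names a tile by its (row, col) offset — exactly the
    information a compiler needs to emit a DMA prefetch schedule.
    """
    schedule = []
    for i in range(0, M, tm):
        for j in range(0, N, tn):
            for k in range(0, K, tk):
                schedule.append((("A", i, k), ("B", k, j), ("C", i, j)))
    return schedule

sched = matmul_tile_schedule(M=256, N=256, K=256, tm=128, tn=128, tk=128)
print(len(sched))   # 8 tile steps, fully known before the program runs
print(sched[0])     # (('A', 0, 0), ('B', 0, 0), ('C', 0, 0))
```

Because this sequence depends only on shapes, the compiler can schedule every DMA transfer ahead of time: there is nothing left for a runtime cache to discover.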

Why this breaks for irregular workloads

The software-managed model has a critical assumption: access patterns must be knowable at compile time. When they are not — when the access pattern depends on runtime data — the compiler cannot generate an optimal DMA schedule. It must fall back to conservative strategies that dramatically reduce efficiency.

Embedding lookups: the anti-pattern

Recommendation models use massive embedding tables — often hundreds of gigabytes — where each training example accesses a sparse, data-dependent subset of the table. The indices are unknown until the input arrives. The compiler cannot prefetch what it cannot predict. On a GPU, the hardware cache absorbs some of this unpredictability — frequently accessed embeddings naturally stay in L2. On a TPU, with no hardware cache, every embedding access is a full-latency HBM read.
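The difference can be illustrated with a toy miss-count model. This is a simplified sketch, not a model of any real GPU's cache policy: the LRU cache stands in for the hardware-managed L2, the no-cache path for the TPU's direct HBM reads, and the index stream is invented to be skewed the way recommendation traffic typically is.

```python
from collections import OrderedDict

def hbm_reads_no_cache(indices):
    """TPU-style path: no hardware cache, every lookup is a full HBM read."""
    return len(indices)

def hbm_reads_with_lru(indices, capacity):
    """GPU-style path: an LRU cache absorbs repeated 'hot' embedding rows."""
    cache, reads = OrderedDict(), 0
    for ix in indices:
        if ix in cache:
            cache.move_to_end(ix)          # hit: refresh recency
        else:
            reads += 1                     # miss: pay an HBM read
            cache[ix] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least-recently-used row
    return reads

# A skewed stream: a few hot embedding ids dominate the accesses.
stream = [0, 1, 0, 2, 0, 1, 3, 0, 1, 4]
print(hbm_reads_no_cache(stream))     # 10
print(hbm_reads_with_lru(stream, 4))  # 5 (hot rows 0 and 1 stay resident)
```

Neither side can be scheduled at compile time — the stream is runtime data — which is exactly the regime where a hardware cache earns its cost.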

This is exactly why Google built SparseCore: a separate tiled dataflow processor with its own scratchpad (spMEM), designed specifically to handle the gather/scatter/reduce operations that the MXU + VMEM path cannot efficiently serve. SparseCore is, architecturally, an admission that software-managed memory has a boundary — and irregular workloads live on the other side of it.

SparseCore exists because VMEM doesn't work for everything. It is a purpose-built workaround for the exact class of memory access patterns that a software-managed scratchpad cannot handle: sparse, data-dependent, runtime-determined indexing. Its existence is the clearest evidence of the software-managed model's limitation.
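The gather/scatter/reduce pattern SparseCore targets is essentially an "embedding bag": gather a data-dependent, variable number of rows per example, then reduce them. A minimal sketch in plain Python (the table, indices, and function name are illustrative, not SparseCore's interface):

```python
def embedding_bag(table, indices_per_example):
    """Gather runtime-determined rows and sum-reduce them per example."""
    out = []
    for ids in indices_per_example:
        rows = [table[i] for i in ids]                # gather: runtime indices
        out.append([sum(col) for col in zip(*rows)])  # reduce: per-dim sum
    return out

table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # 3 embedding rows, dim 2
batch = [[0, 2], [1], [0, 1, 2]]              # ragged, data-dependent ids
print(embedding_bag(table, batch))            # [[3.0, 2.0], [0.0, 1.0], [3.0, 3.0]]
```

Note the two properties that defeat a static DMA schedule: the row indices come from the input, and the per-example fan-in is ragged, so neither the addresses nor the loop trip counts exist at compile time.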

The compiler is the memory controller

On a GPU, the memory controller is a hardware unit that decides, at runtime, which cache line to evict and which to keep. On a TPU, XLA is the memory controller. It makes the same decisions — what data to place where, when to prefetch, when to evict — but at compile time, statically, before the program runs.
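Making decisions at compile time is not only a constraint — it is also an advantage a hardware cache cannot match: with the full access sequence in hand, the scheduler can compute optimal (Belady) evictions, choosing the block whose next use is farthest in the future. A runtime cache never sees the future. The sketch below is a textbook Belady simulation, not XLA's actual allocator:

```python
def belady_misses(accesses, capacity):
    """Count misses for a scratchpad of `capacity` slots when each
    eviction removes the resident block reused farthest away (or never).
    Requires the full access sequence up front — i.e., compile time."""
    resident, misses = set(), 0
    for t, blk in enumerate(accesses):
        if blk in resident:
            continue
        misses += 1
        if len(resident) >= capacity:
            def next_use(b):
                # Distance to b's next access; inf if never used again.
                for u in range(t + 1, len(accesses)):
                    if accesses[u] == b:
                        return u
                return float("inf")
            resident.remove(max(resident, key=next_use))
        resident.add(blk)
    return misses

# Alternating weight/activation tiles competing for 2 scratchpad slots.
seq = ["w0", "a0", "w1", "a0", "w0", "a1", "w1", "a1"]
print(belady_misses(seq, 2))  # 5 (an LRU cache takes 6 on this sequence)
```

This is the sense in which XLA "is" the memory controller: it runs the eviction policy once, offline, with perfect information, instead of approximating it in hardware on every access.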

This has profound implications for what workloads TPUs can serve efficiently:

| Workload property | GPU (hardware cache) | TPU (software scratchpad) |
| --- | --- | --- |
| Static, predictable access | Good — cache learns the pattern | Excellent — compiler schedules perfectly |
| Dynamic shapes / control flow | Good — cache adapts at runtime | Poor — compiler must pad/speculate |
| Sparse, data-dependent access | Moderate — L2 absorbs hot set | Poor — requires SparseCore bypass |
| Latency determinism | Low — cache misses cause variance | High — no cache misses possible |
| Energy efficiency | Moderate — cache tag checks cost power | High — no tag arrays, no coherence |

The energy argument

Hardware caches are not free. Tag arrays, comparators, eviction logic, and coherence protocols consume silicon area and power. On a modern GPU, the L2 cache alone can consume 5–10% of the chip's power budget. By eliminating hardware caches entirely, the TPU reclaims that power budget for compute and memory bandwidth — a direct efficiency advantage for workloads that don't need the cache's flexibility.

This is a significant part of Trillium (v6e) achieving its 67% energy-efficiency improvement over v5e. It is not just better transistors — it is a fundamentally leaner memory interface that does not pay the overhead of hardware cache management.

Where this leads

The software-managed memory model is a bet on compiler technology. As XLA improves — better tiling strategies, better prefetch scheduling, better handling of dynamic shapes — the range of workloads that TPUs handle efficiently expands. But the fundamental constraint remains: the compiler operates on information available at compile time. Runtime-dependent access patterns will always require either hardware caching (the GPU approach) or purpose-built irregular-access processors (the SparseCore approach).

The interesting question is not which approach is "better" — it is which approach is better suited to the workloads that matter most. For dense transformer training and inference, the TPU's software-managed model is demonstrably more efficient. For general-purpose ML serving with dynamic batching, variable-length inputs, and diverse model architectures, the GPU's hardware-managed model remains more robust.

GPUs hide the memory system behind hardware caches. TPUs expose it to the compiler. The tradeoff is determinism vs. generality — and it explains more about the TPU's strengths and limitations than any FLOPS number ever will.