Every AI processor debate eventually circles back to the same misdiagnosis: we need more compute. The actual constraint is almost never the arithmetic. It is the speed at which memory can feed data to the cores that are already sitting idle, waiting.
Strip away the acronyms, the benchmark sheets, and the marketing slides, and any AI processor — GPU, TPU, ASIC, wafer-scale chip — does exactly three things:

1. It performs arithmetic on numbers.
2. It moves data from memory to the arithmetic units.
3. It moves results back to memory.
Understanding AI processor performance is mostly the exercise of finding which of these three steps is the slowest — the binding constraint — and understanding why. In traditional computing, the bottleneck was often calculation: CPUs simply couldn't do arithmetic fast enough. That era ended for AI workloads when tensor cores arrived.
Today, for AI inference, the bottleneck is almost never step one. It is almost always a combination of steps two and three. The question is not how fast the processor can calculate. It is how fast memory can supply the data that calculations need.
The first principle of AI inference performance: computation is cheap. Data movement is expensive. Any architecture that treats them as equally important misallocates its silicon.
One of the most frequently observed facts about GPU inference is that utilisation during autoregressive decode is surprisingly low. On a well-provisioned H200 serving single requests, tensor core utilisation during decode can sit at 10–25%. On a well-batched system at batch 64, it climbs to 40–60%. Even at a best-case batch of 256, tensor cores spend significant time idle.
This is often treated as a software problem, or a batching problem, or a scheduling inefficiency. It is none of those. It is a diagnostic — a signal that the arithmetic units have completed their work and are waiting for memory to deliver the next set of numbers. The GPU is not slow. The memory is slow. The GPU finishes its matrix multiply and then waits, drawing power, producing no tokens, while HBM transfers the next set of weights across the bus.
To see why, consider the arithmetic intensity of a single decode step. For a 70B model at FP8 precision, generating one token means streaming roughly 70 GB of weights across a 3.35 TB/s HBM bus, which takes about 21 milliseconds, while the matrix-vector arithmetic those weights feed, roughly 140 GFLOPs, completes in about 0.12 milliseconds.
The GPU is not broken. It is arithmetically correct — it finishes its assigned work in 0.12 milliseconds and then has nothing to do for the next 21 milliseconds. The bottleneck is not the cores. It is the pipe that feeds them.
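The 0.12 ms and 21 ms figures above fall out of a few lines of arithmetic. A minimal sketch, assuming ~1.2 PFLOP/s of sustained FP8 throughput (an assumed effective figure; datasheet peaks are higher) and HBM3e's 3.35 TB/s:

```python
# Back-of-envelope timing for one decode step of a 70B FP8 model.
PARAMS = 70e9             # parameters
WEIGHT_BYTES = PARAMS     # FP8: 1 byte per parameter -> 70 GB
HBM_BW = 3.35e12          # HBM3e bandwidth, bytes/s
SUSTAINED_FLOPS = 1.2e15  # assumed sustained FP8 throughput, FLOP/s

# Memory: every weight streams across the bus once per token.
mem_ms = WEIGHT_BYTES / HBM_BW * 1e3

# Compute: a matrix-vector multiply does ~2 FLOPs per weight.
compute_ms = 2 * PARAMS / SUSTAINED_FLOPS * 1e3

print(f"memory  : {mem_ms:5.1f} ms")     # ~21 ms
print(f"compute : {compute_ms:5.2f} ms")  # ~0.12 ms
print(f"cores idle ~{1 - compute_ms / mem_ms:.1%} of the step")
```

Changing the assumed FLOP/s barely moves the result: even an infinitely fast core would still wait ~21 ms per token for the weights.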
There are not many types of memory — there are fundamentally two, built from different physics with opposite tradeoffs: SRAM, which holds each bit in a latch of six transistors, and DRAM, which holds each bit as charge on a single capacitor that must be periodically refreshed. Everything else in the memory hierarchy is a variant or composition of these two.
The fundamental tradeoff is inescapable: SRAM is fast but small and expensive per bit; DRAM is large and cheap per bit but slow. No amount of engineering eliminates this tradeoff — it is baked into the physics of the two cell structures. The question for AI hardware designers is not which one to use. It is which one to use for which data, at what scale, and how to manage the boundary between them.
The GPU was built for graphics rendering, and graphics has a specific computational profile that made DRAM a reasonable choice. In a render pipeline, you load a large scene — geometry, textures, lighting data — from memory, and then run a long, compute-intensive shader calculation on it. The ratio of computation to data movement is high. The memory is loaded once; the compute runs for a long time. DRAM's latency is a small fraction of the total render time. It barely matters.
AI inference has precisely the opposite profile. The compute — a single matrix-vector multiply per layer — is tiny. The data movement — reading all weight matrices for every token generated — is enormous and happens repeatedly, step after step. There is no long compute phase to amortise the slow data fetch. The entire workload is structured around returning to memory after every token.
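The mismatch can be quantified as arithmetic intensity: FLOPs of useful work per byte fetched from memory. A sketch, using a datasheet-class dense FP8 peak (~2 PFLOP/s, an assumed figure) against HBM3e bandwidth:

```python
# Arithmetic intensity: FLOPs performed per byte fetched from memory.
# Decode does ~2 FLOPs per FP8 weight byte; the hardware can absorb far more.
PEAK_FLOPS = 1.98e15   # assumed dense FP8 peak, FLOP/s
HBM_BW = 3.35e12       # HBM3e, bytes/s

machine_balance = PEAK_FLOPS / HBM_BW  # FLOPs the chip can do per delivered byte
decode_intensity = 2.0                 # matrix-vector: 2 FLOPs per weight byte

print(f"machine balance : ~{machine_balance:.0f} FLOPs/byte")
print(f"decode intensity: ~{decode_intensity:.0f} FLOPs/byte")
```

The workload supplies roughly 2 FLOPs of work per byte; the chip could absorb hundreds. That gap, not any software inefficiency, is why decode is memory-bound.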
It is worth making the scale of the memory movement concrete — because the numbers are large enough to be surprising even when you know they are coming.
To generate a single output token from a 70B model at FP8 precision, the following data must travel from HBM to the compute cores:
| Data Object | Size (FP8) | What it is | How often read |
|---|---|---|---|
| Query, Key, Value weight matrices | ~20.5 GB | Attention projections across all 80 layers | Every token |
| Feed-forward network weights | ~47.2 GB | MLP up/down/gate projections, all layers | Every token |
| Output projection weights | ~2.1 GB | Final linear layer + layer norms | Every token |
| KV cache (2K context) | ~2.5 GB | Cached keys and values for prior tokens | Every token |
| Total per token | ~72 GB | Across the memory bus, for every single token generated | |
72 gigabytes. Per token. On every decode step. This data does not change between tokens — the weights are fixed for the duration of inference. Yet because the DRAM/HBM interface has no concept of persistence, no ability to say "keep this data warm because I will need it again in 21 milliseconds," the full parade crosses the bus again and again.
A useful intuition for scale: 72 GB is roughly fifteen feature-length HD films at typical streaming quality (~5 GB each). The GPU reads a shelf of movies' worth of data for every word in its response — and then reads it all again for the next word.
The structural problem is not that the data is large. It is that the data is unchanging and must be re-fetched anyway, because the HBM interface was designed for a workload where data is loaded once, not a workload where the same data is loaded serially, indefinitely, token after token.
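One consequence worth computing: the per-token traffic in the table puts a hard ceiling on single-stream decode speed, regardless of how much compute the chip has.

```python
# Single-stream decode speed is bounded by bandwidth / bytes-per-token.
BYTES_PER_TOKEN = 72e9   # weights + KV cache, from the table above
HBM_BW = 3.35e12         # HBM3e, bytes/s

ceiling_tok_s = HBM_BW / BYTES_PER_TOKEN
print(f"bandwidth-bound ceiling: ~{ceiling_tok_s:.0f} tokens/s")  # ~47 tokens/s
```

No kernel optimisation can push a single 70B FP8 stream past this ceiling on this memory system; only batching (amortising weight reads), quantisation (shrinking the bytes), or faster memory moves it.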
If the bottleneck is the speed of data movement from memory to compute, and SRAM moves data 200× faster than HBM, the natural conclusion is: put the weights in SRAM. If the weights are already on-chip, in SRAM adjacent to the compute cores, the 21ms memory wait per decode step shrinks to microseconds. The GPU-waiting-for-HBM problem disappears.
This is exactly correct as physics. The problem is density. SRAM stores approximately 1–4 MB per square millimetre of silicon at modern process nodes. A 70B FP8 model weighs 70 GB. To store 70 GB in SRAM, you need approximately 17,500–70,000 mm² of silicon — purely for memory.
The largest conventional chips today are around 800–1,000 mm². A wafer-scale chip — the largest thing that can be physically manufactured as a single die — is approximately 46,000 mm² (Cerebras WSE-3). So storing a 70B model entirely in SRAM requires, at minimum, a Cerebras-scale wafer — and that leaves essentially no room for the compute cores themselves.
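The area arithmetic above can be sketched directly, using the density range and die sizes quoted in the text:

```python
# Silicon area needed to hold a 70B FP8 model (70 GB) entirely in SRAM.
MODEL_MB = 70 * 1024    # 70 GB in MB
RETICLE_MM2 = 850       # ~largest conventional die
WAFER_MM2 = 215 * 215   # Cerebras WSE-3 footprint, mm2

for density in (1, 4):  # MB/mm2, modern-node SRAM density range
    area = MODEL_MB / density
    print(f"{density} MB/mm2: {area:,.0f} mm2 "
          f"(~{area / RETICLE_MM2:.0f} reticle dies, "
          f"~{area / WAFER_MM2:.1f} wafers)")
```

Even at best-case density, the weights alone consume roughly twenty reticle-sized dies' worth of silicon; at worst case, more than a full wafer.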
| Memory Type | Bandwidth (to compute) | Density (MB/mm²) | Cost per GB | Implication for 70B model |
|---|---|---|---|---|
| On-chip SRAM | 10–50 TB/s | 1–4 MB/mm² | $5,000–$20,000 | Would need ~46,000 mm² — wafer-scale chip just for weights |
| HBM3e (stacked DRAM) | 3.35 TB/s | ~4,000 MB/mm² (stacked) | $10–15 | Fits in six 24 GB stacks (141 GB, H200). Fast enough for training; slow for decode. |
| DDR5 / LPDDR5 (system DRAM) | ~0.1 TB/s | ~8,000 MB/mm² (packaged) | $3–6 | Cheap and dense. 33× slower than HBM. Unsuitable as primary inference memory. |
| NVMe flash (SSD) | ~12–28 GB/s | Very high | $0.08–0.15 | Orders of magnitude too slow for weight reads. Used for KV offload only. |
The SRAM solution is physically correct and economically brutal. The physics says SRAM wins. The economics says you cannot buy enough of it. This is the central tension that has shaped every high-speed inference architecture built in the last four years.
Given that you cannot fit a 70B model into the SRAM of a single conventional chip, there are two architectural strategies for capturing most of SRAM's bandwidth advantage while working within manufacturing reality.
Strategy A: slice the model. Assign each shard of the model weights to a separate chip — a chiplet, a tile, a processor element — each equipped with enough local SRAM to hold its assigned portion. With enough chips in the system, the total SRAM capacity covers the full model. Each chip operates at SRAM speeds locally.
This is the strategy behind inference processors built from large arrays of SRAM-equipped tiles. At the chip level, it works well — each tile reads from its local SRAM at full bandwidth. The cost is paid at the system level: the tiles must communicate. When layer N on tile 23 needs to send its activations to layer N+1 on tile 47, that data must cross a physical link — a serialised bus, a PCIe lane, a network switch. That crossing is significantly slower than staying on-chip.
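A rough sense of what those crossings cost per token, with assumed latencies (the 2 µs off-chip and 10 ns on-die figures are illustrative, not measured), if every layer boundary is a hop:

```python
# Per-token serial latency from activations hopping between chips (Strategy A)
# versus staying on-die. Latency figures are illustrative assumptions.
LAYER_BOUNDARIES = 79        # an 80-layer model crosses 79 boundaries
CHIP_LINK_LATENCY_S = 2e-6   # assumed per-hop latency over a serialised link
ON_DIE_LATENCY_S = 10e-9     # assumed per-hop latency on an on-chip network

def per_token_hop_ms(latency_s, boundaries=LAYER_BOUNDARIES):
    """Serial latency a token pays if every layer boundary is a hop."""
    return boundaries * latency_s * 1e3

print(f"every boundary off-chip: {per_token_hop_ms(CHIP_LINK_LATENCY_S):.3f} ms/token")
print(f"every boundary on-die  : {per_token_hop_ms(ON_DIE_LATENCY_S):.5f} ms/token")
```

At these assumed figures, the off-chip hops alone cost more wall-clock time than the ~0.12 ms of arithmetic per token — the interconnect, not the compute, becomes the serial path.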
Strategy B: build a bigger chip. Instead of many small chips communicating over slow links, build a single chip large enough to contain both the compute and a large fraction of the weight data in on-chip SRAM. Communication that would have crossed a slow chip-to-chip link now stays on-die — on a fast, low-latency, high-bandwidth on-chip network.
The constraint is manufacturing yield: a larger chip has a higher probability of containing a defect. Wafer-scale manufacturing addresses this through redundancy — building in spare tiles that can substitute for defective ones. The result is a chip the size of the entire wafer (approximately 215 mm × 215 mm, the largest square that can be cut from a standard 300 mm wafer), with tens of thousands of compute tiles each equipped with local SRAM and connected by an on-die fabric.
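The redundancy argument can be made quantitative with a toy Poisson defect model; the defect density here is an assumed, illustrative figure, not a foundry number.

```python
import math

# Toy Poisson yield model: probability a monolithic die has zero defects.
DEFECTS_PER_MM2 = 0.001  # assumed: 0.1 defects per cm2

def die_yield(area_mm2):
    """Chance a die of this area escapes all defects (no redundancy)."""
    return math.exp(-DEFECTS_PER_MM2 * area_mm2)

print(f"850 mm2 die   : {die_yield(850):.0%} yield")    # conventional-die scale
print(f"46,225 mm2 die: {die_yield(46225):.2e} yield")  # wafer-scale, no spares

# With spare tiles, a defect kills one tiny tile instead of the whole wafer;
# the spares only need to absorb the expected number of defects.
expected_defects = DEFECTS_PER_MM2 * 46225
print(f"expected defective tiles per wafer: ~{expected_defects:.0f}")
```

Without redundancy, a defect-free wafer-scale die is essentially impossible; with a few dozen spare tiles, nearly every wafer ships.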
The fundamental reason Strategy B (bigger chip) has an advantage over Strategy A (many chips) is the cost of crossing a chip boundary. This is not a software problem or a protocol overhead problem. It is physics.
On-chip wires operate at full digital speed — a signal crossing the die in a few nanoseconds, at picojoules per bit. They are short, well-controlled impedance paths with minimal energy loss. On-chip bandwidth can reach tens of terabytes per second because there are many such wires running in parallel.
Off-chip signals — the wires that leave the chip, run through a package, cross a PCB trace, or travel through an optical or copper cable — face a completely different physical environment. They must drive higher-impedance loads. They need larger voltage swings to survive the journey. They pick up noise. They require termination. Each bit costs more energy and takes more time.
| Communication path | Bandwidth (typical) | Energy (pJ/bit) | Latency | Relative cost vs. on-chip |
|---|---|---|---|---|
| On-chip SRAM → compute core | 10–50 TB/s | ~0.01–0.1 | 0.5–2 ns | Baseline 1× |
| Die-to-die (chiplet, UCIe) | 1–5 TB/s | ~0.5–2 | 5–20 ns | ~10–20× more energy, ~10× slower |
| NVLink 4.0 (GPU-to-GPU) | 0.9 TB/s bidir | ~2–5 | ~1 µs | ~50× more energy, ~500× slower latency |
| PCIe Gen5 x16 | ~128 GB/s | ~10–20 | ~1–5 µs | ~200× more energy, ~1000× slower |
| 400G Ethernet (rack-to-rack) | ~42 GB/s eff. | ~50–200 | ~10–100 µs | ~2000× more energy, ~10,000× slower |
Each hop away from the compute core — from on-chip to die-to-die to PCIe to network — costs roughly an order of magnitude in bandwidth and an order of magnitude in energy per bit. The on-chip → HBM hop alone is a 200× bandwidth penalty. This is not a gap that better protocols or faster cables can close. It is the consequence of the physics of signal propagation at scale.
The interconnect tax compounds. A system that divides a model across hundreds of small chips, each with fast local SRAM, can gain SRAM-speed access to local data — but must pay the interconnect tax every time an activation crosses a chip boundary. Depending on how the model is partitioned and how frequently cross-chip communication is required, the interconnect tax can consume a significant fraction of the SRAM bandwidth advantage.
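The energy side of the tax is easy to tabulate, using mid-range pJ/bit values from the table above and an assumed activation size (hidden dimension 8192 at FP8):

```python
# Energy to move one activation vector across each interconnect tier.
# pJ/bit values are mid-range picks from the table; sizes are illustrative.
ACT_BYTES = 8192  # one activation vector: hidden dim 8192 at FP8
PJ_PER_BIT = {
    "on-chip":    0.05,
    "die-to-die": 1.0,
    "NVLink":     3.0,
    "PCIe":       15.0,
    "Ethernet":   100.0,
}
for tier, pj in PJ_PER_BIT.items():
    nj = ACT_BYTES * 8 * pj / 1e3  # picojoules -> nanojoules
    print(f"{tier:10s}: {nj:8.1f} nJ per crossing")
```

The same 8 KB vector costs three orders of magnitude more energy to cross a rack than to cross the die — and a sharded model pays this at every layer, every token.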
The wafer-scale argument is simple to state: if you cannot afford to cross chip boundaries without paying a large interconnect tax, build a chip large enough that the critical data flows stay on-chip. The bigger the chip, the more SRAM fits on it, and the less frequently data must leave it.
A wafer-scale chip is approximately 46,225 mm² (Cerebras WSE-3 at 215mm × 215mm). At 4 MB/mm² SRAM density (at a 5nm-class node), this is approximately 185 GB of potential SRAM — far more than the 70 GB needed for a 70B model, with ample room for compute. The entire model fits on-chip. Every weight read is an SRAM read. The HBM interface is not needed for inference at all.
The practical constraints are:

- Yield: addressed through redundant tiles, as described above.
- Power delivery and cooling: a wafer-sized die draws on the order of tens of kilowatts and must be fed and cooled across its entire area.
- Realised capacity: most of the wafer's area goes to compute rather than memory, so shipping parts hold tens of gigabytes of SRAM, and models beyond that still spill to external memory.
The GPU vs. SRAM-chip framing is useful as a diagnostic but misleading as an architectural prescription. In practice, every AI processor — including the GPU — contains both SRAM and DRAM in different proportions. The H100 has 50 MB of L2 cache (on-chip SRAM) and 80 GB of HBM. The H200 has 50 MB of L2 and 141 GB of HBM. Cerebras WSE-3 has ~44 GB of on-chip SRAM and no HBM at all (external DRAM for very large models). Groq LPU has ~230 MB of SRAM per chip, no HBM, and scales by connecting hundreds of chips.
The difference is not presence or absence of SRAM — it is the ratio, the position of SRAM in the serving path, and how much of the critical inference data can be kept in SRAM for how long.
The system designer's job is to place data in the tier appropriate to its hotness and access frequency. Weights that are read every token should live as close to the compute cores as their size permits. KV cache entries that have low reuse probability can be demoted. Activation tiles that are consumed and discarded within a layer can live in the smallest, fastest tier.
This is the insight behind software-managed scratchpad architectures (the TPU's approach), behind hardware wired-residency proposals, behind the MCOS framework, and behind every modern KV eviction and admission policy. They are all attempts to put data in the right tier based on its actual access pattern — because the physics of the tiers are so different that misplacement has severe consequences.
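All of these policies share a skeleton that fits in a few lines: sort data by hotness, fill the fastest tier first. A toy sketch with invented object sizes and tier capacities (the capacities loosely echo the H200 figures above; the policy is illustrative, not any specific system's):

```python
# Toy tier placement: hottest data claims the fastest tier that has room.
TIERS = [  # (name, capacity in GB), fastest first
    ("SRAM", 0.05),
    ("HBM", 141.0),
    ("DDR", 512.0),
]
OBJECTS = [  # (name, size in GB, reads per token) -- invented workload
    ("layer norms + small weights", 0.04, 1.0),
    ("attention + MLP weights", 69.8, 1.0),
    ("hot KV cache", 2.5, 1.0),
    ("cold KV cache (idle sessions)", 200.0, 0.01),
]

def place(objects, tiers):
    free = {name: cap for name, cap in tiers}
    placement = {}
    # Hottest data first, so it claims the fastest remaining tier.
    for name, size, heat in sorted(objects, key=lambda o: -o[2]):
        for tier, _ in tiers:
            if free[tier] >= size:
                free[tier] -= size
                placement[name] = tier
                break
    return placement

print(place(OBJECTS, TIERS))
```

Real systems refine every piece of this — hotness is predicted rather than known, evictions are incremental, tiers have bandwidth as well as capacity limits — but the objective is the same: per-token reads as close to the cores as they fit, cold data demoted.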
The compute benchmark is not the right question for an AI processor. TFLOP/s per dollar, TOPS/W, matrix multiply throughput — these numbers are largely irrelevant in a regime where the compute cores are idle 99% of the time waiting for memory.
The right questions are:

- How many bytes must cross the memory bus for every generated token?
- How fast can the slowest tier in the serving path stream the model's weights?
- How much of the hot data (weights, KV cache) fits in the tier closest to the compute cores, and for how long can it stay there?
- What interconnect tax is paid each time an activation leaves the chip?
Every architectural choice in AI inference — HBM vs. SRAM, monolithic vs. chiplet, wafer-scale vs. conventional die, software scratchpad vs. hardware cache — is ultimately a negotiation with the same underlying physics. The core is fast and cheap. The memory is slow and expensive per unit of bandwidth. The chip boundary is a tax that compounds with distance.
The processor that wins inference is not the one with the most arithmetic power. It is the one that has found the best answer to a much older, much more physical question: how do you get the right data, to the right place, at the right time, without paying more than the workload is worth?
Inference speed is a memory problem. It has always been a memory problem. The GPU was an extraordinary answer to a different question — and an expensive mismatch for this one. Understanding that mismatch is the beginning of understanding why AI hardware is being rebuilt from the physics up.