Every AI processor debate eventually circles back to the same misdiagnosis: we need more compute. The actual constraint is almost never the arithmetic. It is the speed at which memory can feed data to the cores that are already sitting idle, waiting.
Strip away the acronyms, the benchmark sheets, and the marketing slides, and any AI processor — GPU, TPU, ASIC, wafer-scale chip — does exactly three things:

1. It performs arithmetic on numbers.
2. It moves data from memory to the arithmetic units.
3. It moves results back to memory.
Understanding AI processor performance is mostly the exercise of finding which of these three steps is the slowest — the binding constraint — and understanding why. In traditional computing, the bottleneck was often calculation: CPUs simply couldn't do arithmetic fast enough. That era ended for AI workloads when tensor cores arrived.
Today, for AI inference, the bottleneck is almost never step one. It is almost always a combination of steps two and three. The question is not how fast the processor can calculate. It is how fast memory can supply the data that calculations need.
The first principle of AI inference performance: computation is cheap. Data movement is expensive. Any architecture that treats them as equally important misallocates its silicon.
One of the most frequently observed facts about GPU inference is that utilisation during autoregressive decode is surprisingly low. On a well-provisioned H200 serving single requests, tensor core utilisation during decode can sit at 10–25%. On a well-batched system at batch 64, it climbs to 40–60%. Even at a best-case batch of 256, tensor cores spend significant time idle.
This is often treated as a software problem, or a batching problem, or a scheduling inefficiency. It is none of those. It is a diagnostic — a signal that the arithmetic units have completed their work and are waiting for memory to deliver the next set of numbers. The GPU is not slow. The memory is slow. The GPU finishes its matrix multiply and then waits, drawing power, producing no tokens, while HBM transfers the next set of weights across the bus.
To see why, consider the arithmetic intensity of a single decode step. For a 70B model at FP8 precision, generating one token means streaming roughly 70 GB of weights across a 3.35 TB/s HBM bus, which takes about 21 milliseconds, while the matrix-vector arithmetic those weights feed, roughly 140 GFLOPs, completes in about 0.12 milliseconds.
The GPU is not broken. It is arithmetically correct — it finishes its assigned work in 0.12 milliseconds and then has nothing to do for the next 21 milliseconds. The bottleneck is not the cores. It is the pipe that feeds them.
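The 0.12 ms and 21 ms figures above fall out of a few lines of arithmetic. A minimal sketch, assuming ~1.2 PFLOP/s of sustained FP8 throughput (an assumed effective figure; datasheet peaks are higher) and HBM3e's 3.35 TB/s:

```python
# Back-of-envelope timing for one decode step of a 70B FP8 model.
PARAMS = 70e9             # parameters
WEIGHT_BYTES = PARAMS     # FP8: 1 byte per parameter -> 70 GB
HBM_BW = 3.35e12          # HBM3e bandwidth, bytes/s
SUSTAINED_FLOPS = 1.2e15  # assumed sustained FP8 throughput, FLOP/s

# Memory: every weight streams across the bus once per token.
mem_ms = WEIGHT_BYTES / HBM_BW * 1e3

# Compute: a matrix-vector multiply does ~2 FLOPs per weight.
compute_ms = 2 * PARAMS / SUSTAINED_FLOPS * 1e3

print(f"memory  : {mem_ms:5.1f} ms")     # ~21 ms
print(f"compute : {compute_ms:5.2f} ms")  # ~0.12 ms
print(f"cores idle ~{1 - compute_ms / mem_ms:.1%} of the step")
```

Changing the assumed FLOP/s barely moves the result: even an infinitely fast core would still wait ~21 ms per token for the weights.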
There are not many types of memory — there are fundamentally two, built from different physics with opposite tradeoffs: SRAM, which holds each bit in a latch of six transistors, and DRAM, which holds each bit as charge on a single capacitor that must be periodically refreshed. Everything else in the memory hierarchy is a variant or composition of these two.
The fundamental tradeoff is inescapable: SRAM is fast but small and expensive per bit; DRAM is large and cheap per bit but slow. No amount of engineering eliminates this tradeoff — it is baked into the physics of the two cell structures. The question for AI hardware designers is not which one to use. It is which one to use for which data, at what scale, and how to manage the boundary between them.
The GPU was built for graphics rendering, and graphics has a specific computational profile that made DRAM a reasonable choice. In a render pipeline, you load a large scene — geometry, textures, lighting data — from memory, and then run a long, compute-intensive shader calculation on it. The ratio of computation to data movement is high. The memory is loaded once; the compute runs for a long time. DRAM's latency is a small fraction of the total render time. It barely matters.
AI inference has precisely the opposite profile. The compute — a single matrix-vector multiply per layer — is tiny. The data movement — reading all weight matrices for every token generated — is enormous and happens repeatedly, step after step. There is no long compute phase to amortise the slow data fetch. The entire workload is structured around returning to memory after every token.
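The mismatch can be quantified as arithmetic intensity: FLOPs of useful work per byte fetched from memory. A sketch, using a datasheet-class dense FP8 peak (~2 PFLOP/s, an assumed figure) against HBM3e bandwidth:

```python
# Arithmetic intensity: FLOPs performed per byte fetched from memory.
# Decode does ~2 FLOPs per FP8 weight byte; the hardware can absorb far more.
PEAK_FLOPS = 1.98e15   # assumed dense FP8 peak, FLOP/s
HBM_BW = 3.35e12       # HBM3e, bytes/s

machine_balance = PEAK_FLOPS / HBM_BW  # FLOPs the chip can do per delivered byte
decode_intensity = 2.0                 # matrix-vector: 2 FLOPs per weight byte

print(f"machine balance : ~{machine_balance:.0f} FLOPs/byte")
print(f"decode intensity: ~{decode_intensity:.0f} FLOPs/byte")
```

The workload supplies roughly 2 FLOPs of work per byte; the chip could absorb hundreds. That gap, not any software inefficiency, is why decode is memory-bound.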
It is worth making the scale of the memory movement concrete — because the numbers are large enough to be surprising even when you know they are coming.
To generate a single output token from a 70B model at FP8 precision, the following data must travel from HBM to the compute cores:
| Data Object | Size (FP8) | What it is | How often read |
|---|---|---|---|
| Query, Key, Value weight matrices | ~20.5 GB | Attention projections across all 80 layers | Every token |
| Feed-forward network weights | ~47.2 GB | MLP up/down/gate projections, all layers | Every token |
| Output projection weights | ~2.1 GB | Final linear layer + layer norms | Every token |
| KV cache (2K context) | ~2.5 GB | Cached keys and values for prior tokens | Every token |
| Total per token | ~72 GB | Across the memory bus, for every single token generated | |
72 gigabytes. Per token. On every decode step. This data does not change between tokens — the weights are fixed for the duration of inference. Yet because the DRAM/HBM interface has no concept of persistence, no ability to say "keep this data warm because I will need it again in 21 milliseconds," the full parade crosses the bus again and again.
A useful intuition for scale: 72 GB is roughly fifteen feature-length HD films at typical streaming quality (~5 GB each). The GPU reads a shelf of movies' worth of data for every word in its response — and then reads it all again for the next word.
The structural problem is not that the data is large. It is that the data is unchanging and must be re-fetched anyway, because the HBM interface was designed for a workload where data is loaded once, not a workload where the same data is loaded serially, indefinitely, token after token.
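One consequence worth computing: the per-token traffic in the table puts a hard ceiling on single-stream decode speed, regardless of how much compute the chip has.

```python
# Single-stream decode speed is bounded by bandwidth / bytes-per-token.
BYTES_PER_TOKEN = 72e9   # weights + KV cache, from the table above
HBM_BW = 3.35e12         # HBM3e, bytes/s

ceiling_tok_s = HBM_BW / BYTES_PER_TOKEN
print(f"bandwidth-bound ceiling: ~{ceiling_tok_s:.0f} tokens/s")  # ~47 tokens/s
```

No kernel optimisation can push a single 70B FP8 stream past this ceiling on this memory system; only batching (amortising weight reads), quantisation (shrinking the bytes), or faster memory moves it.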
If the bottleneck is the speed of data movement from memory to compute, and SRAM moves data 200× faster than HBM, the natural conclusion is: put the weights in SRAM. If the weights are already on-chip, in SRAM adjacent to the compute cores, the 21ms memory wait per decode step shrinks to microseconds. The GPU-waiting-for-HBM problem disappears.
This is exactly correct as physics. The problem is density. SRAM stores approximately 1–4 MB per square millimetre of silicon at modern process nodes. A 70B FP8 model weighs 70 GB. To store 70 GB in SRAM, you need approximately 17,500–70,000 mm² of silicon — purely for memory.
The largest conventional chips today are around 800–1,000 mm². A wafer-scale chip — the largest thing that can be physically manufactured as a single die — is approximately 46,000 mm² (Cerebras WSE-3). So storing a 70B model entirely in SRAM requires, at minimum, a Cerebras-scale wafer — and that leaves essentially no room for the compute cores themselves.
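The area arithmetic above can be sketched directly, using the density range and die sizes quoted in the text:

```python
# Silicon area needed to hold a 70B FP8 model (70 GB) entirely in SRAM.
MODEL_MB = 70 * 1024    # 70 GB in MB
RETICLE_MM2 = 850       # ~largest conventional die
WAFER_MM2 = 215 * 215   # Cerebras WSE-3 footprint, mm2

for density in (1, 4):  # MB/mm2, modern-node SRAM density range
    area = MODEL_MB / density
    print(f"{density} MB/mm2: {area:,.0f} mm2 "
          f"(~{area / RETICLE_MM2:.0f} reticle dies, "
          f"~{area / WAFER_MM2:.1f} wafers)")
```

Even at best-case density, the weights alone consume roughly twenty reticle-sized dies' worth of silicon; at worst case, more than a full wafer.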
| Memory Type | Bandwidth (to compute) | Density (MB/mm²) | Cost per GB | Implication for 70B model |
|---|---|---|---|---|
| On-chip SRAM | 10–50 TB/s | 1–4 MB/mm² | $5,000–$20,000 | Would need ~46,000 mm² — wafer-scale chip just for weights |
| HBM3e (stacked DRAM) | 3.35 TB/s | ~4,000 MB/mm² (stacked) | $10–15 | Fits in six 24 GB stacks (141 GB, H200). Fast enough for training; slow for decode. |
| DDR5 / LPDDR5 (system DRAM) | ~0.1 TB/s | ~8,000 MB/mm² (packaged) | $3–6 | Cheap and dense. 33× slower than HBM. Unsuitable as primary inference memory. |
| NVMe flash (SSD) | ~12–28 GB/s | Very high | $0.08–0.15 | Orders of magnitude too slow for weight reads. Used for KV offload only. |
The SRAM solution is physically correct and economically brutal. The physics says SRAM wins. The economics says you cannot buy enough of it. This is the central tension that has shaped every high-speed inference architecture built in the last four years.
Given that you cannot fit a 70B model into the SRAM of a single conventional chip, there are two architectural strategies for capturing most of SRAM's bandwidth advantage while working within manufacturing reality.
Strategy A: slice the model. Assign each shard of the model weights to a separate chip — a chiplet, a tile, a processor element — each equipped with enough local SRAM to hold its assigned portion. With enough chips in the system, the total SRAM capacity covers the full model. Each chip operates at SRAM speeds locally.
This is the strategy behind inference processors built from large arrays of SRAM-equipped tiles. At the chip level, it works well — each tile reads from its local SRAM at full bandwidth. The cost is paid at the system level: the tiles must communicate. When layer N on tile 23 needs to send its activations to layer N+1 on tile 47, that data must cross a physical link — a serialised bus, a PCIe lane, a network switch. That crossing is significantly slower than staying on-chip.
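A rough sense of what those crossings cost per token, with assumed latencies (the 2 µs off-chip and 10 ns on-die figures are illustrative, not measured), if every layer boundary is a hop:

```python
# Per-token serial latency from activations hopping between chips (Strategy A)
# versus staying on-die. Latency figures are illustrative assumptions.
LAYER_BOUNDARIES = 79        # an 80-layer model crosses 79 boundaries
CHIP_LINK_LATENCY_S = 2e-6   # assumed per-hop latency over a serialised link
ON_DIE_LATENCY_S = 10e-9     # assumed per-hop latency on an on-chip network

def per_token_hop_ms(latency_s, boundaries=LAYER_BOUNDARIES):
    """Serial latency a token pays if every layer boundary is a hop."""
    return boundaries * latency_s * 1e3

print(f"every boundary off-chip: {per_token_hop_ms(CHIP_LINK_LATENCY_S):.3f} ms/token")
print(f"every boundary on-die  : {per_token_hop_ms(ON_DIE_LATENCY_S):.5f} ms/token")
```

At these assumed figures, the off-chip hops alone cost more wall-clock time than the ~0.12 ms of arithmetic per token — the interconnect, not the compute, becomes the serial path.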
Strategy B: build a bigger chip. Instead of many small chips communicating over slow links, build a single chip large enough to contain both the compute and a large fraction of the weight data in on-chip SRAM. Communication that would have crossed a slow chip-to-chip link now stays on-die — on a fast, low-latency, high-bandwidth on-chip network.
The constraint is manufacturing yield: a larger chip has a higher probability of containing a defect. Wafer-scale manufacturing addresses this through redundancy — building in spare tiles that can substitute for defective ones. The result is a chip the size of the entire wafer (approximately 215 mm × 215 mm, the largest square that can be cut from a standard 300 mm wafer), with tens of thousands of compute tiles each equipped with local SRAM and connected by an on-die fabric.
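The redundancy argument can be made quantitative with a toy Poisson defect model; the defect density here is an assumed, illustrative figure, not a foundry number.

```python
import math

# Toy Poisson yield model: probability a monolithic die has zero defects.
DEFECTS_PER_MM2 = 0.001  # assumed: 0.1 defects per cm2

def die_yield(area_mm2):
    """Chance a die of this area escapes all defects (no redundancy)."""
    return math.exp(-DEFECTS_PER_MM2 * area_mm2)

print(f"850 mm2 die   : {die_yield(850):.0%} yield")    # conventional-die scale
print(f"46,225 mm2 die: {die_yield(46225):.2e} yield")  # wafer-scale, no spares

# With spare tiles, a defect kills one tiny tile instead of the whole wafer;
# the spares only need to absorb the expected number of defects.
expected_defects = DEFECTS_PER_MM2 * 46225
print(f"expected defective tiles per wafer: ~{expected_defects:.0f}")
```

Without redundancy, a defect-free wafer-scale die is essentially impossible; with a few dozen spare tiles, nearly every wafer ships.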
The fundamental reason Strategy B (bigger chip) has an advantage over Strategy A (many chips) is the cost of crossing a chip boundary. This is not a software problem or a protocol overhead problem. It is physics.
On-chip wires operate at full digital speed — a signal crossing the die in a few nanoseconds, at picojoules per bit. They are short, well-controlled impedance paths with minimal energy loss. On-chip bandwidth can reach tens of terabytes per second because there are many such wires running in parallel.
Off-chip signals — the wires that leave the chip, run through a package, cross a PCB trace, or travel through an optical or copper cable — face a completely different physical environment. They must drive higher-impedance loads. They need larger voltage swings to survive the journey. They pick up noise. They require termination. Each bit costs more energy and takes more time.
| Communication path | Bandwidth (typical) | Energy (pJ/bit) | Latency | Relative cost vs. on-chip |
|---|---|---|---|---|
| On-chip SRAM → compute core | 10–50 TB/s | ~0.01–0.1 | 0.5–2 ns | Baseline 1× |
| Die-to-die (chiplet, UCIe) | 1–5 TB/s | ~0.5–2 | 5–20 ns | ~10–20× more energy, ~10× slower |
| NVLink 4.0 (GPU-to-GPU) | 0.9 TB/s bidir | ~2–5 | ~1 µs | ~50× more energy, ~500× slower latency |
| PCIe Gen5 x16 | ~128 GB/s | ~10–20 | ~1–5 µs | ~200× more energy, ~1000× slower |
| 400G Ethernet (rack-to-rack) | ~42 GB/s eff. | ~50–200 | ~10–100 µs | ~2000× more energy, ~10,000× slower |
Each hop away from the compute core — from on-chip to die-to-die to PCIe to network — costs roughly an order of magnitude in bandwidth and an order of magnitude in energy per bit. The on-chip → HBM hop alone is a 200× bandwidth penalty. This is not a gap that better protocols or faster cables can close. It is the consequence of the physics of signal propagation at scale.
The interconnect tax compounds. A system that divides a model across hundreds of small chips, each with fast local SRAM, can gain SRAM-speed access to local data — but must pay the interconnect tax every time an activation crosses a chip boundary. Depending on how the model is partitioned and how frequently cross-chip communication is required, the interconnect tax can consume a significant fraction of the SRAM bandwidth advantage.
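The energy side of the tax is easy to tabulate, using mid-range pJ/bit values from the table above and an assumed activation size (hidden dimension 8192 at FP8):

```python
# Energy to move one activation vector across each interconnect tier.
# pJ/bit values are mid-range picks from the table; sizes are illustrative.
ACT_BYTES = 8192  # one activation vector: hidden dim 8192 at FP8
PJ_PER_BIT = {
    "on-chip":    0.05,
    "die-to-die": 1.0,
    "NVLink":     3.0,
    "PCIe":       15.0,
    "Ethernet":   100.0,
}
for tier, pj in PJ_PER_BIT.items():
    nj = ACT_BYTES * 8 * pj / 1e3  # picojoules -> nanojoules
    print(f"{tier:10s}: {nj:8.1f} nJ per crossing")
```

The same 8 KB vector costs three orders of magnitude more energy to cross a rack than to cross the die — and a sharded model pays this at every layer, every token.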
The wafer-scale argument is simple to state: if you cannot afford to cross chip boundaries without paying a large interconnect tax, build a chip large enough that the critical data flows stay on-chip. The bigger the chip, the more SRAM fits on it, and the less frequently data must leave it.
A wafer-scale chip is approximately 46,225 mm² (Cerebras WSE-3 at 215mm × 215mm). At 4 MB/mm² SRAM density (at a 5nm-class node), this is approximately 185 GB of potential SRAM — far more than the 70 GB needed for a 70B model, with ample room for compute. The entire model fits on-chip. Every weight read is an SRAM read. The HBM interface is not needed for inference at all.
The practical constraints are:

- Yield: addressed through redundant tiles, as described above.
- Power delivery and cooling: a wafer-sized die draws on the order of tens of kilowatts and must be fed and cooled across its entire area.
- Realised capacity: most of the wafer's area goes to compute rather than memory, so shipping parts hold tens of gigabytes of SRAM, and models beyond that still spill to external memory.
The GPU vs. SRAM-chip framing is useful as a diagnostic but misleading as an architectural prescription. In practice, every AI processor — including the GPU — contains both SRAM and DRAM in different proportions. The H100 has 50 MB of L2 cache (on-chip SRAM) and 80 GB of HBM. The H200 has 50 MB of L2 and 141 GB of HBM. Cerebras WSE-3 has ~44 GB of on-chip SRAM and no HBM at all (external DRAM for very large models). Groq LPU has ~230 MB of SRAM per chip, no HBM, and scales by connecting hundreds of chips.
The difference is not presence or absence of SRAM — it is the ratio, the position of SRAM in the serving path, and how much of the critical inference data can be kept in SRAM for how long.
The system designer's job is to place data in the tier appropriate to its hotness and access frequency. Weights that are read every token should live as close to the compute cores as their size permits. KV cache entries that have low reuse probability can be demoted. Activation tiles that are consumed and discarded within a layer can live in the smallest, fastest tier.
This is the insight behind software-managed scratchpad architectures (the TPU's approach), behind hardware wired-residency proposals, behind the MCOS framework, and behind every modern KV eviction and admission policy. They are all attempts to put data in the right tier based on its actual access pattern — because the physics of the tiers are so different that misplacement has severe consequences.
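All of these policies share a skeleton that fits in a few lines: sort data by hotness, fill the fastest tier first. A toy sketch with invented object sizes and tier capacities (the capacities loosely echo the H200 figures above; the policy is illustrative, not any specific system's):

```python
# Toy tier placement: hottest data claims the fastest tier that has room.
TIERS = [  # (name, capacity in GB), fastest first
    ("SRAM", 0.05),
    ("HBM", 141.0),
    ("DDR", 512.0),
]
OBJECTS = [  # (name, size in GB, reads per token) -- invented workload
    ("layer norms + small weights", 0.04, 1.0),
    ("attention + MLP weights", 69.8, 1.0),
    ("hot KV cache", 2.5, 1.0),
    ("cold KV cache (idle sessions)", 200.0, 0.01),
]

def place(objects, tiers):
    free = {name: cap for name, cap in tiers}
    placement = {}
    # Hottest data first, so it claims the fastest remaining tier.
    for name, size, heat in sorted(objects, key=lambda o: -o[2]):
        for tier, _ in tiers:
            if free[tier] >= size:
                free[tier] -= size
                placement[name] = tier
                break
    return placement

print(place(OBJECTS, TIERS))
```

Real systems refine every piece of this — hotness is predicted rather than known, evictions are incremental, tiers have bandwidth as well as capacity limits — but the objective is the same: per-token reads as close to the cores as they fit, cold data demoted.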
The compute benchmark is not the right question for an AI processor. TFLOP/s per dollar, TOPS/W, matrix multiply throughput — these numbers are largely irrelevant in a regime where the compute cores are idle 99% of the time waiting for memory.
The right questions are:

- How many bytes must cross the memory bus for every generated token?
- How fast can the slowest tier in the serving path stream the model's weights?
- How much of the hot data (weights, KV cache) fits in the tier closest to the compute cores, and for how long can it stay there?
- What interconnect tax is paid each time an activation leaves the chip?
Every architectural choice in AI inference — HBM vs. SRAM, monolithic vs. chiplet, wafer-scale vs. conventional die, software scratchpad vs. hardware cache — is ultimately a negotiation with the same underlying physics. The core is fast and cheap. The memory is slow and expensive per unit of bandwidth. The chip boundary is a tax that compounds with distance.
The processor that wins inference is not the one with the most arithmetic power. It is the one that has found the best answer to a much older, much more physical question: how do you get the right data, to the right place, at the right time, without paying more than the workload is worth?
Inference speed is a memory problem. It has always been a memory problem. The GPU was an extraordinary answer to a different question — and an expensive mismatch for this one. Understanding that mismatch is the beginning of understanding why AI hardware is being rebuilt from the physics up.