SRAM as the Specialist: Heterogeneous Inference and the Decode Bottleneck

The Thesis in One Sentence

SRAM-route AI chips are unlikely to win by doing full-stack LLM inference alone, but they can become strategically powerful if they plug into mainstream GPU, TPU, or cloud-ASIC stacks as specialists for the bandwidth-bound part of decode.

This is a narrower claim than "SRAM is the next universal memory tier for AI." It is also a more credible one — it fits the physics of the hardware, the workload characteristics of transformer inference, and the direction of current industry announcements.

Why it matters now

As of May 2026, this is no longer speculative. NVIDIA's Rubin + Groq 3 LPX product describes exactly this division of labour in a mainstream platform. AWS and Cerebras describe a disaggregated Trainium + Cerebras path. The thesis has product validation.

Inference Is Not One Workload

The strongest arguments for heterogeneous inference start with a simple observation: a transformer serving stack does not behave like one monolithic kernel. At the highest level, LLM inference has three distinct stages with very different resource bottlenecks.

Fig 1: Inference stages have fundamentally different resource profiles. Prefill is compute-bound (dense GEMM). Decode Attention is bandwidth+capacity-bound (KV cache). Decode FFN is bandwidth-bound with a fixed weight footprint — the best fit for SRAM-heavy accelerators.

Stage 1 · Prefill

The model reads the entire prompt at once. Q_len = N so the attention matrix is large — many queries against many keys. Very compute-intensive. HBM-heavy GPUs dominate here.

Stage 2 · Decode Attention

One token generated per step. Query length = 1. But the model must attend over the entire KV cache (all prior tokens). The KV cache grows with context and concurrent users — a capacity burden SRAM struggles with.

Stage 3 · Decode FFN

After attention, each token passes through feed-forward layers. These stream model weights at high bandwidth — but the weight size is fixed (not growing with context), making SRAM's bandwidth advantage decisive.

Deep Dive: What Actually Happens in Decode

The "decode phase" is the part of inference most people interact with — it's the token-by-token generation you see streaming out of ChatGPT or Claude. But what the hardware is doing each step is more complex than "generate one word." Here is what happens inside one decode step.

One Decode Step — Interactive Walkthrough

Input: the last generated token

The previous step produced a token ID (e.g., token 4821 = "the"). This integer is looked up in the embedding table — a matrix of shape [vocab_size × d_model] — to produce a dense float vector.

Token ID: 4821

Embedding lookup → [d_model=4096] vector

dtype: BF16

QKV projections — three weight-stream GEMMs

The embedding vector is multiplied by three weight matrices (W_Q, W_K, W_V) to produce Query, Key, and Value vectors for each attention head. These are the most compute-intensive operations in decode — but since Q_len=1, they are matrix-vector products (GEMV), not matrix-matrix products (GEMM). GEMV is almost entirely memory-bandwidth-bound, not compute-bound.

x · W_Q → Q [n_heads × head_dim]

x · W_K → K [n_kv_heads × head_dim]

x · W_V → V [n_kv_heads × head_dim]

Memory-bandwidth bound (GEMV not GEMM)

KV cache write — this is why DRAM capacity matters

The newly computed Key and Value vectors are appended to the KV cache — the persistent memory store of all prior tokens' K and V. This cache lives in HBM (or DRAM) because it grows with context length. At 128K context with 80 layers: the KV cache alone can exceed 60–80 GB for a 70B model in BF16. SRAM simply cannot hold this.

K[t] → append to K cache [seq_len × n_kv_heads × head_dim]

V[t] → append to V cache [same]

KV cache lives in HBM/DRAM — too large for SRAM

Attention scores — read the entire KV cache

For each head, compute dot products between the current Query and every Key in the KV cache. This reads the entire KV cache from HBM — all prior token keys — to produce attention score logits. This is the most memory-bandwidth-intensive step. Longer context → more bytes to read → slower. This is why FlashAttention and FlashDecode exist: to tile this read efficiently.

scores = Q · K_cache^T [1 × seq_len]

Read entire K cache from HBM: O(N × d) bytes

softmax(scores / √d_head)

Value aggregation — weighted sum of all past values

Multiply attention probabilities by the Value cache. This reads the entire V cache from HBM and produces a single vector per head (the attended output). The result is projected back to model dimension via W_O. Total HBM reads in attention: K_cache + V_cache + W_O weights.

output = attn_probs · V_cache [1 × d_head]

Read entire V cache from HBM: O(N × d) bytes

concat heads → project via W_O → [d_model]

FFN — the SRAM target: streaming fixed weights

After attention, the vector passes through the Feed-Forward Network: two (or three in SwiGLU) linear projections with a nonlinearity. These are again GEMV operations — streaming large weight matrices to multiply against a single activation vector. Unlike the KV cache, the FFN weights are fixed in size — they don't grow with context or batch size. This is the key insight: SRAM's high bandwidth matters, and its limited capacity is not a bottleneck here.

h = SwiGLU(x · W_gate, x · W_up) [d_ff=16384]

out = h · W_down → [d_model]

FFN weight size FIXED — perfect for SRAM-resident weights

W_gate + W_up + W_down ≈ 12×d_model×d_ff bytes

Output projection + sampling — produce next token

The output vector is projected to vocabulary logits via a weight matrix [d_model × vocab_size], scaled by temperature, and sampled (greedy argmax, top-p, or top-k). The resulting token ID becomes the input to the next decode step. The entire loop repeats until EOS or max length.

logits = output · W_lm_head [vocab_size=128K]

next_token = sample(softmax(logits / temp))

→ goto step 1 for next token

Fig 2: One decode step — full data flow per layer. HBM is read for every weight matrix and the entire KV cache. The FFN weights (W_gate, W_up, W_down) are fixed size regardless of context length or batch, making them the natural SRAM residency target.

The Decode FFN: Why It's the SRAM Opportunity

Of all the operations in decode, the feed-forward network (FFN) layers are the most SRAM-friendly. To understand why, you need to understand what the FFN does and what its memory footprint looks like.

What the FFN computes

Most modern transformers use a SwiGLU FFN (Llama, Mistral, Gemma, etc.):

gate = x × W_gate // [d_model → d_ff] e.g. 4096 → 16384
up = x × W_up // [d_model → d_ff]
hidden= SiLU(gate) × up // elementwise gating
out = hidden × W_down // [d_ff → d_model] 16384 → 4096

The memory footprint is fixed — this is the key

The FFN has three weight matrices. For Llama-3-70B (d_model=8192, d_ff=28,672 per layer, 80 layers):

Matrix	Shape	Size (BF16)	Notes
`W_gate`	8192 × 28672	471 MB per layer	Linear projection
`W_up`	8192 × 28672	471 MB per layer	Linear projection
`W_down`	28672 × 8192	471 MB per layer	Down projection
Per layer total		~1.4 GB	Fixed — does not grow with context or batch
All 80 layers		~112 GB	Full model FFN weight budget

Each decode step must stream all 1.4 GB of FFN weights per layer through the compute unit. The activation vector (just 8192 floats — 16 KB) is tiny. This is a near-perfect memory-bandwidth-bound problem: minimal compute, maximal memory movement. The computation is essentially: "drag 1.4 GB across the memory bus to multiply it against a 16 KB vector."

Arithmetic intensity — why this screams for SRAM

Arithmetic intensity = FLOPs / bytes. For an FFN GEMV: 2 × 8192 × 28672 FLOPs / (8192 × 28672 × 2 bytes) = ~1 FLOP/byte. H100's peak ratio is ~300 FLOP/byte (BF16). We are using 0.3% of the available compute. The only thing that matters is how fast you can move bytes.

Fig 3: Memory bandwidth hierarchy. On-chip SRAM delivers ~20 TB/s vs H100 HBM3's 3.35 TB/s — roughly 6×. For FFN decode (which is almost entirely bandwidth-bound), this directly translates to ~6× faster weight streaming and therefore ~6× more tokens/second — before any other system effects.

The fixed-weight advantage — why attention is harder

Attention also needs high bandwidth, but it has a problem SRAM cannot solve: the KV cache grows linearly with context length and concurrently served batch size. For a 128K context window with batch size 64 on Llama-3-70B:

KV Cache — grows unboundedly

128K tokens × 80 layers × 8 KV heads × 128 dims × 2 bytes × 2 (K+V) × 64 batch

≈ 2.1 TB — cannot fit in any current SRAM

FFN Weights — always fixed

80 layers × 3 matrices × 8192 × 28672 × 2 bytes

≈ 112 GB — fixed regardless of context or batch

Cerebras CS-3: 44 GB on-chip. Can hold ~40% of FFN weights resident.

MoE Makes the FFN Argument Even Stronger

Mixture-of-Experts (MoE) models like Mixtral 8×7B, DeepSeek-V3, and GPT-4 (speculated) use conditional routing: each token is sent to only 2–4 of N expert FFN blocks, not all of them. This makes the FFN-as-SRAM-target argument substantially stronger.

Fig 4: MoE routing. For any given token, only 2 of N experts are active. In a sufficiently small MoE model, all experts can fit in on-chip SRAM. Even for larger MoE models, the most frequently accessed experts (hot experts) can be kept resident in SRAM, dramatically reducing HBM reads.

This is the scenario where SRAM becomes truly decisive: a MoE model with experts small enough to fit on-chip (or at least the hot subset), running decode where each token only routes to 2 of 64 experts. The SRAM chip serves those experts with ~20 TB/s local bandwidth while an HBM-heavy GPU handles attention and any cold-expert loads.

The SRAM Tradeoff Is Clear

SRAM-heavy chips make a very explicit bargain with physics and economics. They use expensive on-chip memory to deliver exceptional local bandwidth and low latency — but they do not scale capacity as gracefully as HBM-based systems.

Strengths

~20 TB/s on-chip bandwidth vs ~3–5 TB/s HBM
Low, deterministic latency (no DRAM refresh, no page miss)
Strong fit for bandwidth-bound decode paths (FFN, MoE hot experts)
Excellent TPOP (time per output token) characteristics
Lower power per GB/s moved vs HBM

Weaknesses

Capacity is expensive and hard to scale — 44 GB is ceiling for many designs
KV cache for long-context is simply too large
Compute density can be less favorable than GPU for prefill
Full-stack serving economics are difficult to justify
Ecosystem maturity vs CUDA/ROCm is a real friction

The slogan that doesn't hold

"SRAM will replace HBM" is physically and economically untenable at the capacity scales that long-context serving requires. HBM keeps scaling in both capacity (144 GB+ per stack) and bandwidth (HBM3e at 4.8 TB/s). The stronger, more durable claim: SRAM wins the bandwidth-bound subproblem within a heterogeneous stack.

Stage-by-Stage SRAM Fit

Inference Stage	Main Bottleneck	SRAM Fit	Why
Prefill	Compute intensity (GEMM)	Weak	SRAM area spent on memory, not compute. Dense GEMM needs FLOPs, not bandwidth. HBM-heavy GPU wins.
Decode Attention	Bandwidth + KV cache capacity	Mixed	Bandwidth helps, but KV cache capacity grows with context and batch size beyond any SRAM ceiling.
Decode FFN	Weight streaming bandwidth	Strongest ★	Fixed weight size, high bandwidth requirement, near-zero arithmetic intensity. SRAM bandwidth directly = token speed.
MoE FFN (hot experts)	Expert weight bandwidth	Best fit ★★	Active expert subset fits in SRAM. Hot experts resident on-chip. Cold experts fetched from HBM only when needed.

Heterogeneous Inference: The New Structural Answer

Heterogeneous inference means splitting serving across multiple hardware classes rather than insisting one platform do everything. The emerging blueprint:

Fig 5: Heterogeneous inference division of labour. HBM-heavy GPUs handle prefill (compute-bound) and decode attention (capacity-bound). SRAM-heavy accelerators handle decode FFN/MoE (bandwidth-bound with fixed weight size). Each chip operates near its natural Pareto frontier.

What Exists in Industry Today

As of May 2026, this is no longer speculative.

Clearest Example

NVIDIA Vera Rubin + Groq 3 LPX

NVIDIA's March 2026 Rubin materials describe Rubin GPUs handling prefill + decode attention while Groq 3 LPX handles latency-sensitive FFN and MoE decode. The cleanest explicit product embodiment of the SRAM-sidecar thesis.

Standalone Archetype

Groq LPU

On-chip SRAM as primary storage. Deterministic, low-latency inference. Pure real-world demonstration of the SRAM route in production. Excellent token speed on small-to-medium models.

Cloud Direction

AWS Trainium + Cerebras

Publicly described disaggregated inference: Trainium for prefill, Cerebras for decode. Not yet the exact decode-FFN-only split — but hyperscaler-grade willingness to assign different stages to different silicon.

Cerebras: promise and tension

Cerebras is the most technically dramatic case. The CS-3 wafer-scale chip carries 44 GB of on-chip SRAM and ~21 PB/s of memory bandwidth — extraordinary numbers. But 44 GB means Cerebras can hold the weights of models up to roughly 20–25B parameters resident on-chip in BF16, or a larger model in INT8. That works brilliantly for small-to-medium model FFN decode; it hits walls for 70B+ models or very long-context attention KV caches.

Best Case for Cerebras

Deeply integrated into a heterogeneous cloud stack: another accelerator handles prefill and decode attention; Cerebras owns decode FFN. Avoids the KV capacity ceiling entirely.

Harder Case

Tries to stand alone as a universal inference replacement and gets pulled into the most difficult economics — large model KV capacity, long-context scaling, whole-system cost justification.

The Best Counterarguments

HBM keeps improving. HBM3e at 4.8 TB/s and upcoming HBM4 will narrow the gap. The SRAM bandwidth advantage shrinks if HBM roadmaps hold. (Counter: HBM4 projected at ~10 TB/s; SRAM can scale too, and 10 TB/s still trails 20+ TB/s SRAM by 2×.)
Integration is hard. Splitting inference across chips introduces scheduling, communication latency, compiler complexity, and software orchestration. The token vector handoff between GPU and SRAM accelerator is a real latency budget.
Economics depend on utilisation. A specialist chip only makes economic sense if operators can keep it busy. Decode FFN is not 100% of serving time — idle time hurts TCO.
Not every model behaves the same. Dense models, MoE models, long-context models, and agentic workloads stress the stack differently. A chip optimised for MoE FFN decode may underperform on dense-attention-heavy workloads.

These objections don't kill the thesis

They simply mean the winning SRAM role is likely to be selective, integrated, and software-dependent rather than universal. A chip that wins one clearly-bounded subproblem and integrates cleanly with the dominant ecosystem is still a large business.

Does Specialisation Shrink the TAM?

Counterintuitively, no. A narrower functional role can mean a larger practical market.

If an SRAM-route company insists on selling a full alternative inference platform, it competes for a relatively small number of premium system decisions. But if it becomes the preferred decode-FFN accelerator inside mainstream GPU, TPU, or cloud-ASIC stacks, it rides the much larger shipment volume of those ecosystems.

The TAM story improves when SRAM stops trying to replace the whole machine and starts becoming the standard high-bandwidth part inside many machines.

Bottom Line

The structural opportunity in AI chips is no longer about one memory technology defeating another in a clean head-to-head replacement battle. It is about workflow decomposition.

HBM-heavy systems remain the general-purpose backbone. SRAM-heavy systems increasingly look like the specialist accelerators that reclaim performance in bandwidth-constrained decode paths — particularly FFN and MoE decode, where arithmetic intensity is near zero and the weight footprint is fixed.

The durable claim

SRAM wins the bandwidth-bound subproblem within a heterogeneous stack. This is a smaller claim than "SRAM is the future of all AI chips," but it is far more durable because it aligns with physics, workload characteristics, and now has explicit product validation from NVIDIA, AWS, Groq, and Cerebras.

The Specialist Chip:
SRAM, Heterogeneous Inference,
and the Decode Bottleneck