MAN\SH AI
AI Chips · Long-Form Industry Analysis

The Specialist Chip:
SRAM, Heterogeneous Inference,
and the Decode Bottleneck

The next structural shift in AI hardware is not SRAM replacing HBM. It is the realisation that LLM inference decomposes into distinct stages — and that each stage rewards a different kind of silicon. SRAM-heavy accelerators are positioned to own the exact decode subpath where conventional GPU stacks leave the most performance on the table. This edition adds deep technical diagrams of decode and decode-FFN, plus an interactive step-through of what actually happens each token generation step.

Updated May 14, 2026 · Focus: Decode architecture, SRAM, HBM, Groq, Cerebras, NVIDIA Rubin+LPX v2 — with full decode diagrams
The next structural opportunity in AI chips is not SRAM replacing HBM. It is heterogeneous inference — different chips doing different pieces of the generation loop — with SRAM-heavy accelerators increasingly positioned to own the decode-side FFN and MoE path.

The Thesis in One Sentence

SRAM-route AI chips are unlikely to win by doing full-stack LLM inference alone, but they can become strategically powerful if they plug into mainstream GPU, TPU, or cloud-ASIC stacks as specialists for the bandwidth-bound part of decode.

This is a narrower claim than "SRAM is the next universal memory tier for AI." It is also a more credible one — it fits the physics of the hardware, the workload characteristics of transformer inference, and the direction of current industry announcements.

Why it matters now
As of May 2026, this is no longer speculative. NVIDIA's Rubin + Groq 3 LPX product describes exactly this division of labour in a mainstream platform. AWS and Cerebras describe a disaggregated Trainium + Cerebras path. The thesis has product validation.

Inference Is Not One Workload

The strongest arguments for heterogeneous inference start with a simple observation: a transformer serving stack does not behave like one monolithic kernel. At the highest level, LLM inference has three distinct stages with very different resource bottlenecks.

[Figure 1 diagram: Three Stages of LLM Inference. Stage 1 · Prefill: process full prompt, Q_len = N (big), compute-bound (dense GEMM, high FLOPS), poor SRAM fit. Stage 2 · Decode Attention: 1 token per step, Q_len = 1, bandwidth + capacity bound (KV cache grows with N), mixed SRAM fit. Stage 3 · Decode FFN: 1 token per step, weight streaming, memory-bandwidth bound, fixed weight size, best SRAM fit ★. Stages 2 and 3 repeat once per output token (up to thousands of times).]
Fig 1: Inference stages have fundamentally different resource profiles. Prefill is compute-bound (dense GEMM). Decode Attention is bandwidth+capacity-bound (KV cache). Decode FFN is bandwidth-bound with a fixed weight footprint — the best fit for SRAM-heavy accelerators.
Stage 1 · Prefill

The model reads the entire prompt at once. Q_len = N so the attention matrix is large — many queries against many keys. Very compute-intensive. HBM-heavy GPUs dominate here.

Stage 2 · Decode Attention

One token generated per step. Query length = 1. But the model must attend over the entire KV cache (all prior tokens). The KV cache grows with context and concurrent users — a capacity burden SRAM struggles with.

Stage 3 · Decode FFN

After attention, each token passes through feed-forward layers. These stream model weights at high bandwidth — but the weight size is fixed (not growing with context), making SRAM's bandwidth advantage decisive.

Deep Dive: What Actually Happens in Decode

The "decode phase" is the part of inference most people interact with — it's the token-by-token generation you see streaming out of ChatGPT or Claude. But what the hardware is doing each step is more complex than "generate one word." Here is what happens inside one decode step.

One Decode Step — Interactive Walkthrough
Step 1 · Input: the last generated token

The previous step produced a token ID (e.g., token 4821 = "the"). This integer is looked up in the embedding table — a matrix of shape [vocab_size × d_model] — to produce a dense float vector.
Token ID: 4821
Embedding lookup → [d_model=4096] vector
dtype: BF16
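In code, the lookup is just a row read. A minimal numpy sketch with the walkthrough's dimensions (the table here is zero-filled and float16 stands in for BF16, both purely illustrative):

import numpy as np

vocab_size, d_model = 128_256, 4096                              # walkthrough-scale dims
emb_table = np.zeros((vocab_size, d_model), dtype=np.float16)    # stand-in for the BF16 embedding matrix

token_id = 4821                                                  # the token produced by the previous step
x = emb_table[token_id]                                          # one row → the [d_model] activation vector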
Step 2 · QKV projections — three weight-streaming GEMVs

The embedding vector is multiplied by three weight matrices (W_Q, W_K, W_V) to produce Query, Key, and Value vectors for each attention head. These are sizeable matrix multiplies, but since Q_len=1 they are matrix-vector products (GEMV), not matrix-matrix products (GEMM). GEMV is almost entirely memory-bandwidth-bound, not compute-bound.
x · W_Q → Q [n_heads × head_dim]
x · W_K → K [n_kv_heads × head_dim]
x · W_V → V [n_kv_heads × head_dim]
Memory-bandwidth bound (GEMV not GEMM)
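A minimal numpy sketch of the same projections, with an illustrative Llama-style GQA configuration (head counts and dtypes are assumptions, not any specific model's). The point it makes: the bytes moved are dominated by the weight matrices, not the tiny activation:

import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, n_kv_heads, head_dim = 4096, 32, 8, 128    # illustrative GQA config

x   = rng.standard_normal(d_model, dtype=np.float32)         # current token's activation
W_q = rng.standard_normal((d_model, n_heads * head_dim),    dtype=np.float32)
W_k = rng.standard_normal((d_model, n_kv_heads * head_dim), dtype=np.float32)
W_v = rng.standard_normal((d_model, n_kv_heads * head_dim), dtype=np.float32)

# GEMV, not GEMM: a single vector against large weight matrices
q = (x @ W_q).reshape(n_heads,    head_dim)
k = (x @ W_k).reshape(n_kv_heads, head_dim)
v = (x @ W_v).reshape(n_kv_heads, head_dim)

weight_bytes     = W_q.nbytes + W_k.nbytes + W_v.nbytes      # ~100 MB streamed per layer (FP32 here)
activation_bytes = x.nbytes                                   # ~16 KB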
Step 3 · KV cache write — this is why DRAM capacity matters

The newly computed Key and Value vectors are appended to the KV cache — the persistent memory store of all prior tokens' K and V. This cache lives in HBM (or DRAM) because it grows with context length. At 128K context with 80 layers and 8 KV heads, the KV cache runs to roughly 40 GB per sequence for a 70B model in BF16, and it multiplies with batch size. SRAM simply cannot hold this.
K[t] → append to K cache [seq_len × n_kv_heads × head_dim]
V[t] → append to V cache [same]
KV cache lives in HBM/DRAM — too large for SRAM
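A sketch of the append and of the capacity arithmetic, using the 70B-style configuration from the text (the naive concatenate is a stand-in for real paged KV memory):

import numpy as np

n_layers, n_kv_heads, head_dim = 80, 8, 128              # Llama-3-70B-style config from the text

# one layer's cache, grown by one position per decode step
k_cache = np.zeros((0, n_kv_heads, head_dim), dtype=np.float16)
v_cache = np.zeros((0, n_kv_heads, head_dim), dtype=np.float16)

k_t = np.random.randn(1, n_kv_heads, head_dim).astype(np.float16)
v_t = np.random.randn(1, n_kv_heads, head_dim).astype(np.float16)
k_cache = np.concatenate([k_cache, k_t])                 # seq_len grows by 1 every step
v_cache = np.concatenate([v_cache, v_t])

# capacity, not bandwidth, is the constraint: bytes per token across all layers (K and V, BF16)
bytes_per_token = n_layers * n_kv_heads * head_dim * 2 * 2                            # ≈ 320 KB
print(f"≈ {128 * 1024 * bytes_per_token / 1e9:.0f} GB per sequence at 128K context")  # ≈ 43 GB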
Step 4 · Attention scores — read the entire key cache

For each head, compute dot products between the current Query and every Key in the KV cache. This reads the entire K cache from HBM — all prior tokens' keys — to produce attention score logits. This is the most memory-bandwidth-intensive step. Longer context → more bytes to read → slower. This is why FlashAttention and Flash-Decoding exist: to tile this read efficiently.
scores = Q · K_cache^T [1 × seq_len]
Read entire K cache from HBM: O(N × d) bytes
softmax(scores / √d_head)
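A single-step attention-score sketch (illustrative context length; one query per KV head, so the query-head grouping of GQA is elided). Note that the whole K cache is touched once per token:

import numpy as np

seq_len, n_kv_heads, head_dim = 8192, 8, 128             # illustrative context length and GQA config

q       = np.random.randn(n_kv_heads, head_dim).astype(np.float16)    # one query per KV head
k_cache = np.random.randn(seq_len, n_kv_heads, head_dim).astype(np.float16)

# every key in the cache is read once per step: O(seq_len × d) bytes of HBM traffic
scores = np.einsum('hd,shd->hs', q, k_cache) / np.sqrt(head_dim)      # [n_kv_heads, seq_len]
probs  = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)                            # softmax over the full context

kv_bytes_read = k_cache.nbytes                                        # grows linearly with seq_len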
Step 5 · Value aggregation — weighted sum of all past values

Multiply attention probabilities by the Value cache. This reads the entire V cache from HBM and produces a single vector per head (the attended output). The result is projected back to model dimension via W_O. Total HBM reads in attention: K_cache + V_cache + W_O weights.
output = attn_probs · V_cache [1 × d_head]
Read entire V cache from HBM: O(N × d) bytes
concat heads → project via W_O → [d_model]
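Continuing the same simplified sketch for the value side (uniform attention weights stand in for the probabilities computed above):

import numpy as np

seq_len, n_kv_heads, head_dim, d_model = 8192, 8, 128, 4096                  # illustrative config

probs   = np.full((n_kv_heads, seq_len), 1.0 / seq_len, dtype=np.float32)    # attention weights from the previous step
v_cache = np.random.randn(seq_len, n_kv_heads, head_dim).astype(np.float16)
W_o     = np.random.randn(n_kv_heads * head_dim, d_model).astype(np.float16)

# every value in the cache is read once, then heads are concatenated and projected back to d_model
attended = np.einsum('hs,shd->hd', probs, v_cache)        # [n_kv_heads, head_dim]
out      = attended.reshape(-1) @ W_o                     # [d_model], added to the residual stream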
Step 6 · FFN — the SRAM target: streaming fixed weights

After attention, the vector passes through the Feed-Forward Network: two (or three in SwiGLU) linear projections with a nonlinearity. These are again GEMV operations — streaming large weight matrices to multiply against a single activation vector. Unlike the KV cache, the FFN weights are fixed in size — they don't grow with context or batch size. This is the key insight: SRAM's high bandwidth matters, and its limited capacity is not a bottleneck here.
h = SwiGLU(x · W_gate, x · W_up) [d_ff=16384]
out = h · W_down → [d_model]
FFN weight size FIXED — perfect for SRAM-resident weights
W_gate + W_up + W_down ≈ 6 × d_model × d_ff bytes (BF16)
Step 7 · Output projection + sampling — produce next token

The output vector is projected to vocabulary logits via a weight matrix [d_model × vocab_size], scaled by temperature, and sampled (greedy argmax, top-p, or top-k). The resulting token ID becomes the input to the next decode step. The entire loop repeats until EOS or max length.
logits = output · W_lm_head [vocab_size=128K]
next_token = sample(softmax(logits / temp))
→ goto step 1 for next token
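A toy sampling sketch with shrunken dimensions so it runs anywhere (real models use d_model near 8K and a roughly 128K vocabulary; the temperature and sampling rule are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, temperature = 1024, 32_000, 0.8   # toy sizes

out       = rng.standard_normal(d_model, dtype=np.float32)                  # final hidden state of this step
W_lm_head = rng.standard_normal((d_model, vocab_size), dtype=np.float32) * 0.02

logits = (out @ W_lm_head).astype(np.float64) / temperature
probs  = np.exp(logits - logits.max())
probs /= probs.sum()

next_token = int(rng.choice(vocab_size, p=probs))      # or np.argmax(logits) for greedy decoding
# next_token is fed back into Step 1 of the walkthrough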
[Figure 2 diagram: One Decode Step — Memory Access Pattern per Layer. Token embed → QKV projection (x·W_Q, x·W_K, x·W_V; reads W_Q, W_K, W_V from HBM) → KV cache in HBM (K[0..t], V[0..t]; grows with context N, cannot fit in SRAM; append K[t], V[t]) → Attention (Q·K_cache^T → softmax → ·V_cache; reads the entire K+V cache from HBM each step) → Output projection (·W_O, residual add; reads W_O from HBM) → FFN / MoE path (x·W_gate, x·W_up → SwiGLU → ·W_down; fixed weight size, the SRAM target ★; residual add) → LM head (logits → sample → next token). HBM reads: W_Q/K/V, W_O, and the KV cache. SRAM target: the fixed-size FFN weights.]
Fig 2: One decode step — full data flow per layer. HBM is read for every weight matrix and the entire KV cache. The FFN weights (W_gate, W_up, W_down) are fixed size regardless of context length or batch, making them the natural SRAM residency target.

The Decode FFN: Why It's the SRAM Opportunity

Of all the operations in decode, the feed-forward network (FFN) layers are the most SRAM-friendly. To understand why, you need to understand what the FFN does and what its memory footprint looks like.

What the FFN computes

Most modern transformers use a SwiGLU FFN (Llama, Mistral, and most other recent open-weight models):

gate   = x × W_gate        // [d_model → d_ff] e.g. 4096 → 16384
up     = x × W_up          // [d_model → d_ff]
hidden = SiLU(gate) × up   // elementwise gating
out    = hidden × W_down   // [d_ff → d_model] 16384 → 4096
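The same four lines as runnable numpy, at the walkthrough's 4096/16384 scale (random FP32 weights stand in for real BF16 parameters):

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4096, 16384                     # walkthrough-scale dims; Llama-3-70B uses 8192 / 28672

def silu(z):
    return z / (1.0 + np.exp(-z))

x      = rng.standard_normal(d_model, dtype=np.float32)
W_gate = rng.standard_normal((d_model, d_ff), dtype=np.float32) * 0.02
W_up   = rng.standard_normal((d_model, d_ff), dtype=np.float32) * 0.02
W_down = rng.standard_normal((d_ff, d_model), dtype=np.float32) * 0.02

hidden = silu(x @ W_gate) * (x @ W_up)          # elementwise gating, shape [d_ff]
out    = hidden @ W_down                        # back to [d_model]

# three GEMVs per token: weight bytes dominate, the activation is only a few KB
weight_bytes = W_gate.nbytes + W_up.nbytes + W_down.nbytes   # ~0.8 GB here (FP32); ~1.4 GB/layer at 70B scale in BF16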

The memory footprint is fixed — this is the key

The FFN has three weight matrices. For Llama-3-70B (d_model=8192, d_ff=28,672 per layer, 80 layers):

Matrix | Shape | Size (BF16) | Notes
W_gate | 8192 × 28672 | 471 MB per layer | Linear projection
W_up | 8192 × 28672 | 471 MB per layer | Linear projection
W_down | 28672 × 8192 | 471 MB per layer | Down projection
Per layer total | | ~1.4 GB | Fixed — does not grow with context or batch
All 80 layers | | ~112 GB | Full model FFN weight budget

Each decode step must stream all 1.4 GB of FFN weights per layer through the compute unit. The activation vector (just 8192 floats — 16 KB) is tiny. This is a near-perfect memory-bandwidth-bound problem: minimal compute, maximal memory movement. The computation is essentially: "drag 1.4 GB across the memory bus to multiply it against a 16 KB vector."

Arithmetic intensity — why this screams for SRAM
Arithmetic intensity = FLOPs / bytes. For an FFN GEMV: 2 × 8192 × 28672 FLOPs / (8192 × 28672 × 2 bytes) = ~1 FLOP/byte. H100's peak ratio is ~300 FLOP/byte (BF16). We are using 0.3% of the available compute. The only thing that matters is how fast you can move bytes.
[Figure 3 chart: Memory Bandwidth — Why SRAM Wins FFN Decode. Higher bandwidth = faster weight streaming = more tokens/second.]
On-chip SRAM (Groq/Cerebras) | ~20 TB/s
H200 HBM3e | 4.8 TB/s
H100 HBM3 | 3.35 TB/s
NVLink 4 (H100 inter-GPU) | 900 GB/s
CPU DRAM (DDR5) | 300 GB/s
SRAM ≈ 6× H100 HBM3 → ~6× faster weight streaming
Fig 3: Memory bandwidth hierarchy. On-chip SRAM delivers ~20 TB/s vs H100 HBM3's 3.35 TB/s — roughly 6×. For FFN decode (which is almost entirely bandwidth-bound), this directly translates to ~6× faster weight streaming and therefore ~6× more tokens/second — before any other system effects.
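A back-of-envelope sketch of what Fig 3 implies for single-stream decode speed: if the step were purely weight-streaming-bound, the tokens-per-second ceiling is just bandwidth divided by bytes moved per token. It ignores attention, KV-cache traffic, batching, and compute overlap, so treat it as an upper bound on the FFN path only:

ffn_bytes_per_token = 112e9   # ~112 GB of FFN weights per token (Llama-3-70B, BF16, all 80 layers)

for name, bandwidth in [("H100 HBM3", 3.35e12), ("H200 HBM3e", 4.8e12), ("on-chip SRAM", 20e12)]:
    ceiling = bandwidth / ffn_bytes_per_token
    print(f"{name:>12}: ~{ceiling:3.0f} tok/s ceiling (single stream, FFN weights only)")

# H100 HBM3: ~30 tok/s · H200 HBM3e: ~43 tok/s · on-chip SRAM: ~179 tok/s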

The fixed-weight advantage — why attention is harder

Attention also needs high bandwidth, but it has a problem SRAM cannot solve: the KV cache grows linearly with context length and concurrently served batch size. For a 128K context window with batch size 64 on Llama-3-70B:

KV Cache — grows unboundedly

128K tokens × 80 layers × 8 KV heads × 128 dims × 2 bytes × 2 (K+V) × 64 batch

≈ 2.7 TB — cannot fit in any current SRAM

FFN Weights — always fixed

80 layers × 3 matrices × 8192 × 28672 × 2 bytes

≈ 112 GB — fixed regardless of context or batch

Cerebras CS-3: 44 GB on-chip. Can hold ~40% of FFN weights resident.
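The two formulas above as a small helper, using the document's Llama-3-70B configuration; a sketch to make the arithmetic reproducible, not a capacity-planning tool:

def kv_cache_bytes(ctx_len, batch, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_el=2):
    # K and V, for every layer, every position, every sequence in the batch
    return ctx_len * batch * n_layers * n_kv_heads * head_dim * bytes_per_el * 2

def ffn_weight_bytes(n_layers=80, d_model=8192, d_ff=28672, bytes_per_el=2):
    # W_gate, W_up, W_down per layer: fixed regardless of context or batch
    return n_layers * 3 * d_model * d_ff * bytes_per_el

print(f"KV cache, 128K context, batch 64: {kv_cache_bytes(128 * 1024, 64) / 1e12:.1f} TB")   # ≈ 2.7 TB
print(f"FFN weights, all 80 layers:       {ffn_weight_bytes() / 1e9:.0f} GB")                # ≈ 113 GB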

MoE Makes the FFN Argument Even Stronger

Mixture-of-Experts (MoE) models like Mixtral 8×7B, DeepSeek-V3, and GPT-4 (speculated) use conditional routing: each token is sent to only a small subset of the N expert FFN blocks (2 of 8 in Mixtral, 8 of 256 routed experts in DeepSeek-V3), not all of them. This makes the FFN-as-SRAM-target argument substantially stronger.

[Figure 4 diagram: MoE Routing — Only Active Experts Need Bandwidth. A token embedding enters the router, which picks the top-2 experts; the active experts (e.g. Expert 2 and Expert 5 ★) run while the others stay idle; their outputs are combined as a weighted sum and passed to the next layer. Only 2/N experts need bandwidth per token, so SRAM can hold the hot experts resident.]
Fig 4: MoE routing. For any given token, only 2 of N experts are active. In a sufficiently small MoE model, all experts can fit in on-chip SRAM. Even for larger MoE models, the most frequently accessed experts (hot experts) can be kept resident in SRAM, dramatically reducing HBM reads.

This is the scenario where SRAM becomes truly decisive: a MoE model with experts small enough to fit on-chip (or at least the hot subset), running decode where each token only routes to 2 of 64 experts. The SRAM chip serves those experts with ~20 TB/s local bandwidth while an HBM-heavy GPU handles attention and any cold-expert loads.
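A minimal top-2 routing sketch in numpy (toy dimensions and eight experts; the router weights and gating scheme are illustrative, loosely Mixtral-shaped). What it shows: only the selected experts' weight matrices are ever touched for a given token:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 512, 2048, 8, 2     # toy dims; Mixtral-scale would be 4096 / 14336

def silu(z):
    return z / (1.0 + np.exp(-z))

x        = rng.standard_normal(d_model, dtype=np.float32)
W_router = rng.standard_normal((d_model, n_experts), dtype=np.float32) * 0.02
experts  = [(rng.standard_normal((d_model, d_ff), dtype=np.float32) * 0.02,   # W_gate
             rng.standard_normal((d_model, d_ff), dtype=np.float32) * 0.02,   # W_up
             rng.standard_normal((d_ff, d_model), dtype=np.float32) * 0.02)   # W_down
            for _ in range(n_experts)]

router_logits = x @ W_router
active = np.argsort(router_logits)[-top_k:]           # indices of the top-2 experts for this token
gates  = np.exp(router_logits[active])
gates /= gates.sum()

# only the two active experts' weights are touched this step; the other six stay cold
out = np.zeros(d_model, dtype=np.float32)
for g, idx in zip(gates, active):
    W_gate, W_up, W_down = experts[idx]
    out += g * ((silu(x @ W_gate) * (x @ W_up)) @ W_down)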

The SRAM Tradeoff Is Clear

SRAM-heavy chips make a very explicit bargain with physics and economics. They use expensive on-chip memory to deliver exceptional local bandwidth and low latency — but they do not scale capacity as gracefully as HBM-based systems.

Strengths
  • ~20 TB/s on-chip bandwidth vs ~3–5 TB/s HBM
  • Low, deterministic latency (no DRAM refresh, no page miss)
  • Strong fit for bandwidth-bound decode paths (FFN, MoE hot experts)
  • Excellent TPOT (time per output token) characteristics
  • Lower power per GB/s moved vs HBM
Weaknesses
  • Capacity is expensive and hard to scale — 44 GB is the ceiling even for wafer-scale designs
  • KV cache for long-context is simply too large
  • Compute density can be less favorable than GPU for prefill
  • Full-stack serving economics are difficult to justify
  • Ecosystem maturity vs CUDA/ROCm is a real friction
The slogan that doesn't hold
"SRAM will replace HBM" is physically and economically untenable at the capacity scales that long-context serving requires. HBM keeps scaling in both capacity (144 GB+ per stack) and bandwidth (HBM3e at 4.8 TB/s). The stronger, more durable claim: SRAM wins the bandwidth-bound subproblem within a heterogeneous stack.

Stage-by-Stage SRAM Fit

Inference Stage | Main Bottleneck | SRAM Fit | Why
Prefill | Compute intensity (GEMM) | Weak | SRAM area spent on memory, not compute. Dense GEMM needs FLOPs, not bandwidth. HBM-heavy GPU wins.
Decode Attention | Bandwidth + KV cache capacity | Mixed | Bandwidth helps, but KV cache capacity grows with context and batch size beyond any SRAM ceiling.
Decode FFN | Weight streaming bandwidth | Strongest ★ | Fixed weight size, high bandwidth requirement, near-zero arithmetic intensity. SRAM bandwidth directly = token speed.
MoE FFN (hot experts) | Expert weight bandwidth | Best fit ★★ | Active expert subset fits in SRAM. Hot experts resident on-chip. Cold experts fetched from HBM only when needed.

Heterogeneous Inference: The New Structural Answer

Heterogeneous inference means splitting serving across multiple hardware classes rather than insisting one platform do everything. The emerging blueprint:

[Figure 5 diagram: Heterogeneous Inference — Hardware Division of Labour. HBM-heavy GPU / TPU / cloud ASIC (NVIDIA Rubin, Google TPU v5, AWS Trainium) acts as the general-purpose backbone: prefill (dense GEMM, compute-bound) and decode attention (KV cache, capacity-bound; HBM capacity essential for a large KV cache). The token vector is handed to an SRAM-heavy accelerator (Groq LPU, Cerebras CS-3, NVIDIA LPX), the specialist for bandwidth-bound decode: decode FFN / MoE experts (weight streaming, bandwidth-bound, fixed size; ~20 TB/s on-chip bandwidth → ~6× faster FFN decode).]
Fig 5: Heterogeneous inference division of labour. HBM-heavy GPUs handle prefill (compute-bound) and decode attention (capacity-bound). SRAM-heavy accelerators handle decode FFN/MoE (bandwidth-bound with fixed weight size). Each chip operates near its natural Pareto frontier.
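At the control-flow level, the division of labour reduces to a simple per-step loop. A deliberately toy sketch of where the token-vector handoff sits; the two "devices" below are plain Python functions, not any vendor's runtime or API:

def hbm_gpu_prefill(prompt_tokens):
    # HBM-heavy device: dense GEMM over the full prompt, builds the KV cache
    kv_cache = list(prompt_tokens)            # stand-in for the real per-layer K/V tensors
    return kv_cache, prompt_tokens[-1]        # cache plus the last-token activation (elided)

def hbm_gpu_decode_attention(token_vec, kv_cache):
    # HBM-heavy device: appends to and attends over the growing KV cache
    kv_cache.append(token_vec)
    return f"attn({token_vec})"

def sram_decode_ffn(attn_out):
    # SRAM-heavy device: streams fixed-size FFN/MoE weights at on-chip bandwidth
    return f"ffn({attn_out})"

kv_cache, token = hbm_gpu_prefill(["t0", "t1", "t2"])
for _ in range(4):                                           # decode loop
    attn_out = hbm_gpu_decode_attention(token, kv_cache)     # handoff: the token vector crosses to the sidecar...
    token    = sram_decode_ffn(attn_out)                     # ...and the FFN output crosses back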

What Exists in Industry Today

As of May 2026, this is no longer speculative.

Clearest Example
NVIDIA Vera Rubin + Groq 3 LPX

NVIDIA's March 2026 Rubin materials describe Rubin GPUs handling prefill + decode attention while Groq 3 LPX handles latency-sensitive FFN and MoE decode. The cleanest explicit product embodiment of the SRAM-sidecar thesis.

Standalone Archetype
Groq LPU

On-chip SRAM as primary storage. Deterministic, low-latency inference. The purest real-world demonstration of the SRAM route in production. Excellent token speed on small-to-medium models.

Cloud Direction
AWS Trainium + Cerebras

Publicly described disaggregated inference: Trainium for prefill, Cerebras for decode. Not yet the exact decode-FFN-only split — but hyperscaler-grade willingness to assign different stages to different silicon.

Cerebras: promise and tension

Cerebras is the most technically dramatic case. The CS-3 wafer-scale chip carries 44 GB of on-chip SRAM and ~21 PB/s of memory bandwidth — extraordinary numbers. But 44 GB means Cerebras can hold the weights of models up to roughly 20–25B parameters resident on-chip in BF16, or a larger model in INT8. That works brilliantly for small-to-medium model FFN decode; it hits walls for 70B+ models or very long-context attention KV caches.

Best Case for Cerebras

Deeply integrated into a heterogeneous cloud stack: another accelerator handles prefill and decode attention; Cerebras owns decode FFN. Avoids the KV capacity ceiling entirely.

Harder Case

Tries to stand alone as a universal inference replacement and gets pulled into the most difficult economics — large model KV capacity, long-context scaling, whole-system cost justification.

The Best Counterarguments

  1. HBM keeps improving. HBM3e at 4.8 TB/s and upcoming HBM4 will narrow the gap. The SRAM bandwidth advantage shrinks if HBM roadmaps hold. (Counter: HBM4 projected at ~10 TB/s; SRAM can scale too, and 10 TB/s still trails 20+ TB/s SRAM by 2×.)
  2. Integration is hard. Splitting inference across chips introduces scheduling, communication latency, compiler complexity, and software orchestration. The token-vector handoff between GPU and SRAM accelerator consumes a real latency budget.
  3. Economics depend on utilisation. A specialist chip only makes economic sense if operators can keep it busy. Decode FFN is not 100% of serving time — idle time hurts TCO.
  4. Not every model behaves the same. Dense models, MoE models, long-context models, and agentic workloads stress the stack differently. A chip optimised for MoE FFN decode may underperform on dense-attention-heavy workloads.
These objections don't kill the thesis
They simply mean the winning SRAM role is likely to be selective, integrated, and software-dependent rather than universal. A chip that wins one clearly-bounded subproblem and integrates cleanly with the dominant ecosystem is still a large business.

Does Specialisation Shrink the TAM?

Counterintuitively, no. A narrower functional role can mean a larger practical market.

If an SRAM-route company insists on selling a full alternative inference platform, it competes for a relatively small number of premium system decisions. But if it becomes the preferred decode-FFN accelerator inside mainstream GPU, TPU, or cloud-ASIC stacks, it rides the much larger shipment volume of those ecosystems.

The TAM story improves when SRAM stops trying to replace the whole machine and starts becoming the standard high-bandwidth part inside many machines.

Bottom Line

The structural opportunity in AI chips is no longer about one memory technology defeating another in a clean head-to-head replacement battle. It is about workflow decomposition.

HBM-heavy systems remain the general-purpose backbone. SRAM-heavy systems increasingly look like the specialist accelerators that reclaim performance in bandwidth-constrained decode paths — particularly FFN and MoE decode, where arithmetic intensity is near zero and the weight footprint is fixed.

The durable claim
SRAM wins the bandwidth-bound subproblem within a heterogeneous stack. This is a smaller claim than "SRAM is the future of all AI chips," but it is far more durable because it aligns with physics, workload characteristics, and now has explicit product validation from NVIDIA, AWS, Groq, and Cerebras.