The Thesis in One Sentence
SRAM-route AI chips are unlikely to win by doing full-stack LLM inference alone, but they can become strategically powerful if they plug into mainstream GPU, TPU, or cloud-ASIC stacks as specialists for the bandwidth-bound part of decode.
This is a narrower claim than "SRAM is the next universal memory tier for AI." It is also a more credible one — it fits the physics of the hardware, the workload characteristics of transformer inference, and the direction of current industry announcements.
Inference Is Not One Workload
The strongest arguments for heterogeneous inference start with a simple observation: a transformer serving stack does not behave like one monolithic kernel. At the highest level, LLM inference has three distinct stages with very different resource bottlenecks.
- Prefill: The model reads the entire prompt at once. Q_len = N, so the attention matrix is large — many queries against many keys. Very compute-intensive. HBM-heavy GPUs dominate here.
- Decode attention: One token is generated per step, so query length = 1, but the model must attend over the entire KV cache (all prior tokens). The KV cache grows with context and concurrent users — a capacity burden SRAM struggles with.
- Decode FFN: After attention, each token passes through the feed-forward layers. These stream model weights at high bandwidth — but the weight size is fixed (it does not grow with context), making SRAM's bandwidth advantage decisive.
Deep Dive: What Actually Happens in Decode
The "decode phase" is the part of inference most people interact with — it's the token-by-token generation you see streaming out of ChatGPT or Claude. But what the hardware is doing each step is more complex than "generate one word." Here is what happens inside one decode step.
Step 1: Embedding lookup. The previous step produced a token ID (e.g., token 4821 = "the"). This integer is looked up in the embedding table — a matrix of shape [vocab_size × d_model] — to produce a dense float vector.
Step 2: Q/K/V projection. The embedding vector is multiplied by three weight matrices (W_Q, W_K, W_V) to produce Query, Key, and Value vectors for each attention head. These are dense linear projections, but since Q_len = 1 they are matrix-vector products (GEMV), not matrix-matrix products (GEMM). GEMV is almost entirely memory-bandwidth-bound, not compute-bound.
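To make the GEMV point concrete, here is a minimal numpy sketch of one decode-step projection, using the d_model = 8192 figure from the Llama-3-70B discussion below; the random weights and the FP32 dtype are illustrative only.

```python
import numpy as np

# One decode-step projection with q_len = 1: a matrix-vector product (GEMV).
# d_model matches the Llama-3-70B figure used later; values/dtype illustrative.
d_model = 8192
x = np.random.randn(d_model).astype(np.float32)              # one token's hidden state
W_q = np.random.randn(d_model, d_model).astype(np.float32)   # query projection weights

q = x @ W_q                                                   # GEMV, shape (d_model,)

flops = 2 * d_model * d_model                                 # one multiply-add per weight
bytes_moved = W_q.nbytes + x.nbytes + q.nbytes                # weight traffic dominates
print(f"arithmetic intensity ~ {flops / bytes_moved:.2f} FLOPs/byte")
# ~0.5 FLOPs/byte in FP32 (~1 in BF16): far too low to keep matrix units busy,
# so the step is limited by how fast W_q can be streamed from memory.
```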
Step 3: KV cache append. The newly computed Key and Value vectors are appended to the KV cache — the persistent memory store of all prior tokens' K and V. This cache lives in HBM (or DRAM) because it grows with context length. At 128K context with 80 layers, the KV cache alone reaches roughly 50 GB per sequence for a 70B-class model in BF16, before any batching. SRAM simply cannot hold this.
Step 4: Attention scores. For each head, compute dot products between the current Query and every Key in the KV cache. This reads the entire KV cache from HBM — all prior token keys — to produce attention score logits. This is the most memory-bandwidth-intensive step. Longer context → more bytes to read → slower. This is why FlashAttention and FlashDecode exist: to tile this read efficiently.
Step 5: Attention output. Multiply the attention probabilities by the Value cache. This reads the entire V cache from HBM and produces a single vector per head (the attended output). The result is projected back to model dimension via W_O. Total HBM reads in attention: K_cache + V_cache + W_O weights.
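A minimal sketch of steps 4 and 5 for a single head (GQA and the multi-head reshape omitted), assuming a 128K-token cache, makes the linear scaling of bytes read with context visible:

```python
import numpy as np

# Single-head decode attention over a 128K-token cache (GQA and the multi-head
# reshape omitted). Every K and V row is read once per generated token.
def decode_attention(q, K_cache, V_cache):
    scores = K_cache @ q / np.sqrt(q.shape[0])    # reads the whole K cache
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V_cache                         # reads the whole V cache

head_dim, context_len = 128, 131_072               # 128K tokens
q = np.random.randn(head_dim).astype(np.float32)
K = np.random.randn(context_len, head_dim).astype(np.float32)
V = np.random.randn(context_len, head_dim).astype(np.float32)
out = decode_attention(q, K, V)                    # shape (head_dim,)

print(f"bytes read per head per step: {(K.nbytes + V.nbytes) / 1e6:.0f} MB")
# The byte count grows linearly with context; FlashDecode changes how the read
# is tiled across compute units, not how many bytes must come from HBM.
```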
Step 6: Feed-forward network. After attention, the vector passes through the Feed-Forward Network: two (or three, in SwiGLU) linear projections with a nonlinearity. These are again GEMV operations — streaming large weight matrices to multiply against a single activation vector. Unlike the KV cache, the FFN weights are fixed in size — they don't grow with context or batch size. This is the key insight: SRAM's high bandwidth matters, and its limited capacity is not a bottleneck here.
Step 7: Output projection and sampling. The output vector is projected to vocabulary logits via a weight matrix [d_model × vocab_size], scaled by temperature, and sampled (greedy argmax, top-p, or top-k). The resulting token ID becomes the input to the next decode step. The entire loop repeats until EOS or max length.
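A hedged sketch of that final step, with deliberately small illustrative dimensions rather than real model sizes:

```python
import numpy as np

# Final decode step: vocabulary logits, temperature scaling, top-p sampling.
# Dimensions are deliberately tiny so the sketch runs anywhere; real models use
# d_model in the thousands and vocabularies of 32K to 128K+ tokens.
def sample_next_token(hidden, W_lm_head, temperature=0.8, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    logits = (hidden @ W_lm_head) / temperature             # (vocab_size,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalise and sample from it.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))

d_model, vocab_size = 512, 8_000
hidden = np.random.randn(d_model).astype(np.float32)
W_lm_head = np.random.randn(d_model, vocab_size).astype(np.float32) * 0.05
next_token_id = sample_next_token(hidden, W_lm_head)        # feeds the next step
```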
The Decode FFN: Why It's the SRAM Opportunity
Of all the operations in decode, the feed-forward network (FFN) layers are the most SRAM-friendly. To understand why, you need to understand what the FFN does and what its memory footprint looks like.
What the FFN computes
Most modern transformers use a SwiGLU FFN (Llama, Mistral, Gemma, etc.):
gate   = x × W_gate        // [d_model → d_ff]
up     = x × W_up          // [d_model → d_ff]
hidden = SiLU(gate) × up   // elementwise gating
out    = hidden × W_down   // [d_ff → d_model]
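A runnable numpy version of the same block, for a single decode token and a single layer, with dimensions scaled down from the Llama-3-70B figures below (8192 / 28,672) so the sketch runs comfortably anywhere; the weights are random placeholders.

```python
import numpy as np

# Runnable version of the SwiGLU block above for a single decode token.
# Dimensions are scaled down from Llama-3-70B's 8192 / 28672 for convenience.
d_model, d_ff = 1024, 3584

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = x @ W_gate              # [d_model -> d_ff]
    up = x @ W_up                  # [d_model -> d_ff]
    hidden = silu(gate) * up       # elementwise gating
    return hidden @ W_down         # [d_ff -> d_model]

rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, d_ff), dtype=np.float32) * 0.02
W_up   = rng.standard_normal((d_model, d_ff), dtype=np.float32) * 0.02
W_down = rng.standard_normal((d_ff, d_model), dtype=np.float32) * 0.02
x = rng.standard_normal(d_model, dtype=np.float32)

out = swiglu_ffn(x, W_gate, W_up, W_down)   # every weight element is touched exactly once
```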
The memory footprint is fixed — this is the key
The FFN has three weight matrices. For Llama-3-70B (d_model=8192, d_ff=28,672 per layer, 80 layers):
| Matrix | Shape | Size (BF16) | Notes |
|---|---|---|---|
| W_gate | 8192 × 28672 | ~470 MB | Gate projection |
| W_up | 8192 × 28672 | ~470 MB | Up projection |
| W_down | 28672 × 8192 | ~470 MB | Down projection |
| Per-layer total | | ~1.4 GB | Fixed — does not grow with context or batch |
| All 80 layers | | ~112 GB | Full model FFN weight budget |
Each decode step must stream all 1.4 GB of FFN weights per layer through the compute unit. The activation vector (just 8192 floats — 16 KB) is tiny. This is a near-perfect memory-bandwidth-bound problem: minimal compute, maximal memory movement. The computation is essentially: "drag 1.4 GB across the memory bus to multiply it against a 16 KB vector."
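A back-of-envelope way to see what that means for token rate: if every decode step must stream the full FFN weight set, the memory system alone caps tokens per second. The bandwidth numbers below are the illustrative figures used elsewhere in this piece, and the calculation ignores attention traffic, batching, and compute overlap.

```python
# Back-of-envelope decode ceiling from FFN weight streaming alone (batch 1,
# BF16, no overlap with compute, attention traffic ignored). Bandwidth figures
# are the illustrative numbers used elsewhere in this piece.
d_model, d_ff, n_layers = 8192, 28_672, 80
bytes_per_layer = 3 * d_model * d_ff * 2             # W_gate + W_up + W_down in BF16
bytes_per_token = n_layers * bytes_per_layer          # ~112 GB streamed per decode step

for name, bandwidth_bytes_per_s in [("HBM at ~4 TB/s", 4e12),
                                    ("on-chip SRAM at ~20 TB/s", 20e12)]:
    ceiling = bandwidth_bytes_per_s / bytes_per_token
    print(f"{name}: at most ~{ceiling:.0f} tokens/s per model replica from FFN streaming")
```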
The fixed-weight advantage — why attention is harder
Attention also needs high bandwidth, but it has a problem SRAM cannot solve: the KV cache grows linearly with context length and concurrently served batch size. For a 128K context window with batch size 64 on Llama-3-70B:
KV cache (grows with context and batch):
128K tokens × 80 layers × 8 KV heads × 128 dims × 2 bytes × 2 (K+V) × 64 batch
≈ 2.7 TB — cannot fit in any current SRAM

FFN weights (fixed):
80 layers × 3 matrices × 8192 × 28672 × 2 bytes
≈ 112 GB — fixed regardless of context or batch
Cerebras CS-3: 44 GB on-chip. Can hold ~40% of FFN weights resident.
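The same arithmetic can be turned around to ask when the KV cache alone outgrows a 44 GB on-chip budget, using the per-token figures above:

```python
# When does the KV cache alone outgrow a 44 GB on-chip SRAM budget?
# Same per-token figures as the arithmetic above: 80 layers, 8 KV heads,
# head_dim 128, BF16, K and V both cached.
n_layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_bytes_per_token = n_layers * kv_heads * head_dim * dtype_bytes * 2   # ~0.4 MB/token

sram_budget_bytes = 44e9
for batch in (1, 8, 64):
    max_context = sram_budget_bytes / (kv_bytes_per_token * batch)
    print(f"batch {batch:>2}: cache exceeds 44 GB beyond ~{max_context:,.0f} tokens of context")
```

Even a single 128K-context sequence overflows the budget; a realistic serving batch does so after only a few thousand tokens of context.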
MoE Makes the FFN Argument Even Stronger
Mixture-of-Experts (MoE) models like Mixtral 8×7B, DeepSeek-V3, and GPT-4 (speculated) use conditional routing: each token is sent to only 2–4 of N expert FFN blocks, not all of them. This makes the FFN-as-SRAM-target argument substantially stronger.
This is the scenario where SRAM becomes truly decisive: a MoE model with experts small enough to fit on-chip (or at least the hot subset), running decode where each token only routes to 2 of 64 experts. The SRAM chip serves those experts with ~20 TB/s local bandwidth while an HBM-heavy GPU handles attention and any cold-expert loads.
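A rough sketch of that traffic comparison, with illustrative (not model-specific) expert widths, layer count, and hot-subset size:

```python
# Rough per-token FFN traffic: dense FFN vs. a top-2-of-64 MoE layer, plus the
# footprint of an assumed hot-expert subset. Expert width, layer count, and the
# size of the hot subset are illustrative, not a specific model's published config.
d_model, dtype_bytes = 8192, 2
dense_d_ff, expert_d_ff = 28_672, 2_048
top_k = 2

dense_bytes = 3 * d_model * dense_d_ff * dtype_bytes           # every weight streamed
moe_bytes = top_k * 3 * d_model * expert_d_ff * dtype_bytes    # only routed experts streamed
print(f"dense FFN: {dense_bytes / 1e9:.2f} GB per token per layer")
print(f"MoE top-{top_k}: {moe_bytes / 1e9:.3f} GB per token per layer")

# If the frequently routed experts sit resident in SRAM, most decode steps never
# touch HBM for FFN weights at all.
n_layers, hot_experts = 60, 4
hot_bytes = n_layers * hot_experts * 3 * d_model * expert_d_ff * dtype_bytes
print(f"{hot_experts} hot experts/layer over {n_layers} layers: {hot_bytes / 1e9:.1f} GB resident")
```

Under these assumptions the hot subset comes to roughly 24 GB, inside a 44 GB on-chip budget, while each routed token moves about 7× fewer FFN bytes than the dense equivalent.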
The SRAM Tradeoff Is Clear
SRAM-heavy chips make a very explicit bargain with physics and economics. They use expensive on-chip memory to deliver exceptional local bandwidth and low latency — but they do not scale capacity as gracefully as HBM-based systems.
What SRAM buys:
- ~20 TB/s on-chip bandwidth vs ~3–5 TB/s HBM
- Low, deterministic latency (no DRAM refresh, no page miss)
- Strong fit for bandwidth-bound decode paths (FFN, MoE hot experts)
- Excellent TPOT (time per output token) characteristics
- Lower power per GB/s moved vs HBM

What it gives up:
- Capacity is expensive and hard to scale — 44 GB is the ceiling for many designs
- KV cache for long-context is simply too large
- Compute density can be less favorable than GPU for prefill
- Full-stack serving economics are difficult to justify
- Ecosystem maturity vs CUDA/ROCm is a real friction
Stage-by-Stage SRAM Fit
| Inference Stage | Main Bottleneck | SRAM Fit | Why |
|---|---|---|---|
| Prefill | Compute intensity (GEMM) | Weak | SRAM area spent on memory, not compute. Dense GEMM needs FLOPs, not bandwidth. HBM-heavy GPU wins. |
| Decode Attention | Bandwidth + KV cache capacity | Mixed | Bandwidth helps, but KV cache capacity grows with context and batch size beyond any SRAM ceiling. |
| Decode FFN | Weight streaming bandwidth | Strongest ★ | Fixed weight size, high bandwidth requirement, near-zero arithmetic intensity. SRAM bandwidth directly = token speed. |
| MoE FFN (hot experts) | Expert weight bandwidth | Best fit ★★ | Active expert subset fits in SRAM. Hot experts resident on-chip. Cold experts fetched from HBM only when needed. |
Heterogeneous Inference: The New Structural Answer
Heterogeneous inference means splitting serving across multiple hardware classes rather than insisting one platform do everything. The emerging blueprint: an HBM-heavy GPU, TPU, or cloud ASIC owns prefill, decode attention, and the KV cache, while an SRAM-heavy accelerator owns the decode FFN and the MoE hot-expert path.
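A toy, single-process simulation of that split is sketched below. The class names, the device boundary, and the small dimensions are all illustrative rather than any vendor's runtime API; the point is that only a d_model-sized activation vector has to cross between the attention side (which owns the growing KV cache) and the FFN side (whose weights stay resident).

```python
import numpy as np

# Toy single-process simulation of the split. HbmAttention stands in for the
# GPU/ASIC side (it owns the growing KV cache); SramFfn stands in for the SRAM
# sidecar (its weights are fixed and resident). Names and dims are illustrative.
D_MODEL, D_FF, HEAD_DIM = 512, 1792, 64

class HbmAttention:
    def __init__(self, rng):
        self.k_cache, self.v_cache = [], []      # grows every decode step
        self.W_qkv = rng.standard_normal((D_MODEL, 3 * HEAD_DIM), dtype=np.float32) * 0.05
        self.W_o = rng.standard_normal((HEAD_DIM, D_MODEL), dtype=np.float32) * 0.05
    def step(self, x):
        q, k, v = np.split(x @ self.W_qkv, 3)
        self.k_cache.append(k); self.v_cache.append(v)
        K, V = np.stack(self.k_cache), np.stack(self.v_cache)
        s = K @ q / np.sqrt(HEAD_DIM)
        p = np.exp(s - s.max()); p /= p.sum()
        return (p @ V) @ self.W_o

class SramFfn:
    def __init__(self, rng):
        self.W_gate = rng.standard_normal((D_MODEL, D_FF), dtype=np.float32) * 0.05
        self.W_up = rng.standard_normal((D_MODEL, D_FF), dtype=np.float32) * 0.05
        self.W_down = rng.standard_normal((D_FF, D_MODEL), dtype=np.float32) * 0.05
    def step(self, x):
        gate, up = x @ self.W_gate, x @ self.W_up
        return (gate / (1.0 + np.exp(-gate)) * up) @ self.W_down

def decode_layer(x, attn_dev, ffn_dev):
    a = attn_dev.step(x)        # stays on the HBM side with the KV cache
    h = x + a                   # only this d_model vector crosses devices (~16 KB at 8192 dims, BF16)
    return h + ffn_dev.step(h)  # FFN served entirely from resident SRAM weights

rng = np.random.default_rng(0)
attn, ffn = HbmAttention(rng), SramFfn(rng)
x = rng.standard_normal(D_MODEL, dtype=np.float32) * 0.1
for _ in range(4):              # a few decode steps through one layer
    x = decode_layer(x, attn, ffn)
```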
What Exists in Industry Today
As of May 2026, this is no longer speculative.
NVIDIA's March 2026 Rubin materials describe Rubin GPUs handling prefill + decode attention while Groq 3 LPX handles latency-sensitive FFN and MoE decode. The cleanest explicit product embodiment of the SRAM-sidecar thesis.
Groq's standalone LPU deployments already use on-chip SRAM as the primary weight store for deterministic, low-latency inference: a pure real-world demonstration of the SRAM route in production, with excellent token speed on small-to-medium models.
AWS has publicly described disaggregated inference: Trainium for prefill, Cerebras for decode. This is not yet the exact decode-FFN-only split, but it shows hyperscaler-grade willingness to assign different stages to different silicon.
Cerebras: promise and tension
Cerebras is the most technically dramatic case. The CS-3 wafer-scale chip carries 44 GB of on-chip SRAM and ~21 PB/s of memory bandwidth — extraordinary numbers. But 44 GB means Cerebras can hold the weights of models up to roughly 20–25B parameters resident on-chip in BF16, or a larger model in INT8. That works brilliantly for small-to-medium model FFN decode; it hits walls for 70B+ models or very long-context attention KV caches.
The promise: deeply integrated into a heterogeneous cloud stack, where another accelerator handles prefill and decode attention while Cerebras owns decode FFN, it avoids the KV capacity ceiling entirely.
The tension: trying to stand alone as a universal inference replacement pulls it into the most difficult economics — large-model KV capacity, long-context scaling, whole-system cost justification.
The Best Counterarguments
- HBM keeps improving. HBM3e at 4.8 TB/s and upcoming HBM4 will narrow the gap. The SRAM bandwidth advantage shrinks if HBM roadmaps hold. (Counter: HBM4 projected at ~10 TB/s; SRAM can scale too, and 10 TB/s still trails 20+ TB/s SRAM by 2×.)
- Integration is hard. Splitting inference across chips introduces scheduling, communication latency, compiler complexity, and software orchestration. The token vector handoff between GPU and SRAM accelerator is a real latency budget.
- Economics depend on utilisation. A specialist chip only makes economic sense if operators can keep it busy. Decode FFN is not 100% of serving time — idle time hurts TCO.
- Not every model behaves the same. Dense models, MoE models, long-context models, and agentic workloads stress the stack differently. A chip optimised for MoE FFN decode may underperform on dense-attention-heavy workloads.
Does Specialisation Shrink the TAM?
Counterintuitively, no. A narrower functional role can mean a larger practical market.
If an SRAM-route company insists on selling a full alternative inference platform, it competes for a relatively small number of premium system decisions. But if it becomes the preferred decode-FFN accelerator inside mainstream GPU, TPU, or cloud-ASIC stacks, it rides the much larger shipment volume of those ecosystems.
Bottom Line
The structural opportunity in AI chips is no longer about one memory technology defeating another in a clean head-to-head replacement battle. It is about workflow decomposition.
HBM-heavy systems remain the general-purpose backbone. SRAM-heavy systems increasingly look like the specialist accelerators that reclaim performance in bandwidth-constrained decode paths — particularly FFN and MoE decode, where arithmetic intensity is near zero and the weight footprint is fixed.
Selected Sources
- NVIDIA technical blog: Groq 3 LPX for the Vera Rubin platform
- NVIDIA LPX product page
- NVIDIA Newsroom: Vera Rubin platform, March 2026
- Groq: Inside the LPU
- AWS + Cerebras inference announcement
- AWS Neuron: disaggregated inference documentation
- Cerebras inference overview
- Neocortex CS-3 system specifications
- Marvell: custom SRAM for next-gen AI silicon