MANISH AI
Deep Dive · Systems ML · 2025–2026

The Next Inference Bottleneck Is Not FLOPs.
It Is the Shape of the Decode Loop.

Why the real opportunity isn't another SRAM chip — it's co-designing models, runtimes, and GPU kernels to use the memory hierarchy you already have.

June 2026 ~18 min read FP4 · MoE · SWA · Speculative Decoding · Persistent Kernels

Every few months, someone asks a fair question: why do we need yet another SRAM hardware idea for inference? The answer is: we may not. The more interesting thesis is that modern inference needs to use the SRAM already inside GPUs and accelerators far more intelligently — through a tight co-design of model architecture, quantization, runtime scheduling, and GPU kernels.

00

The Real Bottleneck Why FLOPs are the wrong metric

Once inference moves toward real-time token rates, the bottleneck stops being raw compute. What takes over is data movement, orchestration overhead, decode serialization, KV movement, and tiny idle gaps between GPU work. The GPU may be fast, but the system around it keeps creating bubbles.

SRAM is already inside the GPU in the form of fast on-chip structures: the register file, the shared memory and L1 cache inside each SM, and the L2 cache shared across SMs. The goal isn't necessarily to add a new SRAM chip next to the GPU. The goal is to make the inference loop use existing on-chip SRAM better.

Core Thesis

SRAM is the GPU's fast local workspace. The real win is keeping the hot inference loop close to compute, instead of constantly moving data through slower memory and relaunching work from the CPU. That changes the framing from a hardware question to a model + quantization + runtime + GPU kernel question.

The issue isn't whether we have enough FLOPs. The issue is whether the hot inference loop is tiled, quantized, sparse, parallel, and resident close to compute.

The Five Layers of Modern Inference Optimization
L1
FP4 / NVFP4 Quantization
Shrink weight bandwidth — fewer bits moved per tile
Bandwidth
L2
Mixture of Experts (MoE)
Reduce active parameters per token — sparse activation
Sparsity
L3
Sliding Window Attention (SWA)
Bound KV cache growth — controlled memory footprint
Memory
L4
Parallel / Block Decoding
Widen the decode loop — draft, verify, commit multiple tokens
Throughput
L5
GPU-Resident Persistent Kernels
Keep the hot loop on-device — eliminate CPU orchestration gaps
Latency

L1

FP4 / NVFP4 Quantization Layer 1 — Shrink Weight Bandwidth

Large models are not just compute-heavy — they are movement-heavy. Every layer has to move weight tiles through the memory hierarchy: from HBM, through cache, into the compute-side local staging path where tensor cores can consume them. FP4 and NVFP4-style quantization reduce the number of bits moved per weight and activation tile, directly reducing pressure on the memory hierarchy.

NVIDIA's Blackwell architecture introduced native FP4 support in its 5th-generation Tensor Cores. NVFP4 — NVIDIA's custom FP4 format — uses block-wise micro-scaling: values are grouped into blocks of 16, each sharing a high-precision FP8 (E4M3) scaling factor, plus an additional per-tensor FP32 scale. This two-level scheme preserves dynamic range and reduces quantization error while keeping the data compact.
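The block-wise scaling idea can be sketched in a few lines of plain Python. This is a simplified illustration, not NVIDIA's implementation: the scale is kept as an ordinary Python float rather than rounded to FP8 (E4M3), the per-tensor FP32 scale is omitted, and the function names are made up for this example. It shows how an outlier only degrades the block it lives in.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # values that share one scale, as described above


def quantize_block(block):
    """Return (scale, codes); codes are signed values from the E2M1 grid."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # Simplification: the scale is a Python float, not an FP8 E4M3 number.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes


def dequantize_block(scale, codes):
    return [c * scale for c in codes]


def quantize_tensor(values):
    """Quantize a flat list whose length is a multiple of BLOCK."""
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]


# Two blocks of small weights; only the first one contains an outlier (50.0).
weights = [0.01 * i for i in range(15)] + [50.0] + [0.01 * i for i in range(16)]
blocks = quantize_tensor(weights)
restored = [x for scale, codes in blocks for x in dequantize_block(scale, codes)]

print("block scales:", [round(scale, 4) for scale, _ in blocks])
print("max error, block with outlier:",
      max(abs(r - w) for r, w in zip(restored[:16], weights[:16])))
print("max error, clean block:",
      max(abs(r - w) for r, w in zip(restored[16:], weights[16:])))
```

In the first block the outlier forces a large scale and the small weights collapse to zero; the second block keeps its own small scale and stays accurate. With a single per-tensor scale, both blocks would have been flattened.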

4.5
Effective bits per value (4-bit value + FP8 scale overhead)
2–3×
Arithmetic throughput boost vs FP8 on Blackwell
<1%
Accuracy degradation on DeepSeek-R1-0528 (PTQ, AIME 2024)

NVIDIA TensorRT Model Optimizer and LLM Compressor both offer streamlined workflows for quantizing models to NVFP4 via post-training quantization (PTQ). As of early 2025, NVFP4-quantized versions of DeepSeek-R1, Llama 3.3 70B, and Llama 3.1 405B are available on Hugging Face.

Key Insight

FP4/NVFP4 is not just a model compression trick. It is a bandwidth-shaping trick. It reduces pressure on the memory hierarchy and makes it easier to keep the GPU fed. For a 1T-class model, this is essential — without aggressive compression, the amount of weight movement per decode step is brutal.
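A back-of-the-envelope sketch makes the bandwidth argument concrete. It assumes a dense model whose weights are read once per decode step at batch size 1, ignores KV cache and activation traffic, and uses a hypothetical 8 TB/s memory system; the function names are illustrative only.

```python
def weight_bytes_per_step(params, bits_per_value):
    """Bytes of weights streamed from memory for one decode step."""
    return params * bits_per_value / 8


def max_tokens_per_second(params, bits_per_value, bytes_per_second):
    """Upper bound on batch-1 decode rate if weight traffic is the only cost."""
    return bytes_per_second / weight_bytes_per_step(params, bits_per_value)


# 4-bit value plus one 8-bit scale shared by 16 values.
NVFP4_BITS = 4 + 8 / 16

PARAMS = 1e12                  # a 1T-parameter model, treated as dense
HBM_BYTES_PER_SECOND = 8e12    # hypothetical 8 TB/s, an assumption for illustration

for name, bits in [("FP16", 16), ("FP8", 8), ("NVFP4", NVFP4_BITS)]:
    gigabytes = weight_bytes_per_step(PARAMS, bits) / 1e9
    rate = max_tokens_per_second(PARAMS, bits, HBM_BYTES_PER_SECOND)
    print(f"{name:6s} {gigabytes:8.1f} GB per step, at most {rate:5.1f} tokens/s")
```

Under these assumptions the same memory system goes from a ceiling of 4 tokens/s at FP16 to roughly 14 tokens/s at NVFP4, before any sparsity or batching is applied.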

Research into Quantization-Aware Distillation (QAD) for NVFP4 has also emerged to recover accuracy in smaller models where PTQ alone shows larger degradation. Approaches like RaZeR (Redundant Zero Remapping) and ARCQuant further push NVFP4 accuracy by optimizing the quantization grid itself. The format's fine-grained block-wise isolation prevents high-magnitude outliers from inflating scaling factors across entire tensors — a key advantage over coarse-grained integer quantization.


L2

Mixture of Experts Layer 2 — Reduce Active Parameters Per Token

A trillion-parameter model doesn't have to activate all trillion parameters for every token. That's the whole power of MoE. The model can preserve large total capacity while routing each token through only a subset of experts — changing the problem from "run the whole trillion-parameter model per token" to "route each token through the right small subset of total capacity."

The concrete scale of this efficiency gain is striking: DeepSeek-V3 has 671 billion total parameters but activates only 37 billion per token, and its authors report results competitive with leading closed models at a fraction of the compute cost. Mixtral-8x7B has 46.7 billion total parameters but only about 13 billion active per token; Mistral reports that it matches or outperforms Llama 2 70B on most benchmarks with roughly six times faster inference.
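A minimal sketch of top-k routing shows where the savings come from: only the selected experts run at all. This is a generic router written for illustration (softmax over the selected logits, toy scalar "experts"); real models differ in gating function, load balancing, and shared experts.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda e: gate_logits[e], reverse=True)[:k]
    peak = max(gate_logits[e] for e in top)
    exps = [math.exp(gate_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k):
    """Weighted sum of the selected experts' outputs; the others never execute."""
    return sum(w * experts[e](x) for e, w in top_k_route(gate_logits, k))


# Four toy experts that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
y = moe_forward(10.0, experts, logits, k=2)
active_fraction = 37 / 671  # DeepSeek-V3 figures quoted above

print(routes)
print(y)
print(f"active fraction: {active_fraction:.1%}")
```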

60%+
of open-source AI model releases in 2025 used MoE architecture
37B
Active params per token in DeepSeek-V3 (vs 671B total)
10×
Performance per-GPU for MoE inference on NVIDIA GB200 NVL72 vs H200

Since early 2025, nearly all leading frontier models use MoE designs — DeepSeek-R1, Kimi K2, Mistral Large 3, Llama 4, and Qwen3 all rely on sparse activation. The architecture has moved decisively from research prototype to production backbone.

The MoE Runtime Challenge

MoE reduces active compute and active memory movement per token — but it also increases the need for smarter runtime scheduling. Now the system must route tokens, load or stage expert tiles, handle irregularity, and avoid letting expert routing destroy locality. vLLM and TensorRT-LLM have both added native MoE optimizations, and expert parallelism across GPUs is becoming standard for frontier deployment.

The memory constraint is real: full DeepSeek-R1 requires roughly 13,719 GB/s of memory bandwidth at scale — achievable only on data-center systems. At batch size 1, requirements drop to ~1,040 GB/s. This is why inference-optimal MoE scaling (balancing expert count against serving efficiency) is an active research area — models with fewer, larger experts can be more serving-efficient even if they require more training compute.


L3

Sliding Window Attention Layer 3 — Bound KV Cache Growth

In classic full attention, the KV cache grows with context length: O(N) memory complexity means that as the context becomes longer, both the storage and movement costs of KV data keep increasing. Sliding-window attention changes this to O(C) — constant rather than growing — by restricting each token's attention to a fixed local window of size C.

This matters enormously during decode. The model might be quantized, the active expert set might be sparse, but if the KV path keeps growing without bound, the memory system still gets crushed at long context. SWA is the layer that says: do not let context length turn every decode step into an ever-growing memory problem.

// Standard attention: O(N) KV cache, grows with every token
// At 128K context: ~40 GB of FP16 KV cache for a 70B model with Llama-style GQA
//   (80 layers, 8 KV heads, head_dim 128)

// Sliding window attention: O(window_size) KV cache, constant
// At window=4096: bounded, predictable memory footprint

// Gemma 3 approach: interleave local + global layers
local_layers  → sliding window (sw=1024), tiny KV cache
global_layers → full attention, 1 per N local layers
Result: 60% → <15% KV memory overhead at 32K context

Gemma 3's technical report shows that a 1:3 global-to-local ratio with a sliding window of 1024 tokens reduces KV cache memory overhead from 60% (global-only baseline) to under 15% at 32K context. Mistral-7B pioneered this approach with a 4096-token window, storing only a rolling buffer of recent KV pairs rather than the full history.
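The rolling buffer is simple enough to sketch directly: position i is written to slot i mod window, so old entries are overwritten in place and the cache never grows. The class below is an illustrative toy with made-up names, storing placeholder tuples instead of tensors.

```python
class RollingKVCache:
    """Fixed-size KV store that keeps only the most recent `window` positions."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.length = 0  # total number of tokens seen so far

    def append(self, kv):
        # Position i lives in slot i % window, overwriting the oldest entry.
        self.slots[self.length % self.window] = kv
        self.length += 1

    def visible(self):
        """Cached entries in oldest-to-newest order."""
        start = max(0, self.length - self.window)
        return [self.slots[pos % self.window] for pos in range(start, self.length)]


cache = RollingKVCache(window=4)
for t in range(10):
    cache.append((f"k{t}", f"v{t}"))

print([k for k, _ in cache.visible()])
```

After ten tokens the cache still holds four slots, containing positions 6 through 9.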

The dominant practical pattern in 2025 is hybrid attention: most layers use local/sliding-window attention for efficiency, while a smaller subset of layers use global attention to maintain long-range coherence. This lets models have large theoretical context windows while keeping actual KV memory manageable during decode.
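To see why the hybrid pattern helps, here is a rough KV sizing sketch. The model shape is hypothetical (48 layers, 8 KV heads, head_dim 128, FP16 cache, 5 local layers per global layer) and is not meant to reproduce any published figure.

```python
def kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size; the factor of 2 covers one key and one value vector."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


def hybrid_kv_bytes(context, window, local_layers, global_layers,
                    kv_heads, head_dim, bytes_per_value=2):
    """Local layers cache at most `window` tokens; global layers cache everything."""
    local = kv_bytes(min(context, window), local_layers, kv_heads, head_dim, bytes_per_value)
    full = kv_bytes(context, global_layers, kv_heads, head_dim, bytes_per_value)
    return local + full


CONTEXT = 32 * 1024
full = kv_bytes(CONTEXT, layers=48, kv_heads=8, head_dim=128)
hybrid = hybrid_kv_bytes(CONTEXT, window=1024, local_layers=40, global_layers=8,
                         kv_heads=8, head_dim=128)

print(f"all-global: {full / 2**30:.2f} GiB")
print(f"hybrid 5:1: {hybrid / 2**30:.2f} GiB ({hybrid / full:.0%} of all-global)")
```

In this configuration nearly all remaining KV memory belongs to the few global layers, which is why the local-to-global ratio matters more than the window size once the window is small.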

SWA in practice

SWA alone doesn't replace global attention — long-range dependencies still need a path. The engineering insight is that most layers don't need full context most of the time. Restricting attention locally where possible, globally where necessary, makes the KV working set bounded and predictable — a prerequisite for sustainable high-throughput decode.


L4

Parallel Token / Block Decoding Layer 4 — The Most Important Layer for Throughput

Classic autoregressive decoding is painfully serial: generate token 1, then token 2, then token 3. Each step depends on the previous. This creates a narrow workload. Modern GPUs are enormous parallel machines — if the runtime only gives the GPU one thin token path at a time, compute sits underutilized even when the model is large.

Parallel token or block decoding makes decoding wider. Instead of producing exactly one next token per step, the runtime creates a larger unit of token work. Approaches include speculative decoding, multi-token prediction (MTP), blockwise decoding, parallel candidate branches, and diffusion-style token blocks.

Speculative Decoding

The core idea: a smaller, faster draft model cheaply proposes several future tokens; a larger target model then verifies them all in a single forward pass. Every drafted token the target accepts is committed, so one target pass can yield several tokens for roughly the latency of one. With the standard acceptance rule, the output distribution is identical to the target model's own (token-for-token identical under greedy decoding), which makes this a lossless speedup.
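The draft, verify, commit cycle can be sketched with toy models. This is the greedy variant only (sampling needs a probabilistic acceptance rule), and both "models" are made-up deterministic functions; in a real system the verification loop is one batched forward pass of the target.

```python
def target_next(tokens):
    """Toy 'large model': deterministic next token for a given context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Toy 'draft model': agrees with the target except on every third position."""
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


def greedy_decode(prompt, n):
    out = list(prompt)
    while len(out) < len(prompt) + n:
        out.append(target_next(out))
    return out


def speculative_decode(prompt, n, gamma=4):
    out, target_passes = list(prompt), 0
    goal = len(prompt) + n
    while len(out) < goal:
        # 1) Draft gamma tokens autoregressively with the cheap model.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(out + draft))
        # 2) Verify: one target pass scores every drafted position
        #    (simulated here by evaluating each prefix in turn).
        target_passes += 1
        accepted = []
        for i in range(gamma):
            expected = target_next(out + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(draft[i])
        else:
            accepted.append(target_next(out + draft))  # bonus token, all drafts matched
        # 3) Commit the accepted tokens.
        out.extend(accepted)
    return out[:goal], target_passes


baseline = greedy_decode([1, 2, 3], 20)
spec, passes = speculative_decode([1, 2, 3], 20)
print(spec == baseline, passes)
```

The speculative output matches plain greedy decoding exactly, while the number of target passes is well below the 20 that one-token-at-a-time decoding would need.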

Speculative decoding has achieved 2–3× speedups in production settings. A paper published at ICLR 2026 demonstrated up to 2.37× speedup even at batch size 256, challenging the conventional wisdom that speculative decoding only works in small-batch settings. The key finding: memory bandwidth, not compute, remains the dominant bottleneck even in large-batch inference.
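Where such speedups come from can be estimated with the standard analysis of speculative decoding. The sketch below assumes each drafted token is accepted independently with probability alpha, that gamma tokens are drafted per step, and that one draft step costs a fraction `draft_cost` of a target pass; the example values are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens committed per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, draft_cost):
    """Wall-clock speedup when each draft step costs `draft_cost` target passes."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1)


for alpha in (0.6, 0.8, 0.9):
    tokens = expected_tokens_per_step(alpha, gamma=4)
    speedup = expected_speedup(alpha, gamma=4, draft_cost=0.05)
    print(f"alpha={alpha}: {tokens:.2f} tokens/step, about {speedup:.2f}x")
```

The acceptance rate dominates: at alpha = 0.8 with four drafted tokens, each target pass commits about 3.4 tokens on average.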

Multi-Token Prediction (MTP) — Gemma 4's Approach

On May 5, 2026, Google released MTP drafter models for the entire Gemma 4 family: the 31B dense flagship, the 26B A4B Mixture-of-Experts model, and the on-device E2B/E4B edge models. Using speculative decoding powered by MTP, the drafters deliver up to 3× faster inference with zero degradation in output quality. The Gemma 4 MTP drafter is not independent — it shares the input embedding table with the target model and builds directly on its last-layer activations.

Gemma 4 MTP drafter speedup (dense variants) — zero quality regression
2.2×
Gemma 4 26B MoE drafter on Apple Silicon at batch sizes 4–8
2.37×
Speculative decoding speedup at batch size 256 (ICLR 2026 paper)

MoE models present an interesting nuance: because different tokens may activate different experts, verifying drafted tokens can require loading additional expert weights from memory, partially offsetting drafting gains at batch size 1. At larger batch sizes, there is more overlap in activated experts across sequences, improving reuse.
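A small probability sketch shows the size of this effect. It assumes, unrealistically, that every token picks its top-k experts uniformly and independently; real routers are correlated, so treat the result as a rough upper-end estimate. The 256-expert, top-8 layer is an illustrative configuration.

```python
def expected_unique_experts(num_experts, top_k, tokens):
    """Expected distinct experts touched in one MoE layer by `tokens` tokens."""
    p_miss_one_token = 1 - top_k / num_experts
    return num_experts * (1 - p_miss_one_token ** tokens)


NUM_EXPERTS, TOP_K = 256, 8  # illustrative layer shape

for tokens in (1, 5, 64, 1000):
    touched = expected_unique_experts(NUM_EXPERTS, TOP_K, tokens)
    print(f"{tokens:5d} tokens -> about {touched:6.1f} experts loaded")
```

Verifying five tokens (four drafts plus the current one) touches roughly 38 experts instead of 8, so at batch size 1 the verification pass can read several times more expert weights than a plain decode step. Once enough tokens are in flight the count saturates at the full expert set and additional tokens reuse weights that are already loaded.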

Why This Layer Matters Most

FP4 reduces weight bandwidth. MoE reduces active parameters. SWA bounds KV growth. But parallel token/block decoding creates enough simultaneous token work to chase very high throughput. Without this layer, a persistent kernel just makes a serial loop more efficient. With this layer, a persistent kernel has a real pipeline to run: draft tokens, route experts, run attention, verify candidates, commit tokens, update KV/state, prefetch the next tiles.

The field is rapidly developing variants: Medusa adds multiple lightweight decoding heads on top of the last hidden state; EAGLE builds on both the last hidden state and preceding tokens for autoregressive drafting; FastMTP uses a shared-weight MTP head with language-aware vocabulary compression. The meta-pattern is clear: do not only optimize one-token-at-a-time decoding; make decoding wider.


L5

GPU-Resident Persistent Mega-Kernels Layer 5 — Remove Orchestration Gaps

A normal inference runtime involves repeated CPU-to-GPU orchestration: launch kernel → wait → sync → launch next kernel → wait → update state → launch again. At very high token rates, these gaps compound. The GPU may be fast, but the system around it keeps creating small stalls.

A GPU-resident persistent mega-kernel tries to keep the hot loop on the device. Instead of the CPU orchestrating every decode step, the GPU owns more of the loop continuously:

// CPU-orchestrated inference (old model)
cpu: launch kernel → wait → sync → launch kernel → wait → ...
     ↑ small bubbles between every step at high token rates

// GPU-resident persistent kernel (new model)
gpu: load next tile
   → dequantize (FP4)
   → route experts (MoE)
   → run attention (SWA)
   → verify draft tokens (MTP/Spec)
   → commit accepted tokens
   → update KV/state
   → prefetch next block
   → [loop without CPU round-trip]

This matters because the goal is not just high average throughput — it's avoiding tiny stalls in the inner inference loop. Once the model is quantized, sparse, and block-parallel, the runtime must not keep bouncing through CPU orchestration on every decode step.
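A toy cost model shows how quickly those gaps add up. Every number here is hypothetical (2 ms of GPU work per decode step, 300 kernel launches per step, 5 µs of gap per launch), and the model ignores overlap between launch and execution; it is a sketch of the argument, not a measurement.

```python
def tokens_per_second(compute_us, launches_per_step, gap_us, tokens_per_step=1):
    """Decode rate when every launch adds a fixed idle gap to the step."""
    step_us = compute_us + launches_per_step * gap_us
    return tokens_per_step * 1e6 / step_us


COMPUTE_US = 2000   # hypothetical GPU work per decode step
GAP_US = 5          # hypothetical idle gap per kernel launch

orchestrated = tokens_per_second(COMPUTE_US, launches_per_step=300, gap_us=GAP_US)
persistent = tokens_per_second(COMPUTE_US, launches_per_step=0, gap_us=GAP_US)
wide = tokens_per_second(COMPUTE_US, launches_per_step=0, gap_us=GAP_US, tokens_per_step=3)

print(f"CPU-orchestrated:   {orchestrated:6.1f} tokens/s")
print(f"persistent kernel:  {persistent:6.1f} tokens/s")
print(f"persistent + 3 tokens committed per step: {wide:6.1f} tokens/s")
```

With these made-up numbers, launch gaps cost over 40% of the achievable rate, and the benefit multiplies once each step commits more than one token.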

FlashInfer — The Current State of the Art

FlashInfer, which won best paper at MLSys 2025, is a library and kernel generator for LLM serving that provides high-performance GPU kernels for attention, GEMM, and MoE operations. NVIDIA is now actively releasing its most performant TensorRT-LLM kernels into FlashInfer for easy integration into vLLM, SGLang, and custom inference engines.

FlashInfer's JIT system builds on CUDA/CUTLASS templates derived from FlashAttention-2/3, injecting user-defined transformations directly into the templates before compilation to produce fused attention kernels specialized for the specific attention variant and KV-cache layout. After the initial compilation, the cached kernels launch with microsecond-scale latency. Its load-balanced scheduling decouples the plan and run stages, alleviating the load imbalance that arises from variable-length inputs in heterogeneous batches.
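The plan/run split can be illustrated with a generic scheduling sketch. To be clear, this is not FlashInfer's API or its actual algorithm: it is a hypothetical greedy scheduler that splits each sequence's KV range into chunks during a "plan" stage and hands each chunk to the least-loaded worker, with "run" standing in for kernel execution.

```python
import heapq


def plan(seq_lens, num_workers, chunk):
    """Plan stage: chunk every sequence, then assign chunks to the least-loaded worker."""
    work = []
    for seq, n in enumerate(seq_lens):
        for start in range(0, n, chunk):
            work.append((seq, start, min(start + chunk, n)))
    # Largest chunks first, so small leftovers even out the tail.
    work.sort(key=lambda item: item[2] - item[1], reverse=True)

    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    assignment = [[] for _ in range(num_workers)]
    for item in work:
        load, w = heapq.heappop(heap)
        assignment[w].append(item)
        heapq.heappush(heap, (load + item[2] - item[1], w))
    return assignment


def run(assignment):
    """Run stage stand-in: report how many KV tokens each worker processes."""
    return [sum(end - start for _, start, end in items) for items in assignment]


# A heterogeneous batch: one very long sequence next to short ones.
seq_lens = [8000, 300, 120, 4000]
loads = run(plan(seq_lens, num_workers=4, chunk=512))
print(loads)
```

Without chunking, whichever worker received the 8000-token sequence would finish long after the others; with a separate planning step, per-worker load differs by less than one chunk.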

The Synthesis

Persistent kernel execution is where all the other layers pay off. A kernel that stays GPU-resident can pipeline weight dequantization, expert routing, attention over bounded windows, multi-token draft verification, and KV cache updates — all without round-tripping to the CPU. The on-chip SRAM becomes what it should be: the fast local workspace that keeps the inference pipeline close to compute.


06

The Better Framing Deterministic, Tiled, Sparse, Quantized, Parallel, GPU-Resident

When someone asks "why build yet another SRAM hardware system for inference?" the answer is:

We may not need another chip. We need better use of the chips we already have.

The path to extremely high token throughput is not only about adding memory — it's about changing the shape of inference. Each layer of the stack addresses a different bottleneck, and they compound:

How the Layers Stack — What Each Removes
L1
FP4 / NVFP4
Removes: weight bandwidth pressure. Moves roughly 1.8× fewer bits per tile than FP8 (4.5 vs 8 effective bits) and about 3.6× fewer than FP16 on Blackwell hardware.
Bandwidth ↓
L2
Mixture of Experts
Removes: dense activation cost. Activates roughly 5% of total parameters per token in frontier models such as DeepSeek-V3 (37B of 671B).
Active Params ↓
L3
Sliding Window Attention
Removes: unbounded KV cache growth. Constant O(window) memory in sliding-window layers regardless of context length.
KV Memory ↓
L4
Parallel Decoding / MTP
Removes: serial decode bottleneck. 2–3× tokens per elapsed second via speculative commit.
Tokens/s ↑
L5
Persistent GPU Kernels
Removes: CPU orchestration latency. The inner loop never leaves the device.
Stalls ↓

SRAM in this framing is not necessarily a new chip. It is the on-chip staging layer that makes the loop fast when the runtime is designed correctly. The real thesis is not "SRAM as a separate hardware religion." The real thesis is deterministic, tiled, sparse, quantized, parallel, GPU-resident inference.

The Takeaway

Make weights smaller. Make active parameters sparse. Make attention bounded. Make decoding parallel. Make the hot loop GPU-resident. When those five things are true simultaneously, SRAM becomes the fast local workspace it was designed to be, and the inference pipeline can run much closer to the theoretical limit of the hardware.

References & Sources
01
Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
NVIDIA Developer Blog · Alvarez et al. · 2025
developer.nvidia.com/blog/introducing-nvfp4...
02
NVIDIA Model Optimizer — NVFP4 PTQ Workflows for TensorRT-LLM & vLLM
GitHub · NVIDIA · 2025
github.com/NVIDIA/Model-Optimizer
03
Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
NVIDIA Research · arXiv:2601.20088 · 2026
arxiv.org/pdf/2601.20088
04
RaZeR: Pushing the Limits of NVFP4 Quantization with Redundant Zero Remapping
arXiv:2501.04052 · 2025
arxiv.org/pdf/2501.04052
05
ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs
arXiv:2601.07475 · 2026
arxiv.org/pdf/2601.07475
06
DeepSeek-V3 Technical Report
DeepSeek-AI · arXiv:2412.19437 · 2024
arxiv.org/abs/2412.19437
07
Mixture of Experts Powers the Most Intelligent Frontier AI Models — 10x Faster on Blackwell NVL72
NVIDIA Developer Blog · March 2026
blogs.nvidia.com/blog/mixture-of-experts-frontier-models/
08
Toward Inference-Optimal Mixture-of-Expert Large Language Models
arXiv:2404.02852 · 2024
arxiv.org/pdf/2404.02852
09
Gemma 3 Technical Report — Sliding Window Attention & KV Cache Analysis
Google DeepMind · arXiv:2503.19786 · 2025
arxiv.org/pdf/2503.19786
10
You Only Cache Once (YOCO) — Sliding-Window Attention O(C) KV Complexity
arXiv:2405.05254 · 2024
arxiv.org/pdf/2405.05254
11
Accelerating Gemma 4: Faster Inference with Multi-Token Prediction Drafters
Google DeepMind Blog · May 5, 2026
blog.google/...multi-token-prediction-gemma-4/
12
Speed-up Gemma 4 with Multi-Token Prediction — Official Documentation
Google AI for Developers · 2026
ai.google.dev/gemma/docs/mtp/overview
13
Better & Faster Large Language Models via Multi-Token Prediction
Gloeckle et al. (Meta FAIR) · arXiv:2404.19737 · 2024
arxiv.org/pdf/2404.19737
14
Rethinking High-Throughput LLM Inference: An Opportunity for Speculative Decoding
ICLR 2026 · 2.37× speedup at batch size 256
openreview.net/forum?id=59OJOgKLzN
15
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Ye et al. · Best Paper MLSys 2025 · arXiv:2501.01005
arxiv.org/abs/2501.01005
16
Run High-Performance LLM Inference Kernels from NVIDIA Using FlashInfer
NVIDIA Technical Blog · June 2025
developer.nvidia.com/blog/...flashinfer/
17
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
arXiv:2509.18362 · 2025
arxiv.org/pdf/2509.18362
18
NVIDIA Blackwell: The Impact of NVFP4 For LLM Inference
Edge AI and Vision Alliance · October 2025
edge-ai-vision.com/...nvfp4-for-llm-inference/
19
Mixture of Experts Infrastructure — Scaling Sparse Models Guide
Introl Blog · Updated February 2026
introl.com/blog/mixture-of-experts-moe-infrastructure...
20
MoE-Inference-Bench: Performance Evaluation of Mixture of Expert LLMs & VLMs
arXiv:2508.17467 · August 2025
arxiv.org/html/2508.17467v1