LLM inference has a memory problem, not a compute problem. The arithmetic intensity of generating a single token — the ratio of floating-point operations to bytes of memory accessed — is approximately one FLOP per byte. No modern accelerator is remotely compute-bound at that intensity. Every GPU, TPU, and NPU doing decode today is waiting for memory, not crunching numbers.
The root cause is structural. Contemporary LLMs have weight footprints of 100 GB to 700 GB. The on-chip SRAM available on a single accelerator die is 10–100 MB. The gap is three to four orders of magnitude. At each generation step, the model needs the same complete set of weight tensors — and because they don't fit on-chip, they get fetched from HBM, used once, evicted, and fetched again at the next step. For a 512-token output sequence, every weight tensor is loaded 512 times.
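The back-of-envelope arithmetic behind this claim is worth making concrete. A minimal sketch with illustrative numbers (the 140 GB FP16 footprint and 3.3 TB/s HBM figure are assumptions for this example, not the reference configuration used later):

```python
# Illustrative assumption: a 140 GB weight footprint (roughly 70B params
# at FP16) decoded on an accelerator with 3.3 TB/s of HBM bandwidth.
weights_gb = 140.0
hbm_tb_s = 3.3

# Per decode step, every weight byte must cross the HBM interface at
# least once, so memory traffic alone bounds the step time and token
# rate. Unit shortcut: GB / (TB/s) = milliseconds.
step_time_ms = weights_gb / hbm_tb_s
tokens_per_s_ceiling = 1000.0 / step_time_ms

print(round(step_time_ms, 1))          # → 42.4 ms floor per token
print(round(tokens_per_s_ceiling, 1))  # → 23.6 tokens/s, ignoring compute
```

No amount of additional compute raises this ceiling; only reducing the bytes that must cross the HBM interface does.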
SRMIC-X1 is a proposed accelerator architecture designed specifically around this reality. The central idea: instead of treating HBM as the primary memory tier for decode, interpose a distributed on-package SRAM layer (HRM — Hot Residency Memory) between HBM and the compute engines, connected by a purpose-built high-bandwidth fabric (SRMESH) that exceeds HBM bandwidth. This repository contains an analytical simulator that models the decode-step latency impact of that architecture across different model sizes and HRM capacity provisioning points.
"The question isn't whether SRAM is faster than HBM. It always is. The question is whether you can architect a system where the hot working set of LLM decode fits in SRAM — and whether the fabric connecting compute to that SRAM is fast enough to matter."
The Architecture
Three-Tier Memory Stack
SRMIC-X1 organises memory into three explicit tiers, each with a distinct role in the decode pipeline:

- HRM (Hot Residency Memory): 2 GB of on-package SRAM in the reference configuration (16 regions of 128 MB), fed by the 48 TB/s SRMESH fabric. Holds the hot per-token weight working set.
- HBM: 24 TB/s aggregate across 8 stacks. Holds the full weight footprint and services HRM misses.
- CXL pooled memory: a 64 GB/s warm tier for cold pages and capacity overflow.
The critical design choice is the bandwidth ordering: SRMESH (48 TB/s) intentionally exceeds HBM (24 TB/s) by 2×. This is not an engineering optimisation — it is a structural necessity. If SRMESH were slower than HBM, every HRM hit would be slower than a direct HBM access, and the architecture would be strictly worse than the baseline. The invariant SRMESH_BW ≥ HBM_BW is a hard architectural constraint, not a tunable parameter.
The HRM Region and the RIC
Each HRM region is 128 MB of SRAM, independently serviceable, with a local Residency Interconnect Controller (RIC) — a hardware state machine that handles promote, demote, pin, and multicast page operations without involving the host CPU on the decode critical path. There are 16 regions (4 chiplets × 4 regions per chiplet), giving 2 GB of total HRM capacity in the reference configuration.
The per-page metadata is what elevates the HRM above a simple cache. Every page carries a hotness counter, pin status, source tier indicator, tenant domain, format state, and demotion-legal flag. This gives the RIC the information it needs to make semantic residency decisions — not just passive LRU replacement — without any software involvement on the hot path. Thermal feedback from per-bank sensors also feeds into the RIC, allowing the architecture to degrade gracefully under thermal pressure rather than stalling.
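The metadata fields listed above can be sketched as a record type. The field names, types, and defaults here are illustrative assumptions, not the spec's actual encoding:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HRM = 0
    HBM = 1
    CXL = 2

@dataclass
class HrmPageMeta:
    """Per-page RIC metadata (illustrative field names and types)."""
    hotness: int = 0               # saturating access-frequency counter
    pinned: bool = False           # pin status: page must not be evicted
    source_tier: Tier = Tier.HBM   # tier the page was promoted from
    tenant: int = 0                # tenant domain for isolation decisions
    format_state: int = 0          # e.g. packed vs. unpacked weight format
    demotion_legal: bool = True    # RIC may demote this page only if True

    def evictable(self) -> bool:
        # Replacement considers only unpinned, demotion-legal pages,
        # which is what makes residency decisions semantic, not just LRU.
        return (not self.pinned) and self.demotion_legal
```

A pinned page reports `evictable() == False` no matter how cold its hotness counter runs, which is the behavioural core of the pin operation.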
The SRMESH Fabric
SRMESH is a three-level on-package interconnect designed specifically for decode-critical traffic patterns. Its key properties:

- 48 TB/s of aggregate bandwidth, deliberately 2× HBM, satisfying the SRMESH_BW ≥ HBM_BW invariant.
- 3 TB/s of bandwidth per HRM region (48 TB/s spread across 16 regions).
- Five virtual channels with hardware-enforced separation between traffic classes.
The five virtual channels allow decode-critical traffic to be separated from cold-page fills, KV cache moves, speculative prefetches, and management traffic. Decode-critical reads never wait behind a bulk cold-page fill — the fabric enforces this at the hardware level.
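One way to picture the channel separation (the channel names follow the traffic classes above; the strict-priority scheme and numeric values are assumptions for illustration):

```python
from enum import IntEnum

class SrmeshVC(IntEnum):
    """The five SRMESH virtual channels. The numeric priorities are an
    illustrative assumption: lower value wins arbitration."""
    DECODE_CRITICAL = 0       # latency-sensitive reads on the hot path
    KV_CACHE_MOVE = 1
    SPECULATIVE_PREFETCH = 2
    COLD_PAGE_FILL = 3        # bulk fills never block decode-critical reads
    MANAGEMENT = 4

def arbitrate(pending):
    # Strict-priority arbitration: serve the highest-priority pending VC.
    return min(pending)

winner = arbitrate({SrmeshVC.COLD_PAGE_FILL, SrmeshVC.DECODE_CRITICAL})
print(winner.name)  # → DECODE_CRITICAL
```

Strict priority is the simplest policy that delivers the guarantee stated above; real fabrics typically add aging or weighted arbitration to avoid starving the lower channels.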
The Analytical Model
v20: The Bounded Working-Set Correction
Earlier simulator versions (v10–v19) contained a fundamental bandwidth accounting error: HRM traffic was modelled as flowing serially through a single aggregate pipe, causing HRM latency to grow with occupancy. This produced non-monotonic and physically incorrect speedup curves. v20 corrects this with a key insight about how LLM decode actually works.
The per-region working set is fixed by model architecture, not by HRM occupancy. A 70B INT8 model with a 15% active weight fraction per decode step has a 10.5 GB active set. Across 16 HRM regions, each region's share is fixed at ~656 MB — regardless of how much of the total HRM is filled. This means:
```python
# Per-region working set: fixed by model, not by HRM occupancy
active_bytes_per_region = active_weights / hrm_regions

# HRM hit ratio controls what fraction is served from fast SRAM
hrm_bytes_per_region = active_bytes_per_region * hrm_hit_ratio
hbm_bytes = active_weights * (1.0 - hrm_hit_ratio)

# Regions operate in PARALLEL — time is per-region, not aggregate
hrm_time = (hrm_bytes_per_region / hrm_bw_per_region) + fabric_overhead
hbm_time = hbm_bytes / hbm_bw_aggregate
cxl_time = cxl_bytes / cxl_bw

# Decode step time: slowest tier determines latency (critical path)
mem_time = max(hrm_time, hbm_time, cxl_time, kv_time)

# Speedup over HBM-only baseline
speedup = hbm_only_time / mem_time
```
The crucial fix: higher HRM hit ratio reduces hbm_bytes without inflating hrm_time, because each region's per-region load is invariant. This produces physically correct monotonic improvement up to the bottleneck crossover point — the point where HBM and HRM times equalise and adding more HRM capacity yields diminishing returns.
The bounded working-set invariant is the architectural property that makes SRMIC work. Because the per-token active weight set assigned to each HRM region is determined by model weight partitioning and is invariant with respect to total HRM occupancy, the relation T_HRM = W_active/BW_SRMESH holds cleanly at the region level. Over-filling HRM regions doesn't increase their per-region service time: the speedup therefore climbs monotonically up to the crossover and then settles to a stable floor, instead of collapsing below baseline.
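Given that invariant, the crossover has a closed form: set the HBM time equal to the per-region HRM time and solve for the hit ratio. A sketch using the reference P1 numbers, with bandwidths in GB/ms so the times come out in milliseconds:

```python
def crossover_capacity_gb(active, hbm_bw=24.0, srmesh_bw=48.0, fabric=0.003):
    """HRM capacity (GB) at which HBM and HRM service times equalise.

    Solves active*(1-h)/hbm_bw == active*h/srmesh_bw + fabric for the
    hit ratio h (the per-region factors of 1/16 cancel out).
    """
    h_star = (active / hbm_bw - fabric) / (active / hbm_bw + active / srmesh_bw)
    return active * h_star

# 70B INT8 class: 10.5 GB active set
print(round(crossover_capacity_gb(10.5), 2))  # → 6.95
```

For the 70B active set this lands just under 7 GB, consistent with the starred peak row in the 70B study table.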
Hardware Parameters (Reference Config P1)
| Parameter | Value | Basis |
|---|---|---|
| HBM aggregate bandwidth | 24,000 GB/s | 8 stacks × 3 TB/s (HBM3e class) |
| SRMESH aggregate bandwidth | 48,000 GB/s | On-package SRAM fabric — no PHY overhead, short wires, 2× HBM |
| HRM regions | 16 | 4 chiplets × 4 regions per chiplet |
| HRM capacity per region | 128 MB SRAM | → 2 GB total HRM |
| SRMESH BW per region | 3,000 GB/s | 48,000 / 16 regions |
| Tensor inference clusters | 128 total | 4 TICs per HRM region |
| CXL bandwidth | 64 GB/s | CXL Gen5 ×16 — warm pooled tier |
| Fixed fabric overhead | 0.003 ms | Cross-region hop latency (model assumption) |
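The table above can be captured as a small config object, with the derived values and the hard invariant checked in code (the dictionary keys are my naming, not the simulator's):

```python
# Reference configuration P1 as a config object (key names are illustrative).
P1 = {
    "hbm_bw_gb_s":    24_000,  # 8 stacks x 3 TB/s (HBM3e class)
    "srmesh_bw_gb_s": 48_000,  # on-package SRAM fabric, no PHY overhead
    "hrm_regions":    16,      # 4 chiplets x 4 regions per chiplet
    "hrm_region_mb":  128,
    "tics_total":     128,     # 4 TICs per HRM region
    "cxl_bw_gb_s":    64,      # CXL Gen5 x16, warm pooled tier
    "fabric_ms":      0.003,   # fixed cross-region hop latency
}

# Hard architectural invariant, not a tunable parameter: if SRMESH were
# slower than HBM, every HRM hit would lose to a direct HBM access.
assert P1["srmesh_bw_gb_s"] >= P1["hbm_bw_gb_s"]

# Derived values used throughout the tables
bw_per_region_gb_s = P1["srmesh_bw_gb_s"] / P1["hrm_regions"]   # 3,000 GB/s
total_hrm_gb = P1["hrm_regions"] * P1["hrm_region_mb"] / 1024   # 2 GB
```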
Simulation Results
7B Model Class — Active Set ≈ 2.45 GB (35% of 7 GB)
For 7B INT8 class models, the active weight set per decode step is approximately 2.45 GB. The 7B study sweeps HRM capacity from 0.125 GB to 2 GB and tracks latency, speedup, and the active bottleneck at each point.
| HRM Size | HBM-only (ms) | SRMIC (ms) | Speedup | Bottleneck |
|---|---|---|---|---|
| 0.125 GB | 0.102 | 0.097 | 1.05× | HBM |
| 0.25 GB | 0.102 | 0.092 | 1.11× | HBM |
| 0.5 GB | 0.102 | 0.081 | 1.26× | HBM |
| 0.75 GB | 0.102 | 0.071 | 1.44× | HBM |
| 1.0 GB ★ | 0.102 | 0.062 | 1.63× | CXL ← plateau begins |
| 1.5 GB+ | 0.102 | 0.062 | 1.63× | CXL (floor) |
The 7B curve saturates cleanly at 1 GB of HRM. Beyond that, the CXL tier becomes the bottleneck floor and adding more SRAM yields no further improvement. The optimal provisioning point for 7B class models at this configuration is exactly 1 GB — which is ~41% of the active weight set.
70B Model Class — Active Set ≈ 10.5 GB (15% of 70 GB)
The 70B study is where the architecture's behaviour becomes analytically interesting. The larger active set means the bottleneck crossover happens at a higher HRM capacity, and the curve reveals both a peak and a stable settling floor.
| HRM Size | HBM-only (ms) | SRMIC (ms) | Speedup | Bottleneck |
|---|---|---|---|---|
| 1 GB | 0.438 | 0.396 | 1.11× | HBM |
| 2 GB | 0.438 | 0.354 | 1.24× | HBM |
| 3 GB | 0.438 | 0.312 | 1.40× | HBM |
| 4 GB | 0.438 | 0.271 | 1.62× | HBM |
| 5 GB | 0.438 | 0.229 | 1.91× | HBM |
| 6 GB | 0.438 | 0.188 | 2.33× | HBM |
| 7 GB ★ | 0.438 | 0.149 | 2.94× | HRM ← peak crossover |
| 8 GB | 0.438 | 0.170 | 2.58× | HRM |
| 9 GB | 0.438 | 0.191 | 2.30× | HRM |
| 10 GB | 0.438 | 0.211 | 2.07× | HRM |
| 11+ GB | 0.438 | 0.222 | 1.97× | HRM (floor) |
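The starred peak and the floor row follow directly from the v20 equations with the P1 parameters. A sketch that keeps only the HBM and HRM terms (the CXL and KV terms never bind anywhere in this 70B sweep, per the bottleneck column):

```python
active = 10.5          # GB: 70B INT8 active set (15% of 70 GB)
regions, hbm_bw, srmesh_bw, fabric = 16, 24.0, 48.0, 0.003  # GB/ms, ms

def step_time_ms(hrm_gb):
    h = min(hrm_gb / active, 1.0)                 # hit ratio set by capacity
    hbm_time = active * (1.0 - h) / hbm_bw        # aggregate HBM pipe
    hrm_time = (active / regions) * h / (srmesh_bw / regions) + fabric
    return max(hbm_time, hrm_time)                # slowest tier wins

baseline = active / hbm_bw                        # 0.4375 ms, HBM-only
print(round(step_time_ms(7.0), 3))                # → 0.149 (the peak row)
print(round(baseline / step_time_ms(11.0), 2))    # → 1.97 (the floor)
```

At 7 GB the two tier times are within the fabric overhead of each other; past that, the capacity-forced hit ratio keeps shifting load onto the (slower per-region) HRM path until it saturates at the 1.97× floor.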
Key Findings and Architectural Implications
Taken together, the two studies support a few observations, all visible in the tables above:
- Decode is memory-bound at every configuration measured; the only question is which tier binds.
- The SRMESH_BW ≥ HBM_BW invariant is load-bearing: without it, every HRM hit would cost more than the HBM access it replaces.
- Provisioning targets are model-class specific: ~1 GB of HRM saturates the 7B class (the CXL tier becomes the floor), while the 70B class peaks at 7 GB, where HBM and HRM times equalise.
- Past the crossover, speedup settles to a stable floor rather than collapsing, a direct consequence of the bounded per-region working set.
The Bottleneck Crossover — Visualised
Speedup curves and latency plots for both sweeps, with the crossover points marked, are generated by the studies under studies/7b/ and studies/70b/ in the repository.
What This Simulator Doesn't Model
The README is transparent about the analytical model's scope, and that transparency is worth preserving here: this is a first-order latency model, not a cycle-accurate or trace-driven simulator. The following are explicitly out of scope:
| Not modelled | Expected impact direction | Future work |
|---|---|---|
| SRAM bank conflicts | Would reduce effective HRM bandwidth | Bank conflict model with access pattern distribution |
| Fabric arbitration contention | Would increase hrm_time at high utilisation | Contention exponent; multi-tenant load model |
| Thermal throttling | Would reduce sustained bandwidth over time | Thermal derating curve per workload duration |
| Multi-tenant HRM pressure | Would reduce effective hit ratio | Tenant quota model with fair-share scheduling |
| KV cache growth at long context | Would shift bottleneck tier at high context lengths | Context-length sweep study |
| Prefetch effectiveness | Would improve effective hit ratio | Prefetch model with speculative fetch accuracy |
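As one illustration of how the fabric-contention row might be closed out (the derating form and the exponent are pure assumptions, not part of the current model):

```python
# Hypothetical sketch of the "contention exponent" future-work item:
# derate effective per-region HRM bandwidth as fabric utilisation rises.
def hrm_time_with_contention(bytes_gb, bw_gb_ms, util, fabric=0.003, k=2.0):
    """util in [0, 1); k is a tunable contention exponent (assumed form)."""
    effective_bw = bw_gb_ms * (1.0 - util ** k)
    return bytes_gb / effective_bw + fabric

# At low utilisation the ideal v20 model is recovered; at high utilisation
# hrm_time inflates, which would pull the crossover to lower HRM capacities.
print(hrm_time_with_contention(0.4375, 3.0, util=0.1))
print(hrm_time_with_contention(0.4375, 3.0, util=0.9))
```

Any such extension would make hrm_time occupancy-dependent again, so it would need to be validated carefully against the bounded working-set invariant rather than reintroducing the pre-v20 serial-pipe error.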
The findings are presented as architectural hypotheses requiring validation against trace-driven simulation and eventually silicon measurement — not as production benchmarks. The value of this simulator is in identifying the right design space and the right provisioning targets to explore, not in predicting exact silicon numbers.
Relationship to the Wired Residency Primitive
SRMIC-X1 is the concrete accelerator architecture within which the hardware enforcement primitive described in the companion patent becomes physically instantiable. The RIC's pin operations are one possible implementation surface for the BIND instruction; the per-page hardware metadata provides the residency tag fields; the SRMESH fabric provides the bandwidth substrate that makes the HRM tier viable as a primary decode tier rather than just a prefetch buffer.
The two are complementary but independently motivated. SRMIC-X1 specifies the architecture. The wired residency primitive specifies the contractual interface between the runtime and the hardware — the mechanism by which a runtime can declare "this page must not leave HRM" and have that declaration enforced unconditionally by the RIC's replacement logic.
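A toy sketch of that contract (the class and method names are hypothetical, not the spec's interface): the point is that the pin is enforced unconditionally by replacement, not merely deprioritised.

```python
class HrmRegionModel:
    """Toy model of one HRM region's replacement state (illustrative only)."""
    def __init__(self):
        self.pages = {}  # page_id -> {"pinned": bool, "hot": int}

    def insert(self, page_id):
        self.pages[page_id] = {"pinned": False, "hot": 0}

    def bind(self, page_id):
        # The runtime's declaration: "this page must not leave HRM".
        self.pages[page_id]["pinned"] = True

    def victim(self):
        # Replacement logic: pinned pages are never eviction candidates,
        # unconditionally -- not merely ranked last.
        candidates = [p for p, m in self.pages.items() if not m["pinned"]]
        if not candidates:
            raise RuntimeError("all pages pinned; promotion must be refused")
        return min(candidates, key=lambda p: self.pages[p]["hot"])

region = HrmRegionModel()
for pid in ("w0", "w1", "w2"):
    region.insert(pid)
region.bind("w0")
print(region.victim())  # never "w0", regardless of hotness
```

The failure mode in the all-pinned branch is the interesting design question: a real RIC must decide whether to refuse the promotion, spill to HBM, or fault to the runtime.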
Full source code, simulation studies for both 7B and 70B model classes, architecture whitepaper, formal spec, and chiplet concept document available at github.com/manishklach/SRMIC_X1. The core analytical model is in core/decode_core.py. Study results including speedup curves and latency plots are in studies/7b/ and studies/70b/.