
SRMIC-X1: A Residency-First Architecture for LLM Decode Acceleration

What happens when you redesign the memory hierarchy around the actual access patterns of autoregressive inference — replacing HBM as the primary decode tier with distributed on-package SRAM, connected by a purpose-built fabric with 2× the bandwidth?

Manish KL · April 2026 · github.com/manishklach/SRMIC_X1

LLM inference has a memory problem, not a compute problem. The arithmetic intensity of generating a single token — the ratio of floating-point operations to bytes of memory accessed — is approximately one FLOP per byte. No modern accelerator is remotely compute-bound at that intensity. Every GPU, TPU, and NPU doing decode today is waiting for memory, not crunching numbers.

The root cause is structural. Contemporary LLMs have weight footprints of 100 GB to 700 GB. The on-chip SRAM available on a single accelerator die is 10–100 MB. The gap is three to four orders of magnitude. At each generation step, the model needs the same complete set of weight tensors — and because they don't fit on-chip, they get fetched from HBM, used once, evicted, and fetched again at the next step. For a 512-token output sequence, every weight tensor is loaded 512 times.
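To put a number on that refetch cost, a quick back-of-envelope sketch (the 70 GB weight footprint and single 3 TB/s HBM stack are illustrative assumptions, not SRMIC-X1 parameters):

```python
# Worked example with illustrative numbers: every decode step re-streams
# the full weight set from HBM, so traffic scales with output length.
weights_gb = 70.0      # assumed INT8 weight footprint (illustrative)
tokens = 512           # output sequence length from the text
hbm_bw_gbs = 3_000.0   # one HBM3e stack, ~3 TB/s (illustrative)

total_traffic_tb = weights_gb * tokens / 1_000           # GB -> TB moved
weight_stream_time_s = weights_gb * tokens / hbm_bw_gbs  # time spent only on weights

print(f"{total_traffic_tb:.1f} TB moved, {weight_stream_time_s:.1f} s of pure weight streaming")
# prints: 35.8 TB moved, 11.9 s of pure weight streaming
```

Under these assumptions, a 512-token generation moves nearly 36 TB of weight traffic that a residency tier could have served once.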

SRMIC-X1 is a proposed accelerator architecture designed specifically around this reality. The central idea: instead of treating HBM as the primary memory tier for decode, interpose a distributed on-package SRAM layer (HRM — Hot Residency Memory) between HBM and the compute engines, connected by a purpose-built high-bandwidth fabric (SRMESH) that exceeds HBM bandwidth. This repository contains an analytical simulator that models the decode-step latency impact of that architecture across different model sizes and HRM capacity provisioning points.

"The question isn't whether SRAM is faster than HBM. It always is. The question is whether you can architect a system where the hot working set of LLM decode fits in SRAM — and whether the fabric connecting compute to that SRAM is fast enough to matter."

The Architecture

Three-Tier Memory Stack

SRMIC-X1 organises memory into three explicit tiers, each with a distinct role in the decode pipeline:

SRMIC-X1 Memory Hierarchy — Full Stack
TENSOR INFERENCE CLUSTERS (TIC ×128): compute-bound in arithmetic, memory-bound in practice; 4 TICs per HRM region
SRMESH FABRIC: 48,000 GB/s aggregate; 3-level hierarchy; 5 virtual channels; decode-critical priority
HRM TIER: 16 regions × 128 MB SRAM = 2 GB total; each region independently serviceable with its own RIC; regions operate in parallel at 3 TB/s each
HBM COLD TIER: 24,000 GB/s aggregate (8 stacks × 3 TB/s HBM3e); cold weights, overflow KV state, page staging; services the miss / cold-page path
CXL WARM TIER (optional): 64 GB/s over CXL Gen5 ×16; pooled warm expansion with bounded page-window semantics and hardware-enforced access

The critical design choice is the bandwidth ordering: SRMESH (48 TB/s) intentionally exceeds HBM (24 TB/s) by 2×. This is not an engineering optimisation — it is a structural necessity. If SRMESH were slower than HBM, every HRM hit would be slower than a direct HBM access, and the architecture would be strictly worse than the baseline. The invariant SRMESH_BW ≥ HBM_BW is a hard architectural constraint, not a tunable parameter.

The HRM Region and the RIC

Each HRM region is 128 MB of SRAM, independently serviceable, with a local Residency Interconnect Controller (RIC) — a hardware state machine that handles promote, demote, pin, and multicast page operations without involving the host CPU on the decode critical path. There are 16 regions (4 chiplets × 4 regions per chiplet), giving 2 GB of total HRM capacity in the reference configuration.

HRM Region Internal Structure
SRMESH PORT: 3 TB/s ingress
RIC (Residency Interconnect Controller): promote / demote, pin / unpin, multicast; no host CPU on the hot path
SRAM BANK ARRAY: 128 MB, multi-bank (B0 … Bn)
TIC PORT: 4 TICs per region
PER-PAGE HARDWARE METADATA: page_id, hotness_counter, pin_status, source_tier_indicator, tenant_domain, format_state, demotion_legal, version_epoch
HBM MISS PATH: fill / evict / spill
THERMAL SENSOR: per-bank, feeds the RIC

The per-page metadata is what elevates the HRM above a simple cache. Every page carries a hotness counter, pin status, source tier indicator, tenant domain, format state, and demotion-legal flag. This gives the RIC the information it needs to make semantic residency decisions — not just passive LRU replacement — without any software involvement on the hot path. Thermal feedback from per-bank sensors also feeds into the RIC, allowing the architecture to degrade gracefully under thermal pressure rather than stalling.
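One way to picture the per-page record is as a small struct driving the RIC's replacement decisions. A minimal Python sketch, where the field names come from the diagram but the types, the tier encoding, and the eviction-candidate logic are our assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class Tier(Enum):
    # Assumed encoding for source_tier_indicator
    HRM = 0
    HBM = 1
    CXL = 2

@dataclass
class PageMeta:
    # Field names follow the per-page metadata in the region diagram;
    # widths and defaults here are illustrative assumptions.
    page_id: int
    hotness_counter: int = 0
    pin_status: bool = False       # pinned pages are never eviction candidates
    source_tier: Tier = Tier.HBM
    tenant_domain: int = 0
    format_state: int = 0          # e.g. quantisation/layout variant
    demotion_legal: bool = True
    version_epoch: int = 0

    def eviction_candidate(self) -> bool:
        # A page may be demoted only if it is unpinned AND demotion is legal.
        return (not self.pin_status) and self.demotion_legal

def pick_victim(pages: List[PageMeta]) -> Optional[PageMeta]:
    # RIC-style semantic replacement (our sketch, not the spec):
    # among legal candidates, demote the coldest page first.
    candidates = [p for p in pages if p.eviction_candidate()]
    return min(candidates, key=lambda p: p.hotness_counter, default=None)
```

The point of the metadata is visible in `pick_victim`: a colder page that is pinned is skipped entirely, which a plain LRU cache could not express.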

The SRMESH Fabric

SRMESH is a three-level on-package interconnect designed specifically for decode-critical traffic patterns. Its key properties:

48,000 GB/s aggregate bandwidth (2× HBM)
3 hierarchy levels: intra-region, inter-chiplet, global
5 virtual channel classes with decode-critical priority enforcement
0.003 ms fixed cross-region hop overhead

The five virtual channels allow decode-critical traffic to be separated from cold-page fills, KV cache moves, speculative prefetches, and management traffic. Decode-critical reads never wait behind a bulk cold-page fill — the fabric enforces this at the hardware level.
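A strict-priority arbiter of this kind is easy to sketch. The five channel names follow the text; the numeric priorities below decode-critical and the queueing discipline itself are assumptions:

```python
import heapq

# Lower number = higher priority. Channel names from the text; the
# relative ordering below decode_critical is an assumption.
PRIORITY = {
    "decode_critical": 0,
    "kv_cache_move": 1,
    "speculative_prefetch": 2,
    "cold_page_fill": 3,
    "management": 4,
}

class VCArbiter:
    """Strict-priority arbiter: a request is granted only when no
    higher-priority virtual channel has work pending."""
    def __init__(self):
        self._q = []
        self._seq = 0  # FIFO tie-break within a channel

    def enqueue(self, vc: str, req: str) -> None:
        heapq.heappush(self._q, (PRIORITY[vc], self._seq, req))
        self._seq += 1

    def grant(self) -> str:
        return heapq.heappop(self._q)[2]

arb = VCArbiter()
arb.enqueue("cold_page_fill", "fill-A")
arb.enqueue("decode_critical", "read-w0")
arb.enqueue("cold_page_fill", "fill-B")
arb.enqueue("decode_critical", "read-w1")
order = [arb.grant() for _ in range(4)]
# Decode-critical reads drain before any bulk cold-page fill.
```

Even with the fills enqueued first, both decode-critical reads are granted ahead of them, which is the property the fabric enforces in hardware.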

The Analytical Model

v20: The Bounded Working-Set Correction

Earlier simulator versions (v10–v19) contained a fundamental bandwidth accounting error: HRM traffic was modelled as flowing serially through a single aggregate pipe, causing HRM latency to grow with occupancy. This produced non-monotonic, physically incorrect speedup curves. v20 corrects this with a key insight about how LLM decode actually works.

The per-region working set is fixed by model architecture, not by HRM occupancy. A 70B INT8 model with a 15% active weight fraction per decode step has a 10.5 GB active set. Across 16 HRM regions, each region's share is fixed at ~656 MB — regardless of how much of the total HRM is filled. This means:

Core latency model — v20
# Per-region working set: fixed by model, not by HRM occupancy
active_bytes_per_region = active_weights / hrm_regions

# HRM hit ratio controls what fraction is served from fast SRAM
hrm_bytes_per_region    = active_bytes_per_region * hrm_hit_ratio
hbm_bytes               = active_weights * (1.0 - hrm_hit_ratio)

# Regions operate in PARALLEL — time is per-region, not aggregate
hrm_time = (hrm_bytes_per_region / hrm_bw_per_region) + fabric_overhead
hbm_time = hbm_bytes / hbm_bw_aggregate
cxl_time = cxl_bytes / cxl_bw

# Decode step time: slowest tier determines latency (critical path)
mem_time = max(hrm_time, hbm_time, cxl_time, kv_time)

# Speedup over HBM-only baseline (full active set streamed from HBM)
hbm_only_time = active_weights / hbm_bw_aggregate
speedup       = hbm_only_time / mem_time

The crucial fix: higher HRM hit ratio reduces hbm_bytes without inflating hrm_time, because each region's per-region load is invariant. This produces physically correct monotonic improvement up to the bottleneck crossover point — the point where HBM and HRM times equalise and adding more HRM capacity yields diminishing returns.
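As a sanity check, the v20 equations can be evaluated directly against the published 70B table. A minimal runnable sketch with the P1 parameters, where the KV and CXL terms are omitted on the assumption that they are not binding at these points:

```python
def decode_step(active_gb, hit_ratio, regions=16,
                hrm_bw_per_region=3_000.0, hbm_bw=24_000.0,
                fabric_overhead_ms=0.003):
    """v20 latency model, times in ms. KV/CXL terms omitted (assumed
    non-binding for the 70B points reproduced below)."""
    per_region = active_gb / regions  # fixed by the model, not by occupancy
    hrm_ms = (per_region * hit_ratio / hrm_bw_per_region) * 1e3 + fabric_overhead_ms
    hbm_ms = (active_gb * (1.0 - hit_ratio) / hbm_bw) * 1e3
    mem_ms = max(hrm_ms, hbm_ms)            # slowest tier is the critical path
    baseline_ms = active_gb / hbm_bw * 1e3  # HBM-only: stream the full active set
    return mem_ms, baseline_ms / mem_ms

# 70B class: 10.5 GB active set, 7 GB of HRM -> hit ratio 7/10.5
ms, speedup = decode_step(10.5, 7.0 / 10.5)
# ms ≈ 0.149, speedup ≈ 2.94 — matches the 7 GB ★ row

ms_full, speedup_full = decode_step(10.5, 1.0)
# speedup_full ≈ 1.97 — the HRM-bound floor at 11+ GB
```

At the 7 GB point the two terms nearly equalise (hrm ≈ 0.1488 ms vs hbm ≈ 0.1458 ms), which is exactly the crossover the tables mark with ★.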

Key Insight

The bounded working-set invariant is the architectural property that makes SRMIC work. Because the per-token active weight set assigned to each HRM region is determined by model weight partitioning and is invariant with respect to total HRM occupancy, the math T_HRM = W_active/BW_SRMESH holds cleanly. Over-filling HRM regions doesn't increase their per-region service time — and that's what enables the speedup to be monotonic rather than peaking and collapsing.

Hardware Parameters (Reference Config P1)

| Parameter | Value | Basis |
|---|---|---|
| HBM aggregate bandwidth | 24,000 GB/s | 8 stacks × 3 TB/s (HBM3e class) |
| SRMESH aggregate bandwidth | 48,000 GB/s | On-package SRAM fabric: no PHY overhead, short wires, 2× HBM |
| HRM regions | 16 | 4 chiplets × 4 regions per chiplet |
| HRM capacity per region | 128 MB SRAM | 2 GB total HRM |
| SRMESH BW per region | 3,000 GB/s | 48,000 / 16 regions |
| Tensor inference clusters | 128 total | 4 TICs per HRM region |
| CXL bandwidth | 64 GB/s | CXL Gen5 ×16, warm pooled tier |
| Fixed fabric overhead | 0.003 ms | Cross-region hop latency (model assumption) |

Simulation Results

7B Model Class — Active Set ≈ 2.45 GB (35% of 7 GB)

For 7B INT8 class models, the active weight set per decode step is approximately 2.45 GB. The 7B study sweeps HRM capacity from 125 MB to 2 GB+ and tracks latency, speedup, and the active bottleneck at each point.

| HRM Size | HBM-only (ms) | SRMIC (ms) | Speedup | Bottleneck |
|---|---|---|---|---|
| 0.125 GB | 0.102 | 0.097 | 1.05× | HBM |
| 0.25 GB | 0.102 | 0.092 | 1.11× | HBM |
| 0.5 GB | 0.102 | 0.081 | 1.26× | HBM |
| 0.75 GB | 0.102 | 0.071 | 1.44× | HBM |
| 1.0 GB ★ | 0.102 | 0.062 | 1.63× | CXL ← plateau begins |
| 1.5 GB+ | 0.102 | 0.062 | 1.63× | CXL (floor) |
7B Model — Speedup vs HRM Capacity
Speedup rises from the 1.0× baseline to a 1.63× peak at the 1.0 GB optimal point, then holds flat at the CXL floor through 1.5 GB+.

The 7B curve saturates cleanly at 1 GB of HRM. Beyond that, the CXL tier becomes the bottleneck floor and adding more SRAM yields no further improvement. The optimal provisioning point for 7B class models at this configuration is exactly 1 GB — which is ~41% of the active weight set.
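The CXL-floor claim is consistent with the two tier times the model does expose. A sketch evaluating the HBM and HRM terms at the 1 GB point (the CXL byte count is not given in the text, so only the known terms are computed):

```python
# 7B class at the 1 GB optimal point, P1 parameters from the config table.
active_gb, hbm_bw, hrm_bw_region, regions = 2.45, 24_000.0, 3_000.0, 16
hit = 1.0 / active_gb  # 1 GB of HRM coverage -> ~41% hit ratio

hbm_ms = active_gb * (1 - hit) / hbm_bw * 1e3
hrm_ms = (active_gb / regions) * hit / hrm_bw_region * 1e3 + 0.003

# Both known terms fall below the 0.062 ms plateau in the table
# (hbm ≈ 0.060 ms, hrm ≈ 0.024 ms), consistent with the CXL tier
# being the reported bottleneck at and beyond this point.
```

Since neither the HBM nor the HRM term can explain the 0.062 ms floor, the residual must come from the CXL tier, matching the table's bottleneck column.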

70B Model Class — Active Set ≈ 10.5 GB (15% of 70 GB)

The 70B study is where the architecture's behaviour becomes analytically interesting. The larger active set means the bottleneck crossover happens at a higher HRM capacity, and the curve reveals both a peak and a stable settling floor.

| HRM Size | HBM-only (ms) | SRMIC (ms) | Speedup | Bottleneck |
|---|---|---|---|---|
| 1 GB | 0.438 | 0.396 | 1.11× | HBM |
| 2 GB | 0.438 | 0.354 | 1.24× | HBM |
| 3 GB | 0.438 | 0.312 | 1.40× | HBM |
| 4 GB | 0.438 | 0.271 | 1.62× | HBM |
| 5 GB | 0.438 | 0.229 | 1.91× | HBM |
| 6 GB | 0.438 | 0.188 | 2.33× | HBM |
| 7 GB ★ | 0.438 | 0.149 | 2.94× | HRM ← peak crossover |
| 8 GB | 0.438 | 0.170 | 2.58× | HRM |
| 9 GB | 0.438 | 0.191 | 2.30× | HRM |
| 10 GB | 0.438 | 0.211 | 2.07× | HRM |
| 11+ GB | 0.438 | 0.222 | 1.97× | HRM (floor) |
70B Model — Speedup vs HRM Capacity · Crossover Analysis
Speedup climbs through the HBM-bound phase to a 2.94× peak at 7 GB (the crossover where hrm_t ≈ hbm_t), then settles through the HRM-bound phase to a 1.97× floor at 11+ GB.

Key Findings and Architectural Implications

Finding 01
SRMIC outperforms HBM-only at every provisioning point
At every HRM capacity tested — from 125 MB to full active-set coverage — SRMIC is at minimum 1.05× faster than HBM-only for both 7B and 70B class models. There is no HRM size at which the architecture performs worse than baseline. This holds even in the HRM-bound settling phase.
Finding 02
Optimal provisioning is at the HBM/HRM crossover
The natural provisioning target is the HRM capacity where HBM time and HRM time equalise: 7 GB for 70B class at P1 config. Beyond this point, additional HRM capacity shifts remaining traffic to slower per-region SRAM paths rather than displacing HBM traffic. Over-provisioning is wasteful; under-provisioning leaves speedup on the table.
Finding 03
SRMESH/HBM bandwidth ratio governs theoretical ceiling
At saturation, peak speedup approaches SRMESH_BW / HBM_BW. At P1's 2× ratio (48 TB/s vs 24 TB/s), the theoretical maximum at full-set coverage is 2.0×, and the simulated 1.97× floor is consistent with that bound once fabric overhead is included. The minimum viable ratio for any speedup at all is ~1.01×; the 2× ratio in P1 provides headroom for contention and thermal effects not modelled here.
Finding 04
The bottleneck crossover is physics, not a modelling artifact
The 70B peak-then-settle curve reflects real tiered memory behaviour. HBM-bound phase: each GB of HRM displaces HBM traffic, latency drops sharply. HRM-bound phase: HBM traffic eliminated, adding more HRM shifts residual traffic to per-region paths. The floor is still 1.97× — SRMIC never loses to HBM-only baseline.
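Finding 03's ceiling follows directly from the model: at full HRM coverage every active byte moves over SRMESH instead of HBM, so the limit is the bandwidth ratio, independent of working-set size. A sketch, ignoring the fabric overhead that pulls the observed floor from 2.0× down to 1.97×:

```python
def ceiling(srmesh_bw: float, hbm_bw: float) -> float:
    # Full coverage, zero overhead: T_base = W / BW_HBM and
    # T_srmic = W / BW_SRMESH, so the speedup limit is
    # BW_SRMESH / BW_HBM regardless of the working-set size W.
    return srmesh_bw / hbm_bw

assert ceiling(48_000, 24_000) == 2.0   # P1 configuration: 2.0× ceiling
assert ceiling(24_240, 24_000) > 1.0    # a ~1.01× ratio still yields some speedup
assert ceiling(24_000, 24_000) == 1.0   # equal bandwidth: no speedup possible
```

The working-set term W cancels in the ratio, which is why the ceiling depends only on the fabric provisioning and not on the model size.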

The Bottleneck Crossover — Visualised

Bottleneck Phase Transition — 70B at Optimal 7 GB HRM
PHASE 1 · HBM-BOUND: each GB of HRM added displaces HBM traffic (hrm_time << hbm_time; HBM is the critical path). Latency drops sharply and speedup climbs steeply: the high-leverage operating region, where adding HRM pays off immediately.
CROSSOVER: hrm_t ≈ hbm_t.
PHASE 2 · HRM-BOUND: HBM traffic is largely eliminated and HRM becomes the critical path (hrm_time > hbm_time). Speedup settles to the 1.97× floor, still 1.97× better than baseline, with diminishing returns on further HRM. Optimal provisioning stops at the crossover point.

What This Simulator Doesn't Model

The README is transparent about the analytical model's scope, and that honesty is worth preserving here. This is a first-order latency model — not a cycle-accurate or trace-driven simulator. The following are explicitly out of scope:

| Not modelled | Expected impact direction | Future work |
|---|---|---|
| SRAM bank conflicts | Would reduce effective HRM bandwidth | Bank conflict model with access pattern distribution |
| Fabric arbitration contention | Would increase hrm_time at high utilisation | Contention exponent; multi-tenant load model |
| Thermal throttling | Would reduce sustained bandwidth over time | Thermal derating curve per workload duration |
| Multi-tenant HRM pressure | Would reduce effective hit ratio | Tenant quota model with fair-share scheduling |
| KV cache growth at long context | Would shift bottleneck tier at high context lengths | Context-length sweep study |
| Prefetch effectiveness | Would improve effective hit ratio | Prefetch model with speculative fetch accuracy |

The findings are presented as architectural hypotheses requiring validation against trace-driven simulation and eventually silicon measurement — not as production benchmarks. The value of this simulator is in identifying the right design space and the right provisioning targets to explore, not in predicting exact silicon numbers.

Relationship to the Wired Residency Primitive

SRMIC-X1 is the concrete accelerator architecture within which the hardware enforcement primitive described in the companion patent becomes physically instantiable. The RIC's pin operations are one possible implementation surface for the BIND instruction; the per-page hardware metadata provides the residency tag fields; the SRMESH fabric provides the bandwidth substrate that makes the HRM tier viable as a primary decode tier rather than just a prefetch buffer.

The two are complementary but independently motivated. SRMIC-X1 specifies the architecture. The wired residency primitive specifies the contractual interface between the runtime and the hardware — the mechanism by which a runtime can declare "this page must not leave HRM" and have that declaration enforced unconditionally by the RIC's replacement logic.


Repository

Full source code, simulation studies for both 7B and 70B model classes, architecture whitepaper, formal spec, and chiplet concept document available at github.com/manishklach/SRMIC_X1. The core analytical model is in core/decode_core.py. Study results including speedup curves and latency plots are in studies/7b/ and studies/70b/.