LLM inference has a memory problem, not a compute problem. The arithmetic intensity of generating a single token — the ratio of floating-point operations to bytes of memory accessed — is approximately one FLOP per byte. No modern accelerator is remotely compute-bound at that intensity. Every GPU, TPU, and NPU doing decode today is waiting for memory, not crunching numbers.
The root cause is structural. Contemporary LLMs have weight footprints of 100 GB to 700 GB. The on-chip SRAM available on a single accelerator die is 10–100 MB. The gap is three to four orders of magnitude. At each generation step, the model needs the same complete set of weight tensors — and because they don't fit on-chip, they get fetched from HBM, used once, evicted, and fetched again at the next step. For a 512-token output sequence, every weight tensor is loaded 512 times.
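The back-of-envelope arithmetic behind this claim is worth making concrete. A minimal sketch with illustrative numbers (the 140 GB FP16 footprint and 3.3 TB/s HBM figure are assumptions for this example, not the reference configuration used later):

```python
# Illustrative assumption: a 140 GB weight footprint (roughly 70B params
# at FP16) decoded on an accelerator with 3.3 TB/s of HBM bandwidth.
weights_gb = 140.0
hbm_tb_s = 3.3

# Per decode step, every weight byte must cross the HBM interface at
# least once, so memory traffic alone bounds the step time and token
# rate. Unit shortcut: GB / (TB/s) = milliseconds.
step_time_ms = weights_gb / hbm_tb_s
tokens_per_s_ceiling = 1000.0 / step_time_ms

print(round(step_time_ms, 1))          # → 42.4 ms floor per token
print(round(tokens_per_s_ceiling, 1))  # → 23.6 tokens/s, ignoring compute
```

No amount of additional compute raises this ceiling; only reducing the bytes that must cross the HBM interface does.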
SRMIC-X1 is a proposed accelerator architecture designed specifically around this reality. The central idea: instead of treating HBM as the primary memory tier for decode, interpose a distributed on-package SRAM layer (HRM — Hot Residency Memory) between HBM and the compute engines, connected by a purpose-built high-bandwidth fabric (SRMESH) that exceeds HBM bandwidth. This repository contains an analytical simulator that models the decode-step latency impact of that architecture across different model sizes and HRM capacity provisioning points.
"The question isn't whether SRAM is faster than HBM. It always is. The question is whether you can architect a system where the hot working set of LLM decode fits in SRAM — and whether the fabric connecting compute to that SRAM is fast enough to matter."
The Architecture
Three-Tier Memory Stack
SRMIC-X1 organises memory into three explicit tiers, each with a distinct role in the decode pipeline:

- HRM (Hot Residency Memory): 2 GB of on-package SRAM in the reference configuration (16 regions of 128 MB), fed by the 48 TB/s SRMESH fabric. Holds the hot per-token weight working set.
- HBM: 24 TB/s aggregate across 8 stacks. Holds the full weight footprint and services HRM misses.
- CXL pooled memory: a 64 GB/s warm tier for cold pages and capacity overflow.
The critical design choice is the bandwidth ordering: SRMESH (48 TB/s) intentionally exceeds HBM (24 TB/s) by 2×. This is not an engineering optimisation — it is a structural necessity. If SRMESH were slower than HBM, every HRM hit would be slower than a direct HBM access, and the architecture would be strictly worse than the baseline. The invariant SRMESH_BW ≥ HBM_BW is a hard architectural constraint, not a tunable parameter.
The HRM Region and the RIC
Each HRM region is 128 MB of SRAM, independently serviceable, with a local Residency Interconnect Controller (RIC) — a hardware state machine that handles promote, demote, pin, and multicast page operations without involving the host CPU on the decode critical path. There are 16 regions (4 chiplets × 4 regions per chiplet), giving 2 GB of total HRM capacity in the reference configuration.
The per-page metadata is what elevates the HRM above a simple cache. Every page carries a hotness counter, pin status, source tier indicator, tenant domain, format state, and demotion-legal flag. This gives the RIC the information it needs to make semantic residency decisions — not just passive LRU replacement — without any software involvement on the hot path. Thermal feedback from per-bank sensors also feeds into the RIC, allowing the architecture to degrade gracefully under thermal pressure rather than stalling.
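The metadata fields listed above can be sketched as a record type. The field names, types, and defaults here are illustrative assumptions, not the spec's actual encoding:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HRM = 0
    HBM = 1
    CXL = 2

@dataclass
class HrmPageMeta:
    """Per-page RIC metadata (illustrative field names and types)."""
    hotness: int = 0               # saturating access-frequency counter
    pinned: bool = False           # pin status: page must not be evicted
    source_tier: Tier = Tier.HBM   # tier the page was promoted from
    tenant: int = 0                # tenant domain for isolation decisions
    format_state: int = 0          # e.g. packed vs. unpacked weight format
    demotion_legal: bool = True    # RIC may demote this page only if True

    def evictable(self) -> bool:
        # Replacement considers only unpinned, demotion-legal pages,
        # which is what makes residency decisions semantic, not just LRU.
        return (not self.pinned) and self.demotion_legal
```

A pinned page reports `evictable() == False` no matter how cold its hotness counter runs, which is the behavioural core of the pin operation.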
The SRMESH Fabric
SRMESH is a three-level on-package interconnect designed specifically for decode-critical traffic patterns. Its key properties:

- 48 TB/s of aggregate bandwidth, deliberately 2× HBM, satisfying the SRMESH_BW ≥ HBM_BW invariant.
- 3 TB/s of bandwidth per HRM region (48 TB/s spread across 16 regions).
- Five virtual channels with hardware-enforced separation between traffic classes.
The five virtual channels allow decode-critical traffic to be separated from cold-page fills, KV cache moves, speculative prefetches, and management traffic. Decode-critical reads never wait behind a bulk cold-page fill — the fabric enforces this at the hardware level.
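One way to picture the channel separation (the channel names follow the traffic classes above; the strict-priority scheme and numeric values are assumptions for illustration):

```python
from enum import IntEnum

class SrmeshVC(IntEnum):
    """The five SRMESH virtual channels. The numeric priorities are an
    illustrative assumption: lower value wins arbitration."""
    DECODE_CRITICAL = 0       # latency-sensitive reads on the hot path
    KV_CACHE_MOVE = 1
    SPECULATIVE_PREFETCH = 2
    COLD_PAGE_FILL = 3        # bulk fills never block decode-critical reads
    MANAGEMENT = 4

def arbitrate(pending):
    # Strict-priority arbitration: serve the highest-priority pending VC.
    return min(pending)

winner = arbitrate({SrmeshVC.COLD_PAGE_FILL, SrmeshVC.DECODE_CRITICAL})
print(winner.name)  # → DECODE_CRITICAL
```

Strict priority is the simplest policy that delivers the guarantee stated above; real fabrics typically add aging or weighted arbitration to avoid starving the lower channels.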
The Analytical Model
v20: The Bounded Working-Set Correction
Earlier simulator versions (v10–v19) contained a fundamental bandwidth accounting error: HRM traffic was modelled as flowing serially through a single aggregate pipe, causing HRM latency to grow with occupancy. This produced non-monotonic and physically incorrect speedup curves. v20 corrects this with a key insight about how LLM decode actually works.
The per-region working set is fixed by model architecture, not by HRM occupancy. A 70B INT8 model with a 15% active weight fraction per decode step has a 10.5 GB active set. Across 16 HRM regions, each region's share is fixed at ~656 MB — regardless of how much of the total HRM is filled. This means:
```python
# Per-region working set: fixed by model, not by HRM occupancy
active_bytes_per_region = active_weights / hrm_regions

# HRM hit ratio controls what fraction is served from fast SRAM
hrm_bytes_per_region = active_bytes_per_region * hrm_hit_ratio
hbm_bytes = active_weights * (1.0 - hrm_hit_ratio)

# Regions operate in PARALLEL — time is per-region, not aggregate
hrm_time = (hrm_bytes_per_region / hrm_bw_per_region) + fabric_overhead
hbm_time = hbm_bytes / hbm_bw_aggregate
cxl_time = cxl_bytes / cxl_bw

# Decode step time: slowest tier determines latency (critical path)
mem_time = max(hrm_time, hbm_time, cxl_time, kv_time)

# Speedup over HBM-only baseline
speedup = hbm_only_time / mem_time
```
The crucial fix: higher HRM hit ratio reduces hbm_bytes without inflating hrm_time, because each region's per-region load is invariant. This produces physically correct monotonic improvement up to the bottleneck crossover point — the point where HBM and HRM times equalise and adding more HRM capacity yields diminishing returns.
The bounded working-set invariant is the architectural property that makes SRMIC work. Because the per-token active weight set assigned to each HRM region is determined by model weight partitioning and is invariant with respect to total HRM occupancy, the relation T_HRM = W_active/BW_SRMESH holds cleanly at the region level. Over-filling HRM regions doesn't increase their per-region service time: the speedup therefore climbs monotonically up to the crossover and then settles to a stable floor, instead of collapsing below baseline.
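Given that invariant, the crossover has a closed form: set the HBM time equal to the per-region HRM time and solve for the hit ratio. A sketch using the reference P1 numbers, with bandwidths in GB/ms so the times come out in milliseconds:

```python
def crossover_capacity_gb(active, hbm_bw=24.0, srmesh_bw=48.0, fabric=0.003):
    """HRM capacity (GB) at which HBM and HRM service times equalise.

    Solves active*(1-h)/hbm_bw == active*h/srmesh_bw + fabric for the
    hit ratio h (the per-region factors of 1/16 cancel out).
    """
    h_star = (active / hbm_bw - fabric) / (active / hbm_bw + active / srmesh_bw)
    return active * h_star

# 70B INT8 class: 10.5 GB active set
print(round(crossover_capacity_gb(10.5), 2))  # → 6.95
```

For the 70B active set this lands just under 7 GB, consistent with the starred peak row in the 70B study table.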
Hardware Parameters (Reference Config P1)
| Parameter | Value | Basis |
|---|---|---|
| HBM aggregate bandwidth | 24,000 GB/s | 8 stacks × 3 TB/s (HBM3e class) |
| SRMESH aggregate bandwidth | 48,000 GB/s | On-package SRAM fabric — no PHY overhead, short wires, 2× HBM |
| HRM regions | 16 | 4 chiplets × 4 regions per chiplet |
| HRM capacity per region | 128 MB SRAM | → 2 GB total HRM |
| SRMESH BW per region | 3,000 GB/s | 48,000 / 16 regions |
| Tensor inference clusters | 128 total | 4 TICs per HRM region |
| CXL bandwidth | 64 GB/s | CXL Gen5 ×16 — warm pooled tier |
| Fixed fabric overhead | 0.003 ms | Cross-region hop latency (model assumption) |
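The table above can be captured as a small config object, with the derived values and the hard invariant checked in code (the dictionary keys are my naming, not the simulator's):

```python
# Reference configuration P1 as a config object (key names are illustrative).
P1 = {
    "hbm_bw_gb_s":    24_000,  # 8 stacks x 3 TB/s (HBM3e class)
    "srmesh_bw_gb_s": 48_000,  # on-package SRAM fabric, no PHY overhead
    "hrm_regions":    16,      # 4 chiplets x 4 regions per chiplet
    "hrm_region_mb":  128,
    "tics_total":     128,     # 4 TICs per HRM region
    "cxl_bw_gb_s":    64,      # CXL Gen5 x16, warm pooled tier
    "fabric_ms":      0.003,   # fixed cross-region hop latency
}

# Hard architectural invariant, not a tunable parameter: if SRMESH were
# slower than HBM, every HRM hit would lose to a direct HBM access.
assert P1["srmesh_bw_gb_s"] >= P1["hbm_bw_gb_s"]

# Derived values used throughout the tables
bw_per_region_gb_s = P1["srmesh_bw_gb_s"] / P1["hrm_regions"]   # 3,000 GB/s
total_hrm_gb = P1["hrm_regions"] * P1["hrm_region_mb"] / 1024   # 2 GB
```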
Simulation Results
7B Model Class — Active Set ≈ 2.45 GB (35% of 7 GB)
For 7B INT8 class models, the active weight set per decode step is approximately 2.45 GB. The 7B study sweeps HRM capacity from 0.125 GB to 2 GB and tracks latency, speedup, and the active bottleneck at each point.
| HRM Size | HBM-only (ms) | SRMIC (ms) | Speedup | Bottleneck |
|---|---|---|---|---|
| 0.125 GB | 0.102 | 0.097 | 1.05× | HBM |
| 0.25 GB | 0.102 | 0.092 | 1.11× | HBM |
| 0.5 GB | 0.102 | 0.081 | 1.26× | HBM |
| 0.75 GB | 0.102 | 0.071 | 1.44× | HBM |
| 1.0 GB ★ | 0.102 | 0.062 | 1.63× | CXL ← plateau begins |
| 1.5 GB+ | 0.102 | 0.062 | 1.63× | CXL (floor) |
The 7B curve saturates cleanly at 1 GB of HRM. Beyond that, the CXL tier becomes the bottleneck floor and adding more SRAM yields no further improvement. The optimal provisioning point for 7B class models at this configuration is exactly 1 GB — which is ~41% of the active weight set.
70B Model Class — Active Set ≈ 10.5 GB (15% of 70 GB)
The 70B study is where the architecture's behaviour becomes analytically interesting. The larger active set means the bottleneck crossover happens at a higher HRM capacity, and the curve reveals both a peak and a stable settling floor.
| HRM Size | HBM-only (ms) | SRMIC (ms) | Speedup | Bottleneck |
|---|---|---|---|---|
| 1 GB | 0.438 | 0.396 | 1.11× | HBM |
| 2 GB | 0.438 | 0.354 | 1.24× | HBM |
| 3 GB | 0.438 | 0.312 | 1.40× | HBM |
| 4 GB | 0.438 | 0.271 | 1.62× | HBM |
| 5 GB | 0.438 | 0.229 | 1.91× | HBM |
| 6 GB | 0.438 | 0.188 | 2.33× | HBM |
| 7 GB ★ | 0.438 | 0.149 | 2.94× | HRM ← peak crossover |
| 8 GB | 0.438 | 0.170 | 2.58× | HRM |
| 9 GB | 0.438 | 0.191 | 2.30× | HRM |
| 10 GB | 0.438 | 0.211 | 2.07× | HRM |
| 11+ GB | 0.438 | 0.222 | 1.97× | HRM (floor) |
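The starred peak and the floor row follow directly from the v20 equations with the P1 parameters. A sketch that keeps only the HBM and HRM terms (the CXL and KV terms never bind anywhere in this 70B sweep, per the bottleneck column):

```python
active = 10.5          # GB: 70B INT8 active set (15% of 70 GB)
regions, hbm_bw, srmesh_bw, fabric = 16, 24.0, 48.0, 0.003  # GB/ms, ms

def step_time_ms(hrm_gb):
    h = min(hrm_gb / active, 1.0)                 # hit ratio set by capacity
    hbm_time = active * (1.0 - h) / hbm_bw        # aggregate HBM pipe
    hrm_time = (active / regions) * h / (srmesh_bw / regions) + fabric
    return max(hbm_time, hrm_time)                # slowest tier wins

baseline = active / hbm_bw                        # 0.4375 ms, HBM-only
print(round(step_time_ms(7.0), 3))                # → 0.149 (the peak row)
print(round(baseline / step_time_ms(11.0), 2))    # → 1.97 (the floor)
```

At 7 GB the two tier times are within the fabric overhead of each other; past that, the capacity-forced hit ratio keeps shifting load onto the (slower per-region) HRM path until it saturates at the 1.97× floor.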
Key Findings and Architectural Implications
Taken together, the two studies support a few observations, all visible in the tables above:
- Decode is memory-bound at every configuration measured; the only question is which tier binds.
- The SRMESH_BW ≥ HBM_BW invariant is load-bearing: without it, every HRM hit would cost more than the HBM access it replaces.
- Provisioning targets are model-class specific: ~1 GB of HRM saturates the 7B class (the CXL tier becomes the floor), while the 70B class peaks at 7 GB, where HBM and HRM times equalise.
- Past the crossover, speedup settles to a stable floor rather than collapsing, a direct consequence of the bounded per-region working set.
The Bottleneck Crossover — Visualised
Speedup curves and latency plots for both sweeps, with the crossover points marked, are generated by the studies under studies/7b/ and studies/70b/ in the repository.
What This Simulator Doesn't Model
The README is transparent about the analytical model's scope, and that transparency is worth preserving here: this is a first-order latency model, not a cycle-accurate or trace-driven simulator. The following are explicitly out of scope:
| Not modelled | Expected impact direction | Future work |
|---|---|---|
| SRAM bank conflicts | Would reduce effective HRM bandwidth | Bank conflict model with access pattern distribution |
| Fabric arbitration contention | Would increase hrm_time at high utilisation | Contention exponent; multi-tenant load model |
| Thermal throttling | Would reduce sustained bandwidth over time | Thermal derating curve per workload duration |
| Multi-tenant HRM pressure | Would reduce effective hit ratio | Tenant quota model with fair-share scheduling |
| KV cache growth at long context | Would shift bottleneck tier at high context lengths | Context-length sweep study |
| Prefetch effectiveness | Would improve effective hit ratio | Prefetch model with speculative fetch accuracy |
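As one illustration of how the fabric-contention row might be closed out (the derating form and the exponent are pure assumptions, not part of the current model):

```python
# Hypothetical sketch of the "contention exponent" future-work item:
# derate effective per-region HRM bandwidth as fabric utilisation rises.
def hrm_time_with_contention(bytes_gb, bw_gb_ms, util, fabric=0.003, k=2.0):
    """util in [0, 1); k is a tunable contention exponent (assumed form)."""
    effective_bw = bw_gb_ms * (1.0 - util ** k)
    return bytes_gb / effective_bw + fabric

# At low utilisation the ideal v20 model is recovered; at high utilisation
# hrm_time inflates, which would pull the crossover to lower HRM capacities.
print(hrm_time_with_contention(0.4375, 3.0, util=0.1))
print(hrm_time_with_contention(0.4375, 3.0, util=0.9))
```

Any such extension would make hrm_time occupancy-dependent again, so it would need to be validated carefully against the bounded working-set invariant rather than reintroducing the pre-v20 serial-pipe error.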
The findings are presented as architectural hypotheses requiring validation against trace-driven simulation and eventually silicon measurement — not as production benchmarks. The value of this simulator is in identifying the right design space and the right provisioning targets to explore, not in predicting exact silicon numbers.
Relationship to the Wired Residency Primitive
SRMIC-X1 is the concrete accelerator architecture within which the hardware enforcement primitive described in the companion patent becomes physically instantiable. The RIC's pin operations are one possible implementation surface for the BIND instruction; the per-page hardware metadata provides the residency tag fields; the SRMESH fabric provides the bandwidth substrate that makes the HRM tier viable as a primary decode tier rather than just a prefetch buffer.
The two are complementary but independently motivated. SRMIC-X1 specifies the architecture. The wired residency primitive specifies the contractual interface between the runtime and the hardware — the mechanism by which a runtime can declare "this page must not leave HRM" and have that declaration enforced unconditionally by the RIC's replacement logic.
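A toy sketch of that contract (the class and method names are hypothetical, not the spec's interface): the point is that the pin is enforced unconditionally by replacement, not merely deprioritised.

```python
class HrmRegionModel:
    """Toy model of one HRM region's replacement state (illustrative only)."""
    def __init__(self):
        self.pages = {}  # page_id -> {"pinned": bool, "hot": int}

    def insert(self, page_id):
        self.pages[page_id] = {"pinned": False, "hot": 0}

    def bind(self, page_id):
        # The runtime's declaration: "this page must not leave HRM".
        self.pages[page_id]["pinned"] = True

    def victim(self):
        # Replacement logic: pinned pages are never eviction candidates,
        # unconditionally -- not merely ranked last.
        candidates = [p for p, m in self.pages.items() if not m["pinned"]]
        if not candidates:
            raise RuntimeError("all pages pinned; promotion must be refused")
        return min(candidates, key=lambda p: self.pages[p]["hot"])

region = HrmRegionModel()
for pid in ("w0", "w1", "w2"):
    region.insert(pid)
region.bind("w0")
print(region.victim())  # never "w0", regardless of hotness
```

The failure mode in the all-pinned branch is the interesting design question: a real RIC must decide whether to refuse the promotion, spill to HBM, or fault to the runtime.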
Full source code, simulation studies for both 7B and 70B model classes, architecture whitepaper, formal spec, and chiplet concept document available at github.com/manishklach/SRMIC_X1. The core analytical model is in core/decode_core.py. Study results including speedup curves and latency plots are in studies/7b/ and studies/70b/.