Essay No. 2  ·  Systems Thinking

The Memory Wall Is Moving

Why future AI inference is an orchestration problem — and what that actually means for hardware, software, and the systems we haven't built yet.

The question nobody asks

When people argue about making AI faster, they almost always argue about compute. More tensor cores. Better matrix units. Higher FLOP counts. New chip architectures. The conversation is overwhelmingly about arithmetic — how many multiply-accumulates per second can we throw at the problem.

This framing made sense for a long time. For most of deep learning's history, the bottleneck really was compute. Training runs were compute-bound. Early inference was compute-bound. The GPU era was an era of throwing more arithmetic at a problem and watching it shrink.

But something has shifted. And I think most of the industry hasn't fully absorbed what it means.

The bottleneck in modern large-language-model inference is not compute. It is memory — specifically, memory bandwidth and the latency of moving data between different levels of the storage hierarchy. And the implication is not just that we need faster memory. The implication is that inference is becoming, at its core, an orchestration problem.

"The question isn't how fast we can compute. It's how fast we can move state — and how intelligently we decide where that state lives."

How we got here: a brief history of the wall

The term "memory wall" has existed in computer architecture since the 1990s. It describes the growing gap between processor speed and memory speed — CPUs got faster much faster than DRAM got faster, and so the time spent waiting for data became an increasingly large fraction of total execution time.1

For AI workloads, this played out differently. GPUs solved a version of the memory wall through massive parallelism and high-bandwidth memory stacked directly on the package. When you can do thousands of operations in parallel, a single memory fetch amortizes across thousands of compute operations. The arithmetic intensity — the ratio of FLOPs to bytes moved — stays high, and the compute units stay busy.

The problem is that transformer inference, specifically the autoregressive decode phase, has extremely low arithmetic intensity. Here is why.

The decode arithmetic

In the decode phase, the model generates one token at a time. At each step, it runs a forward pass for a single token — not a batch, not a sequence, just one vector. The compute for one token is tiny. But the memory read for that token is not: every decode step requires fetching the entire KV cache — the accumulated key and value vectors for every prior token, across every layer, every attention head.

At 32,000 token context, 70 billion parameters, 80 layers, 8 KV heads (GQA), and head dimension 128, in FP16:

KV CACHE SIZE — PER TOKEN · PER DECODE STEP
KV bytes per token  = 2 × L × H_kv × d × sizeof(fp16)
                    = 2 × 80 × 8 × 128 × 2
                    = 327,680 bytes  ≈  320 KB

Total KV cache (32K context):
  32,000 × 320 KB  = 10.24 GB   read every single decode step

At HBM3 bandwidth ~3.2 TB/s:
  10.24 GB / 3.2 TB/s  = ~3.2 ms minimum  just for KV reads, per step
  At 100 decode steps  = ~320 ms  bandwidth-bound alone
A single decode step at 32K context reads over 10 GB from HBM. This is not a rounding error — it is the dominant cost of token generation.
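The same arithmetic as a few lines of Python, for anyone who wants to vary the parameters — the model shape and bandwidth below are the illustrative values from the worked example, not measurements:

# Back-of-envelope decode-phase memory cost for a hypothetical 70B GQA model.
# All values mirror the worked example above; illustrative, not measured.
LAYERS      = 80        # transformer layers
KV_HEADS    = 8         # KV heads under grouped-query attention
HEAD_DIM    = 128       # dimension per head
BYTES_FP16  = 2         # bytes per element
CONTEXT     = 32_000    # tokens already in context
HBM_BW      = 3.2e12    # bytes/second, roughly HBM3-class aggregate bandwidth

kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # K and V
kv_read_per_step   = CONTEXT * kv_bytes_per_token                   # whole cache, every step
min_read_time_s    = kv_read_per_step / HBM_BW

print(f"KV per token        : {kv_bytes_per_token / 1024:.0f} KB")   # ~320 KB
print(f"KV read per step    : {kv_read_per_step / 1e9:.1f} GB")      # just over 10 GB
print(f"Min read time / step: {min_read_time_s * 1e3:.1f} ms")       # a few ms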

The arithmetic intensity of a single decode step is roughly 1–10 FLOPs per byte. To keep an A100's compute units busy, a kernel needs an arithmetic intensity of around 200–300 FLOPs per byte. Decode therefore runs at roughly 2–5% arithmetic efficiency: the compute units are idle almost all the time. The bottleneck is memory, not compute.

The numbers in brief: ~3.2 ms minimum KV read time per decode step at 32K context · 2–5% compute utilisation during decode · ~100× SRAM-over-HBM latency advantage · 320 KB of KV state per token for a 70B GQA model.

The paradox of faster hardware

The response to this, from the hardware industry, has been to build faster and larger HBM. HBM3 doubled bandwidth over HBM2E. HBM3E pushes further. But there's a paradox here that I think is underappreciated: faster compute hardware doesn't necessarily help a memory-bound workload. It just makes the hardware more expensive and the memory bottleneck more glaring.

Worse, model scale is growing faster than bandwidth. A 70B model in 2023 is being succeeded by 200B, 400B, and mixture-of-experts models with total parameter counts in the trillions. Context windows have grown from 4K to 128K to 1M tokens in three years. The KV cache — scaling linearly with context length — grows faster than HBM capacity. The gap is not closing. It is widening.

"The industry response to a memory bottleneck was to build faster compute. This is like installing a wider highway when the problem is the parking lot at the end."

The only way out is not faster memory — it's smarter placement. The memory hierarchy already has fast storage available on-chip: SRAM. A modern AI accelerator has tens to hundreds of megabytes of on-chip SRAM with access latencies of 1–10 nanoseconds, versus 50–300 nanoseconds for HBM. The speed is there. The capacity is not — SRAM is expensive in die area and power, so there isn't much of it.

Which means the question becomes: given a small amount of very fast memory, what do you put in it? And more importantly, how do you decide?

Figure 1 — The transformer inference memory hierarchy
[Figure: memory hierarchy pyramid — registers (~256 KB); SRAM (32–256 MB, 1–10 ns latency; target tier for sink KV); HBM (40–80 GB, 50–300 ns, 3–4 TB/s bandwidth; where all decode KV reads go today); DRAM/CXL (256 GB to several TB, 200–600 ns, CXL-attached or pooled). Latency rises and capacity grows moving down the hierarchy.]

The transformer inference memory hierarchy. Width encodes approximate capacity; position on the vertical axis encodes latency. Today, all KV cache reads during decode go to HBM. The patent routes sink-token KV reads to SRAM instead.

What attention sinks tell us about the problem

Here is the insight that drove the research behind my recent patent filing (Indian Patent Application 202641062302). The memory bandwidth problem is real and large — but it is not uniformly distributed. Not all KV cache reads are created equal.

Transformer models using softmax attention exhibit a striking structural regularity: a tiny number of token positions receive a disproportionate share of total attention weight, consistently, across layers, across heads, and across the entire duration of the decode phase. These positions — BOS tokens, system-prompt boundaries, structural anchors — are called attention sinks.2

The phenomenon arises from a mathematical necessity: softmax attention weights must sum to one. When a query token has no clear target to attend to, it distributes probability mass across available positions. Initial tokens — always present, always available — become the natural recipients. Through training, models learn to rely on these positions as structural anchors for probability mass that has nowhere else to go.

The consequence for memory access patterns: 2–8 token positions out of a 32,000-token context account for 30–50% of total attention weight across the decode episode. These tokens' key and value vectors are read from memory on every decode step, for every layer, for every head. They are the most-read addresses in the entire KV cache — by an enormous margin.

The key insight

If you know in advance which token positions will dominate memory traffic for the entire decode phase — and you can identify them from the prefill attention weights — why are you storing them in the slow memory?

This is the question the patent answers. After the prefill pass, compute a cumulative attention score for each token position. Classify the top-scoring positions as sinks. Migrate their KV state from HBM into the on-chip SRAM fast tier before decode begins. Route KV reads for those positions to the SRAM path during decode. Merge the partial attention results from the fast and slow paths using numerically stable log-sum-exp accumulation. Model output is identical. Memory traffic is not.

Sink classification — the formula

SINK SCORE — CUMULATIVE ATTENTION MASS
// For each token position t, sum all attention it receives:
S(t) = Σ A(l, h, i, t)   ∀ l ∈ [1..L], h ∈ [1..H_kv], i > t

// Classify as sink:
is_sink(t)  ⟺  S(t) ≥ T_sink   // configurable threshold

// SRAM promotion cost (70B GQA, L=80, H_kv=8, d=128, FP16):
cost_per_sink = 2 × L × H_kv × d × 2 bytes = 327,680 bytes ≈ 320 KB

// 8 sinks promoted: 8 × 320 KB = 2.56 MB
// Budget consumed: 2.56 / 96 MB SRAM = ~2.7%
// Decode attention served from SRAM: ~40%
2.7% of SRAM budget captures 40% of decode memory traffic. The asymmetry is the entire point.
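As a concrete sketch of the classification step — assuming the prefill attention weights are available as a tensor A[layer, head, query, key]; the function name, default threshold, and budget handling here are mine, illustrative rather than the patent's reference implementation:

import numpy as np

def classify_sinks(attn, t_sink=3.0, sram_budget_bytes=96 * 2**20,
                   kv_bytes_per_token=327_680):
    # attn: prefill attention weights, shape (layers, kv_heads, queries, keys).
    # With a causal mask, only queries i >= t contribute to key position t,
    # so summing over every axis but the last recovers S(t).
    scores = attn.sum(axis=(0, 1, 2))                     # S(t), shape (keys,)

    # Candidate sinks: positions whose cumulative attention mass clears the
    # configurable threshold.
    candidates = np.flatnonzero(scores >= t_sink)

    # Promote only the highest-scoring candidates that fit in the SRAM budget.
    max_sinks = int(sram_budget_bytes // kv_bytes_per_token)
    keep = candidates[np.argsort(scores[candidates])[::-1]][:max_sinks]
    return np.sort(keep), scores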
Figure 2 — Attention-sink-aware SRAM placement: the full pipeline
[Figure: four-phase pipeline — Phase 1, Prefill: process all N tokens, collect A(l,h,i,t), compute S(t) per token. Phase 2, Promotion: classify positions with S(t) ≥ T_sink as sinks, DMA-copy their K/V tensors from HBM to the SRAM sink region (≈2.6 MB of a 96 MB budget), record token → SRAM address mappings in a tag array. Phase 3, Decode: for each new query, a bypass mux does a tag-array lookup per position and routes sink reads to SRAM (~5 ns) and bulk reads to HBM (~150 ns). Phase 4, Merge: log-sum-exp merge of the sink and bulk partial outputs, numerically identical to full-HBM attention, with 40–55% less bandwidth used.]

Fig. 2 — The four-phase pipeline of the patented invention. Prefill identifies sinks; DMA promotes their KV state to SRAM; the bypass mux routes decode reads by tier; log-sum-exp merges the partial outputs. The final result is mathematically identical to all-HBM inference.
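The merge is what keeps the two-path split exact. A minimal NumPy sketch for a single query and a single head — the decomposition mirrors FlashAttention-style partial softmax, and the final assertion is the "numerically identical" claim in miniature (names and sizes are illustrative):

import numpy as np

def partial_attention(q, k, v):
    # Attention over a subset of keys: return the normalised partial output
    # and the log-sum-exp of that subset's scores, so partials can be merged.
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (1, n_keys)
    m = scores.max()
    p = np.exp(scores - m)
    lse = m + np.log(p.sum())                      # log-sum-exp of this path
    out = (p @ v) / p.sum()                        # normalised within the path
    return out, lse

def merge(out_sink, lse_sink, out_bulk, lse_bulk):
    # Combine the SRAM (sink) and HBM (bulk) partial results; the weights are
    # a softmax over the two paths' log-sum-exp values.
    m = max(lse_sink, lse_bulk)
    w_sink = np.exp(lse_sink - m)
    w_bulk = np.exp(lse_bulk - m)
    return (w_sink * out_sink + w_bulk * out_bulk) / (w_sink + w_bulk)

# The merged result equals attention computed over all keys at once.
rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal((1, d))
k = rng.standard_normal((128, d)); v = rng.standard_normal((128, d))
sink_idx = np.array([0, 1])                        # pretend these are the sinks
bulk_idx = np.setdiff1d(np.arange(128), sink_idx)

o_s, l_s = partial_attention(q, k[sink_idx], v[sink_idx])
o_b, l_b = partial_attention(q, k[bulk_idx], v[bulk_idx])
o_full, _ = partial_attention(q, k, v)
assert np.allclose(merge(o_s, l_s, o_b, l_b), o_full)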

This isn't caching. It's orchestration.

The distinction matters more than it might first appear. A cache is reactive. It observes what has recently been accessed, keeps those things nearby, and evicts things that haven't been used in a while. It operates on recency — no knowledge of the future, no semantic understanding of what it's storing.

What the patent describes is fundamentally different. The placement decision is made before the first decode step, based on observed attention behavior during prefill. It knows, before decode begins, that certain token positions will be accessed on every single decode step for the entire generation. It acts on that knowledge proactively. It is forward-looking and semantic.

More importantly, the placement decision cannot be made correctly by an LRU cache. Here is why: attention sinks are accessed with high frequency throughout decode, but they were first placed in the KV cache during prefill — potentially thousands of decode steps ago. An LRU cache, operating on recency, would have evicted them long before they became critical. The cache is specifically blind to the access pattern that makes sinks important.
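A toy trace replay makes the blindness concrete. Decode scans every KV position on every step; against an LRU cache smaller than the context, that scan evicts the sink positions (along with everything else) before they are ever re-used, so recency-based caching captures none of the reuse that a two-entry pinned sink region would capture on every step. The sizes below are invented for illustration:

from collections import OrderedDict

def lru_hits(trace, capacity):
    # Replay an address trace against an LRU cache; return per-address hit counts.
    cache, hits = OrderedDict(), {}
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)
            hits[addr] = hits.get(addr, 0) + 1
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)          # evict least recently used
    return hits

context, steps, sinks = 1024, 50, [0, 1]
# Each decode step reads every position in order: a repeated sequential scan.
trace = [pos for _ in range(steps) for pos in range(context)]

hits = lru_hits(trace, capacity=256)               # cache smaller than the context
print("LRU hits on sink positions:", [hits.get(s, 0) for s in sinks])   # [0, 0]
# Pinning the two sinks instead would hit on all 50 decode steps from 2 slots.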

This is what I mean when I say inference is becoming an orchestration problem. Orchestration means explicit, forward-looking control over where state lives, when it moves, and what constraints govern its movement. It is not reaction to past behavior. It is planning based on anticipated future behavior.

"A cache reacts to what has happened. Orchestration plans for what will happen. For attention sinks, these are opposite strategies — and one of them is catastrophically wrong."

The broader shift: compute saturation and what comes after

I want to zoom out now, because the attention-sink insight is a specific instance of a general trend that I think is underappreciated in the industry discussion.

There is a progression that every major computing paradigm goes through. In the early phase, compute is the binding constraint. Hardware is expensive and scarce. Algorithms are designed to minimize operations. Efficiency means minimizing FLOPs.

As hardware matures, compute gets cheaper and more abundant. The constraint shifts. For OLTP databases in the 1990s, once CPU became cheap, the bottleneck became I/O — and storage systems became sophisticated. For web services in the 2000s, once single-server compute was plentiful, the bottleneck became coordination and consistency across distributed systems — and we built Paxos, Raft, and a generation of consensus protocols. For mobile in the 2010s, once ARM cores became fast enough, the bottleneck became battery life — and power management became a deep engineering discipline.

AI inference is moving through the same progression. We are exiting the "compute is the bottleneck" phase. We are entering the "memory movement is the bottleneck" phase. And memory movement is governed not just by hardware speed, but by policy — by decisions about what lives where, how it moves, and when.

The progression

1 · 2017–2022 · THE COMPUTE ERA
FLOPs are the bottleneck
Transformer training and early inference are compute-bound. Hardware investment goes into tensor cores, matrix units, FP16/BF16 support. Models are small enough that KV caches fit comfortably in VRAM. Arithmetic intensity is high. Memory bandwidth is not a constraint in practice.

2 · 2022–2024 · THE CAPACITY ERA
KV cache size becomes the constraint
Models scale to 70B, 175B parameters. Context windows grow to 32K, 128K. The KV cache no longer fits in on-chip SRAM or even comfortably in VRAM. The industry responds with quantization (INT8, INT4 KV), paging (vLLM's PagedAttention), and compression. The focus is on fitting state into available memory.

3 · 2024–NOW · THE BANDWIDTH ERA
Memory bandwidth is the binding constraint
Context windows at 128K–1M tokens. KV caches at 10s of GB per request. Each decode step reads gigabytes from HBM. Arithmetic utilization during decode falls to single digits. The hardware is mostly idle, waiting for data. The industry is just beginning to recognize that the right response is not faster memory — it's smarter placement.

4 · 2025–2028 (emerging) · THE ORCHESTRATION ERA
Memory movement becomes execution control
Inference systems develop explicit, forward-looking policies for where KV state lives, when it moves, and what residency guarantees it carries. Placement is driven by attention-derived importance, not recency. DMA scheduling becomes part of the execution plan. Compiler and runtime cooperate on memory orchestration. This is where the field is going.
Figure 3 — The four eras of AI inference systems design
[Figure: timeline of the four eras — Compute era (2017–2022, FLOPs constraint); Capacity era (2022–2024, KV cache size); Bandwidth era (2024–now, memory bandwidth constraint; we are here); Orchestration era (2025–2028, placement policy, DMA scheduling, residency contracts, compiler/runtime co-design).]

Fig. 3 — The four eras of AI inference system design. Each era is defined by the primary binding constraint. We are at the inflection point between the bandwidth era and the orchestration era.

What orchestration actually means in practice

The word "orchestration" can sound abstract. Let me make it concrete by describing what a fully-realized memory orchestration system for transformer inference would actually look like — drawing on both the patent and the research simulator I've built around it.

The six primitives of KV orchestration

Primitive 1 — Residency contracts

KV state is not just stored — it carries a residency guarantee. A sink token's KV vectors are guaranteed to remain in SRAM for at least N decode steps (the dwell period). Non-sink state in HBM carries a weaker guarantee. The placement controller enforces these contracts through validity bits, generation counters, and dwell timers. This is different from a cache, where eviction can happen at any time with no notice to the consumer.
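One illustrative reading of a residency contract as a data structure — the field names and the dwell default are mine (the 12-step dwell echoes the illustrative N_dwell value discussed later), not the patent's register layout:

from dataclasses import dataclass

@dataclass
class ResidencyContract:
    # Placement record for one token position's KV state in the SRAM fast tier.
    token_pos: int               # token position this entry covers
    sram_addr: int               # base address of the promoted K/V vectors
    valid: bool = True           # cleared if the entry is invalidated
    generation: int = 0          # bumped on re-promotion so stale readers can tell
    dwell_steps_left: int = 12   # guaranteed residency, counted in decode steps

    def tick(self) -> None:
        # Advance one decode step; the entry may only be demoted once the
        # dwell period has elapsed.
        if self.dwell_steps_left > 0:
            self.dwell_steps_left -= 1

    @property
    def evictable(self) -> bool:
        return (not self.valid) or self.dwell_steps_left == 0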

Primitive 2 — DMA scheduling as execution control

In a cache, data movement is a background process invisible to the execution engine. In an orchestration model, DMA transfers are part of the execution plan. The placement controller knows when promotion needs to complete (before the first decode step that needs the data), schedules the DMA accordingly, and can stagger transfers to hide memory copy latency behind compute operations. Memory movement is a first-class operation, not a side effect.
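A sketch of what "DMA as part of the plan" can mean in the simplest case: the deadline for the copy is derived from the execution schedule rather than left to a background fill. The bandwidth and timing numbers below are invented for illustration:

def latest_promotion_start(sink_bytes, dma_bandwidth_bps,
                           prefill_remaining_s, safety_margin_s=0.0005):
    # Latest time (seconds from now) the HBM -> SRAM copy can start and still
    # finish before the first decode step that expects sink KV in SRAM.
    transfer_s = sink_bytes / dma_bandwidth_bps
    latest = prefill_remaining_s - transfer_s - safety_margin_s
    if latest < 0:
        raise RuntimeError("promotion cannot be hidden; delay the first decode step")
    return latest

# 8 sinks x 320 KB over a ~100 GB/s DMA path is ~26 microseconds of copy time,
# easily hidden behind the tail of a prefill with ~10 ms left to run.
start_by = latest_promotion_start(sink_bytes=8 * 320 * 1024,
                                  dma_bandwidth_bps=100e9,
                                  prefill_remaining_s=0.010)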

Primitive 3 — Topology-aware placement

In a multi-chip or multi-node inference system, the memory hierarchy has more dimensions: on-chip SRAM, on-package HBM, off-package pooled memory, interconnect-attached memory across chips. Where a KV entry lives determines not just its access latency but which compute units can access it and at what bandwidth. Topology-aware placement means routing KV state to the memory tier that is physically closest to the compute that will consume it.
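Sketched as a placement function over a made-up tier table, the rule is simply "lowest latency among tiers that are local to the consumer and have room":

# Hypothetical tier table: (tier name, chip id) -> properties. Latencies in ns.
TIERS = {
    ("sram", 0): {"latency_ns": 5,   "free_bytes": 96 * 2**20, "local_to": {0}},
    ("hbm",  0): {"latency_ns": 150, "free_bytes": 64 * 2**30, "local_to": {0}},
    ("hbm",  1): {"latency_ns": 150, "free_bytes": 64 * 2**30, "local_to": {1}},
    ("cxl", -1): {"latency_ns": 400, "free_bytes": 2**40,      "local_to": {0, 1}},
}

def place(kv_bytes, consumer_chip):
    # Pick the lowest-latency tier the consuming chip can reach locally and
    # that has capacity left; charge the allocation against the tier.
    candidates = [(props["latency_ns"], key) for key, props in TIERS.items()
                  if consumer_chip in props["local_to"]
                  and props["free_bytes"] >= kv_bytes]
    if not candidates:
        raise RuntimeError("no reachable tier has capacity")
    _, key = min(candidates)
    TIERS[key]["free_bytes"] -= kv_bytes
    return key

print(place(320 * 1024, consumer_chip=0))    # ('sram', 0): fastest local tier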

Primitive 4 — Compiler/runtime cooperation

A compiler, analyzing the transformer computation graph, can identify in advance which KV reads will be on the critical path of each decode step, which promotion windows are safe (periods where DMA can run without interfering with compute), and which execution regions must be replay-safe (cannot have external side effects). The runtime uses this plan as a constraint envelope, observing actual attention behavior to make adaptive decisions within the bounds the compiler has pre-validated.

Primitive 5 — Multi-tenant prefix sharing

In a serving system, many concurrent requests share identical system prompts. Orchestration enables reference-counted SRAM regions for shared-prefix sink tokens: the sink KV state is promoted once, shared across all sessions using that prefix, and released when the reference count drops to zero. This is semantically richer than a conventional cache because it is driven by the model's attention behavior, not by access patterns.
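A minimal sketch of the reference counting, assuming promotion and free machinery exist elsewhere (the class and its interface are invented for illustration):

class SharedSinkRegions:
    # Reference-counted SRAM regions, one per shared prefix's sink KV state.

    def __init__(self):
        self._regions = {}   # prefix_hash -> {"addr": ..., "refs": int}

    def acquire(self, prefix_hash, promote_fn):
        # Attach a session to the prefix's sink region; promote to SRAM only
        # the first time any session using this prefix arrives.
        entry = self._regions.get(prefix_hash)
        if entry is None:
            entry = {"addr": promote_fn(prefix_hash), "refs": 0}
            self._regions[prefix_hash] = entry
        entry["refs"] += 1
        return entry["addr"]

    def release(self, prefix_hash, free_fn):
        # Detach a session; free the SRAM region when the last session leaves.
        entry = self._regions[prefix_hash]
        entry["refs"] -= 1
        if entry["refs"] == 0:
            free_fn(entry["addr"])
            del self._regions[prefix_hash]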

Primitive 6 — Deterministic replay windows

In a speculative decoding system, the decode can be "rolled back" if the verifier rejects draft tokens. Orchestration must handle this: promoted KV state during a speculative window must either be immutable (can be reused on rollback) or tracked with rollback semantics. This is a property that no standard cache interface provides. It requires explicit contract between the execution engine and the memory orchestrator.
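One way the rollback contract could be expressed — again an illustrative sketch, not the patent's mechanism verbatim: promotions made inside a speculative window are tracked separately, committed on accept or discarded on rollback, while already-committed entries stay immutable and valid.

class SpeculativeSinkWindow:
    # Track sink-KV promotions made inside a speculative decode window so they
    # can be kept if the verifier accepts, or discarded if it rejects.

    def __init__(self):
        self.committed = {}      # token_pos -> sram_addr, safe to reuse on rollback
        self.speculative = {}    # promotions made since the window opened

    def promote(self, token_pos, sram_addr, in_window):
        (self.speculative if in_window else self.committed)[token_pos] = sram_addr

    def accept(self):
        # Draft tokens verified: speculative promotions become permanent.
        self.committed.update(self.speculative)
        self.speculative.clear()

    def rollback(self, free_fn):
        # Draft rejected: drop speculative promotions; committed entries remain.
        for addr in self.speculative.values():
            free_fn(addr)
        self.speculative.clear()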

The gap in today's systems

None of these primitives exist as standardized interfaces in today's inference stack. vLLM's PagedAttention is excellent at managing KV cache fragmentation — but it has no concept of attention-derived placement policy. FlashAttention is excellent at minimizing memory bandwidth in the attention operation itself — but it doesn't control where the KV cache lives across decode steps. HuggingFace's inference stack manages KV cache lifecycle — but with no awareness of SRAM as a placement target.

The gap is not a criticism of any of these systems. They were designed for the bandwidth-era constraints that existed when they were built. The gap is an observation about where the frontier is now.

Where this points

The next generation of inference systems will need a layer between the model computation and the raw memory hardware — a memory orchestration layer that speaks the language of attention semantics, residency contracts, and DMA scheduling. Building that layer is the research program the patent represents. The simulator at github.com/manishklach/kv-attention-sram-memory-orchestrator is an early attempt to make those abstractions concrete and explorable.

A note on what this is not

I want to be precise about the claims I'm making — and the ones I'm not.

I'm not claiming that the specific parameters in the patent (T_sink = 3.0, α = 0.9, N_dwell = 12, etc.) are optimal for all models and workloads. They are illustrative. The right values depend on the model architecture, the workload distribution, the SRAM budget, and the serving requirements. Empirical calibration is needed, and the patent explicitly provides configuration registers for exactly this reason.

I'm not claiming the 40–55% bandwidth reduction number is universally achievable. It is calibrated to a 70B GQA model at 8K–32K context with instruction-following and RAG workloads. At other context lengths, other model architectures, or other attention patterns, the number will be different. The point is the mechanism, not the specific number.

I'm not claiming this is the complete solution to the inference efficiency problem. It is one specific, well-defined optimization in one part of the system. The full solution will involve many such optimizations operating together.

What I am claiming is that the framing — memory movement as a first-class execution concern, driven by attention-derived semantics — is the right framing for the next era of inference systems. The specific optimization is an instance of that principle. The principle is what I think will scale.

· · ·

What to read next

If this framing resonates, the following are worth reading in order:

  1. StreamingLLM (Xiao et al., 2023) — the original observation that attention sinks exist and can be exploited for streaming inference. The starting point for the attention-sink literature.
  2. H2O: Heavy Hitter Oracle (Zhang et al., 2023) — attention-importance-based KV eviction. Good comparison point for understanding what makes placement different from eviction.
  3. PagedAttention / vLLM (Kwon et al., 2023) — the dominant approach to KV cache management today. Understanding its design makes the gap that orchestration fills clearer.
  4. FlashAttention 2 and 3 (Dao et al., 2022–2024) — IO-aware attention computation. Shows what rigorous attention to memory hierarchy costs in a related problem. A template for the kind of thinking the orchestration layer needs.
  5. The KV Memory Orchestrator repo — the browser-native simulator built around the ideas in this post and the patent. Open index.html to explore.

Notes

  1. Wulf, W. and McKee, S.A., "Hitting the Memory Wall: Implications of the Obvious," ACM SIGARCH Computer Architecture News, 1995. The term "memory wall" dates to this paper.
  2. Xiao, G. et al., "Efficient Streaming Language Models with Attention Sinks," arXiv 2309.17453, 2023. First systematic characterization of attention sink behavior in transformer models.
  3. Indian Patent Application No. 202641062302. Title: "Methods and Systems for Attention-Sink-Aware SRAM Placement of Key-Value State in Transformer Inference." Filed 16 May 2026, Chennai Patent Office.