The Memory Wall Is Moving
Why future AI inference is an orchestration problem — and what that actually means for hardware, software, and the systems we haven't built yet.
The question nobody asks
When people argue about making AI faster, they almost always argue about compute. More tensor cores. Better matrix units. Higher FLOP counts. New chip architectures. The conversation is overwhelmingly about arithmetic — how many multiply-accumulates per second can we throw at the problem.
This framing made sense for a long time. For most of deep learning's history, the bottleneck really was compute. Training runs were compute-bound. Early inference was compute-bound. The GPU era was an era of throwing more arithmetic at a problem and watching it shrink.
But something has shifted. And I think most of the industry hasn't fully absorbed what it means.
The bottleneck in modern large-language-model inference is not compute. It is memory — specifically, memory bandwidth and the latency of moving data between different levels of the storage hierarchy. And the implication is not just that we need faster memory. The implication is that inference is becoming, at its core, an orchestration problem.
"The question isn't how fast we can compute. It's how fast we can move state — and how intelligently we decide where that state lives."
How we got here: a brief history of the wall
The term "memory wall" has existed in computer architecture since the 1990s. It describes the growing gap between processor speed and memory speed: CPU performance improved far faster than DRAM performance, so the time spent waiting for data became an increasingly large fraction of total execution time.[1]
For AI workloads, this played out differently. GPUs solved a version of the memory wall through massive parallelism and high-bandwidth memory stacked directly on the package. When you can do thousands of operations in parallel, a single memory fetch amortizes across thousands of compute operations. The arithmetic intensity — the ratio of FLOPs to bytes moved — stays high, and the compute units stay busy.
The problem is that transformer inference, specifically the autoregressive decode phase, has extremely low arithmetic intensity. Here is why.
The decode arithmetic
In the decode phase, the model generates one token at a time. At each step, it runs a forward pass for a single token — not a batch, not a sequence, just one vector. The compute for one token is tiny. But the memory read for that token is not: every decode step requires fetching the entire KV cache — the accumulated key and value vectors for every prior token, across every layer, every attention head.
For a 70-billion-parameter model with 80 layers, 8 KV heads (GQA), and head dimension 128, at a 32,000-token context in FP16:
KV bytes per token = 2 × L × H_kv × d × sizeof(fp16)
= 2 × 80 × 8 × 128 × 2
= 327,680 bytes ≈ 320 KB
Total KV cache (32K context):
32,000 × 320 KB ≈ 10.5 GB read every single decode step
At HBM3 bandwidth ~3.2 TB/s:
10.5 GB / 3.2 TB/s ≈ 3.3 ms minimum just for KV reads, per step
At 100 decode steps ≈ 330 ms of bandwidth-bound time alone
The arithmetic intensity of a single decode step is roughly 1–10 FLOPs per byte. The ridge point, the arithmetic intensity below which a modern datacenter GPU cannot keep its compute units busy, sits at roughly 150–300 FLOPs per byte (roughly 150 for an A100 at FP16, higher for its successors). The decode phase is running at 2–5% of arithmetic efficiency. The compute units are idle almost all the time. The bottleneck is exclusively memory.
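These numbers are easy to check mechanically. A back-of-envelope sketch in Python using the figures from the text; the query-head count H_q = 64 (8 query heads per KV head) is my assumption for the FLOP estimate, not stated above:

```python
# Decode-step cost model for a 70B GQA transformer.
# Figures from the text; H_q = 64 query heads is an assumption.
L, H_kv, H_q, d = 80, 8, 64, 128
ctx = 32_000
fp16 = 2  # bytes per element

# KV bytes per token, across all layers and KV heads (K and V).
kv_per_token = 2 * L * H_kv * d * fp16   # 327,680 B, about 320 KB

# Entire KV cache, re-read on every decode step.
kv_total = ctx * kv_per_token            # about 10.5 GB

# Minimum per-step time if HBM3 bandwidth (~3.2 TB/s) is the only limit.
hbm_bw = 3.2e12
t_step = kv_total / hbm_bw               # about 3.3 ms

# Attention FLOPs per step: QK^T plus AV, 2*ctx*d each, per layer per query head.
flops = 4 * ctx * d * L * H_q
intensity = flops / kv_total             # FLOPs per KV byte read

print(f"{kv_per_token} B/token, {kv_total/1e9:.2f} GB/step, "
      f"{t_step*1e3:.2f} ms/step, {intensity:.1f} FLOP/B")
```

Under these assumptions the decode step lands at about 8 FLOPs per KV byte, squarely inside the 1–10 range, and one to two orders of magnitude below the ridge point.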
The paradox of faster hardware
The response to this, from the hardware industry, has been to build faster and larger HBM. HBM3 doubled bandwidth over HBM2E. HBM3E pushes further. But there's a paradox here that I think is underappreciated: faster compute hardware doesn't necessarily help a memory-bound workload. It just makes the hardware more expensive and the memory bottleneck more glaring.
Worse, model scale is growing faster than bandwidth. A 70B model in 2023 is being succeeded by 200B, 400B, and mixture-of-experts models with trillions of effective parameters. Context windows have grown from 4K to 128K to 1M tokens in three years. The KV cache — scaling linearly with context length — grows faster than HBM capacity. The gap is not closing. It is widening.
"The industry response to a memory bottleneck was to build faster compute. This is like installing a wider highway when the problem is the parking lot at the end."
The only way out is not faster memory — it's smarter placement. The memory hierarchy already has fast storage available on-chip: SRAM. A modern AI accelerator has tens to hundreds of megabytes of on-chip SRAM with access latencies of 1–10 nanoseconds, versus 50–300 nanoseconds for HBM. The speed is there. The capacity is not — SRAM is expensive in die area and power, so there isn't much of it.
Which means the question becomes: given a small amount of very fast memory, what do you put in it? And more importantly, how do you decide?
Fig. 1 — The transformer inference memory hierarchy. Width encodes approximate capacity; position on the vertical axis encodes latency. Today, all KV cache reads during decode go to HBM. The patent routes sink-token KV reads to SRAM instead.
What attention sinks tell us about the problem
Here is the insight that drove the research behind my recent patent filing (Indian Patent Application 202641062302). The memory bandwidth problem is real and large — but it is not uniformly distributed. Not all KV cache reads are created equal.
Transformer models using softmax attention exhibit a striking structural regularity: a tiny number of token positions receive a disproportionate share of total attention weight, consistently, across layers, across heads, and across the entire duration of the decode phase. These positions — BOS tokens, system-prompt boundaries, structural anchors — are called attention sinks.[2]
The phenomenon arises from a mathematical necessity: softmax attention weights must sum to one. When a query token has no clear target to attend to, it distributes probability mass across available positions. Initial tokens — always present, always available — become the natural recipients. Through training, models learn to rely on these positions as structural anchors for probability mass that has nowhere else to go.
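The mechanics are visible in a toy example: with flat logits the mass spreads uniformly, and a single modestly elevated logit at position 0 absorbs most of it. The logit values below are illustrative, not measured from any model.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# No clear target: probability mass spreads uniformly across positions.
flat = softmax([0.0, 0.0, 0.0, 0.0])     # 0.25 everywhere

# A learned sink: one elevated logit at position 0 absorbs most of the mass.
sink = softmax([4.0, 0.0, 0.0, 0.0])     # position 0 dominates

print(flat, sink[0])
```

The weights always sum to one; the only question is where the residual mass goes, and training answers it with the always-present initial positions.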
The consequence for memory access patterns: 2–8 token positions out of a 32,000-token context account for 30–50% of total attention weight across the decode episode. These tokens' key and value vectors are read from memory on every decode step, for every layer, for every head. They are the most-read addresses in the entire KV cache — by an enormous margin.
If you know in advance which token positions will dominate memory traffic for the entire decode phase — and you can identify them from the prefill attention weights — why are you storing them in the slow memory?
This is the question the patent answers. After the prefill pass, compute a cumulative attention score for each token position. Classify the top-scoring positions as sinks. Migrate their KV state from HBM into the on-chip SRAM fast tier before decode begins. Route KV reads for those positions to the SRAM path during decode. Merge the partial attention results from the fast and slow paths using numerically stable log-sum-exp accumulation. Model output is identical. Memory traffic is not.
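The merge step is the same numerically stable trick flash-style attention kernels use to combine partial softmax results. A NumPy sketch (function names are mine, not the patent's) showing that the two-path result is exactly the monolithic result:

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one partition; returns (max_logit, normalizer, output)."""
    logits = K @ q                      # attention logits for this partition
    m = logits.max()
    w = np.exp(logits - m)              # unnormalized, numerically stable
    s = w.sum()
    o = (w @ V) / s                     # partition-local softmax output
    return m, s, o

def merge(m1, s1, o1, m2, s2, o2):
    """Log-sum-exp merge of two partial attention results."""
    m = max(m1, m2)
    a1 = s1 * np.exp(m1 - m)            # rescale each normalizer to the shared max
    a2 = s2 * np.exp(m2 - m)
    return (a1 * o1 + a2 * o2) / (a1 + a2)

rng = np.random.default_rng(0)
d, n_sink, n_rest = 8, 2, 30
q = rng.normal(size=d)
K = rng.normal(size=(n_sink + n_rest, d))
V = rng.normal(size=(n_sink + n_rest, d))

# Fast path (sinks, "SRAM") and slow path (rest, "HBM") computed separately...
merged = merge(*partial_attention(q, K[:n_sink], V[:n_sink]),
               *partial_attention(q, K[n_sink:], V[n_sink:]))

# ...equal the single-pass result over the full context.
logits = K @ q
w = np.exp(logits - logits.max())
full = (w @ V) / w.sum()
print(np.allclose(merged, full))
```

The equivalence is what licenses the claim that model output is identical: the split changes where bytes come from, not what the attention computes.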
Sink classification — the formula
// For each token position t, sum all attention it receives:
S(t) = Σ A(l, h, i, t)   ∀ l ∈ [1..L], h ∈ [1..H_kv], i > t

// Classify as sink:
is_sink(t) ⟺ S(t) ≥ T_sink   // configurable threshold

// SRAM promotion cost (70B GQA, L=80, H_kv=8, d=128, FP16):
cost_per_sink = 2 × L × H_kv × d × 2 bytes = 327,680 bytes ≈ 320 KB

// 8 sinks promoted: 8 × 320 KB = 2.56 MB
// Budget consumed: 2.56 MB / 96 MB SRAM ≈ 2.7%
// Decode attention served from SRAM: ~40%
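As runnable code, the scoring step is short. A NumPy sketch, with two caveats: the synthetic attention tensor is a caricature of real sink behavior, and I normalize the score per (layer, head) pair so that a threshold like T_sink = 3.0 is scale-independent; the patent's exact convention may differ.

```python
import numpy as np

def sink_scores(attn):
    """attn: [L, H_kv, S, S] prefill attention weights (softmax rows in real use).
    Returns S(t): attention received by position t from later queries i > t,
    summed over layers and heads, normalized per (layer, head) pair."""
    L, H, S, _ = attn.shape
    mask = np.tril(np.ones((S, S)), k=-1)   # mask[i, t] = 1 iff i > t
    return (attn * mask).sum(axis=(0, 1, 2)) / (L * H)

def classify_sinks(attn, t_sink=3.0):
    return sink_scores(attn) >= t_sink

# Synthetic prefill: every query puts half its mass on position 0 (BOS-like),
# the rest spread uniformly -- a caricature of measured sink behavior.
L, H, S = 4, 2, 64
attn = np.full((L, H, S, S), 0.5 / S)
attn[:, :, :, 0] += 0.5
attn *= np.tril(np.ones((S, S)))            # causal: no attention to the future
is_sink = classify_sinks(attn, t_sink=3.0)
print(is_sink[0], is_sink[1:].any())        # position 0 is the only sink
```

In a real pipeline this runs once, at the end of prefill; the resulting boolean mask is the entire input the DMA promotion step needs.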
Fig. 2 — The four-phase pipeline of the patented invention. Prefill identifies sinks; DMA promotes their KV state to SRAM; the bypass mux routes decode reads by tier; log-sum-exp merges the partial outputs. The final result is mathematically identical to all-HBM inference.
This isn't caching. It's orchestration.
The distinction matters more than it might first appear. A cache is reactive. It observes what has recently been accessed, keeps those things nearby, and evicts things that haven't been used in a while. It operates on recency — no knowledge of the future, no semantic understanding of what it's storing.
What the patent describes is fundamentally different. The placement decision is made before the first decode step, based on observed attention behavior during prefill. It knows, before decode begins, that certain token positions will be accessed on every single decode step for the entire generation. It acts on that knowledge proactively. It is forward-looking and semantic.
More importantly, the placement decision cannot be made correctly by an LRU cache. Here is why: every decode step sweeps the entire KV cache from the earliest position to the newest, so any fast tier smaller than the cache faces the classic worst case for recency-based eviction. By the time one sweep finishes, the sink entries at the front of the context are the least recently used lines in the tier, and they are evicted just before the next step needs them again. Recency carries no information about attention weight. The cache is specifically blind to the access pattern that makes sinks important.
This is what I mean when I say inference is becoming an orchestration problem. Orchestration means explicit, forward-looking control over where state lives, when it moves, and what constraints govern its movement. It is not reaction to past behavior. It is planning based on anticipated future behavior.
"A cache reacts to what has happened. Orchestration plans for what will happen. For attention sinks, these are opposite strategies — and one of them is catastrophically wrong."
The broader shift: compute saturation and what comes after
I want to zoom out now, because the attention-sink insight is a specific instance of a general trend that I think is underappreciated in the industry discussion.
There is a progression that every major computing paradigm goes through. In the early phase, compute is the binding constraint. Hardware is expensive and scarce. Algorithms are designed to minimize operations. Efficiency means minimizing FLOPs.
As hardware matures, compute gets cheaper and more abundant. The constraint shifts. For OLTP databases in the 1990s, once CPU became cheap, the bottleneck became I/O — and storage systems became sophisticated. For web services in the 2000s, once single-server compute was plentiful, the bottleneck became coordination and consistency across distributed systems — and we built Paxos, Raft, and a generation of consensus protocols. For mobile in the 2010s, once ARM cores became fast enough, the bottleneck became battery life — and power management became a deep engineering discipline.
AI inference is moving through the same progression. We are exiting the "compute is the bottleneck" phase. We are entering the "memory movement is the bottleneck" phase. And memory movement is governed not just by hardware speed, but by policy — by decisions about what lives where, how it moves, and when.
The progression
Fig. 3 — The four eras of AI inference system design. Each era is defined by the primary binding constraint. We are at the inflection point between the bandwidth era and the orchestration era.
What orchestration actually means in practice
The word "orchestration" can sound abstract. Let me make it concrete by describing what a fully-realized memory orchestration system for transformer inference would actually look like — drawing on both the patent and the research simulator I've built around it.
The six primitives of KV orchestration
Residency contracts. KV state is not just stored — it carries a residency guarantee. A sink token's KV vectors are guaranteed to remain in SRAM for at least N decode steps (the dwell period). Non-sink state in HBM carries a weaker guarantee. The placement controller enforces these contracts through validity bits, generation counters, and dwell timers. This is different from a cache, where eviction can happen at any time with no notice to the consumer.
Scheduled data movement. In a cache, data movement is a background process invisible to the execution engine. In an orchestration model, DMA transfers are part of the execution plan. The placement controller knows when promotion needs to complete (before the first decode step that needs the data), schedules the DMA accordingly, and can stagger transfers to hide memory copy latency behind compute operations. Memory movement is a first-class operation, not a side effect.
Topology-aware placement. In a multi-chip or multi-node inference system, the memory hierarchy has more dimensions: on-chip SRAM, on-package HBM, off-package pooled memory, interconnect-attached memory across chips. Where a KV entry lives determines not just its access latency but which compute units can access it and at what bandwidth. Topology-aware placement means routing KV state to the memory tier that is physically closest to the compute that will consume it.
Compiler-planned, runtime-adapted execution. A compiler, analyzing the transformer computation graph, can identify in advance which KV reads will be on the critical path of each decode step, which promotion windows are safe (periods where DMA can run without interfering with compute), and which execution regions must be replay-safe (cannot have external side effects). The runtime uses this plan as a constraint envelope, observing actual attention behavior to make adaptive decisions within the bounds the compiler has pre-validated.
Cross-session prefix sharing. In a serving system, many concurrent requests share identical system prompts. Orchestration enables reference-counted SRAM regions for shared-prefix sink tokens: the sink KV state is promoted once, shared across all sessions using that prefix, and released when the reference count drops to zero. This is semantically richer than a conventional cache because it is driven by the model's attention behavior, not by access patterns.
Rollback-safe promotion. In a speculative decoding system, the decode can be "rolled back" if the verifier rejects draft tokens. Orchestration must handle this: promoted KV state during a speculative window must either be immutable (can be reused on rollback) or tracked with rollback semantics. This is a property that no standard cache interface provides. It requires an explicit contract between the execution engine and the memory orchestrator.
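None of these primitives requires exotic hardware to prototype as an interface. A hypothetical sketch of the residency-contract surface (all names and fields are mine, not the patent's), with dwell timers, generation counters, and reference counts as first-class state:

```python
from dataclasses import dataclass

@dataclass
class ResidencyContract:
    """Hypothetical SRAM residency contract for one promoted KV region."""
    token_pos: int
    dwell_steps: int            # guaranteed residency, in decode steps
    generation: int = 0         # bumped on every (re)promotion
    refcount: int = 0           # sessions sharing this region (prefix sharing)
    valid: bool = False
    steps_resident: int = 0

    def promote(self):
        self.generation += 1
        self.valid = True
        self.steps_resident = 0

    def tick(self):             # called once per decode step
        self.steps_resident += 1

    def acquire(self):
        self.refcount += 1

    def release(self):
        self.refcount -= 1

    def may_evict(self):
        # Evictable only once the dwell guarantee has expired and no
        # session still holds a reference.
        return (self.valid
                and self.steps_resident >= self.dwell_steps
                and self.refcount == 0)

c = ResidencyContract(token_pos=0, dwell_steps=12)
c.promote()
c.acquire()
for _ in range(12):
    c.tick()
print(c.may_evict())            # a session still holds a reference
c.release()
print(c.may_evict())            # dwell expired and refcount is zero
```

The point of the sketch is the shape of the interface, not the implementation: eviction becomes a negotiated event governed by explicit state, rather than a silent side effect of recency.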
The gap in today's systems
None of these primitives exist as standardized interfaces in today's inference stack. vLLM's PagedAttention is excellent at managing KV cache fragmentation — but it has no concept of attention-derived placement policy. FlashAttention is excellent at minimizing memory bandwidth in the attention operation itself — but it doesn't control where the KV cache lives across decode steps. HuggingFace's inference stack manages KV cache lifecycle — but with no awareness of SRAM as a placement target.
The gap is not a criticism of any of these systems. They were designed for the bandwidth-era constraints that existed when they were built. The gap is an observation about where the frontier is now.
The next generation of inference systems will need a layer between the model computation and the raw memory hardware — a memory orchestration layer that speaks the language of attention semantics, residency contracts, and DMA scheduling. Building that layer is the research program the patent represents. The simulator at github.com/manishklach/kv-attention-sram-memory-orchestrator is an early attempt to make those abstractions concrete and explorable.
A note on what this is not
I want to be precise about the claims I'm making — and the ones I'm not.
I'm not claiming that the specific parameters in the patent (T_sink = 3.0, α = 0.9, N_dwell = 12, etc.) are optimal for all models and workloads. They are illustrative. The right values depend on the model architecture, the workload distribution, the SRAM budget, and the serving requirements. Empirical calibration is needed, and the patent explicitly provides configuration registers for exactly this reason.
I'm not claiming the 40–55% bandwidth reduction number is universally achievable. It is calibrated to a 70B GQA model at 8K–32K context with instruction-following and RAG workloads. At other context lengths, other model architectures, or other attention patterns, the number will be different. The point is the mechanism, not the specific number.
I'm not claiming this is the complete solution to the inference efficiency problem. It is one specific, well-defined optimization in one part of the system. The full solution will involve many such optimizations operating together.
What I am claiming is that the framing — memory movement as a first-class execution concern, driven by attention-derived semantics — is the right framing for the next era of inference systems. The specific optimization is an instance of that principle. The principle is what I think will scale.
What to read next
If this framing resonates, the following are worth reading in order:
- StreamingLLM (Xiao et al., 2023) — the original observation that attention sinks exist and can be exploited for streaming inference. The starting point for the attention-sink literature.
- H2O: Heavy Hitter Oracle (Zhang et al., 2023) — attention-importance-based KV eviction. Good comparison point for understanding what makes placement different from eviction.
- PagedAttention / vLLM (Kwon et al., 2023) — the dominant approach to KV cache management today. Understanding its design makes the gap that orchestration fills clearer.
- FlashAttention 2 and 3 (Dao et al., 2022–2024) — IO-aware attention computation. Shows what rigorous attention to memory hierarchy costs in a related problem. A template for the kind of thinking the orchestration layer needs.
- The KV Memory Orchestrator repo — the browser-native simulator built around the ideas in this post and the patent. Open index.html to explore.
Notes
- Wulf, W. and McKee, S.A., "Hitting the Memory Wall: Implications of the Obvious," ACM SIGARCH Computer Architecture News, 1995. The term "memory wall" dates to this paper.
- Xiao, G. et al., "Efficient Streaming Language Models with Attention Sinks," arXiv 2309.17453, 2023. First systematic characterization of attention sink behavior in transformer models.
- Indian Patent Application No. 202641062302. Title: "Methods and Systems for Attention-Sink-Aware SRAM Placement of Key-Value State in Transformer Inference." Filed 16 May 2026, Chennai Patent Office.