Why AI inference is a memory problem first
There is a question I find almost no one asks when talking about making AI faster: where does the data live, and how long does it take to get it? People talk about compute — FLOPs, tensor cores, matrix multiplication throughput. But in modern large-language-model inference, the bottleneck is almost never compute. It is memory bandwidth.
To understand why, you need to understand what the model actually does each time it generates a word. Transformer inference has two phases. In the prefill phase, the model reads your entire prompt — every token — and produces a key vector and a value vector for each token at every layer and every attention head. These key-value pairs are stored in what the industry calls the KV cache. In a 70-billion-parameter model with 80 layers, 8 KV heads, and a head dimension of 128, that is 2 × 80 × 8 × 128 = 163,840 floating-point numbers per token. At 16-bit precision, that is roughly 327 kilobytes per token.
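To make the sizing concrete, the arithmetic fits in a few lines of Python. The constants below are just the example configuration from this paragraph, not values fixed by the invention:

# KV cache sizing for the example configuration above (80 layers, 8 KV heads,
# head dimension 128, 16-bit values); constants are illustrative.

LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
BYTES_PER_VALUE = 2                                    # fp16 / bf16

values_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM    # key + value: 163,840 numbers
bytes_per_token = values_per_token * BYTES_PER_VALUE   # 327,680 bytes, ~327 KB

print(f"{values_per_token:,} values, {bytes_per_token / 1e3:.0f} KB per token")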
In the decode phase, the model generates one output token at a time. At each step, it computes a query vector for the new token and computes the dot product of that query with every stored key — across every layer, every head, every prior token position. Then it fetches the corresponding values, weights them by the softmax attention scores, and produces the layer output. This happens for all 80 layers on every single decode step.
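Stripped of batching, multi-head bookkeeping, and kernel fusion, one decode step for a single head looks roughly like this. This is a simplified numpy sketch for intuition, not an inference engine's actual kernel:

import numpy as np

def decode_step_attention(q, K_cache, V_cache):
    # q: (d,) query for the new token; K_cache, V_cache: (T, d) for T cached tokens.
    scores = K_cache @ q / np.sqrt(q.shape[-1])   # one dot product per stored key
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_cache                      # attention-weighted sum of the values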
At a 32,000-token context — common in modern deployments — and with the full KV cache resident in HBM (High Bandwidth Memory), each decode step reads roughly ten gigabytes of data. HBM bandwidth is impressive by conventional standards: 3–4 TB/s on the latest HBM3 devices. But that bandwidth is shared across all operations, and the access pattern for a KV cache — many small reads scattered across a large address space — is among the most bandwidth-unfriendly workloads that exist. The result: decoding latency is dominated not by computation, but by how long it takes to retrieve cached key-value data from memory.
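Continuing the arithmetic from the sizing sketch above, the memory-bound floor on a single decode step is easy to estimate. The bandwidth figure here is a point inside the quoted 3–4 TB/s range, chosen only for illustration:

# Per-step KV cache traffic and the resulting latency floor, assuming the
# whole cache is streamed from HBM each step (the worst case described above).

CONTEXT_TOKENS = 32_000
BYTES_PER_TOKEN = 327_680            # from the per-token sizing sketch above
HBM_BANDWIDTH = 3.5e12               # ~3.5 TB/s, within the quoted 3-4 TB/s range

bytes_per_step = CONTEXT_TOKENS * BYTES_PER_TOKEN      # ~10.5 GB per decode step
latency_floor = bytes_per_step / HBM_BANDWIDTH         # ~3 ms, before any compute

print(f"{bytes_per_step / 1e9:.1f} GB per step, >= {latency_floor * 1e3:.1f} ms at peak bandwidth")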
"In modern LLM inference, compute is rarely the bottleneck. Memory bandwidth is. And not all memory accesses are created equal."
This is the landscape into which the present invention fits. The question it answers is: given that we must read from memory on every decode step, and given that we have a small amount of extremely fast on-chip SRAM available, which data should we put there?
The phenomenon that makes this possible
In 2023, a group of researchers at MIT and elsewhere published a paper that would later become quite influential in the inference optimization community. They observed that in transformer models using softmax attention, a small number of initial token positions consistently receive disproportionately high attention weights — even when those tokens carry little semantic content. The beginning-of-sequence token. A delimiter. A repeated structural anchor. They called these positions attention sinks.
The mechanism, briefly: softmax attention must sum to one. When no other token in the context is clearly the right key to attend to, the model "dumps" the remaining probability mass somewhere. Initial tokens — which appear in every sequence and are always available — become the default recipient. Over time, through training, the model learns to rely on these positions as structural anchors. What starts as a numerical artifact becomes, in a sense, a learned behavior.
The original discovery was used to build streaming inference systems that could handle infinite-length sequences by always keeping the sink tokens in a fixed window, even as the context scrolled. That work focused on the algorithm. No one had yet asked the hardware question.
The hardware question is this: if we know — from observing the model's own attention during the prefill pass — which token positions will be accessed with very high frequency throughout the entire decode phase, why are we not placing them in the fastest available memory?
The invention, precisely
The core of the patent is a placement controller — a hardware or software component that sits between the transformer inference engine and its memory subsystem. It does four things:
- Observes: During the prefill phase, it collects attention weight information — either by tapping the attention computation pipeline or by receiving aggregated statistics — and accumulates, per token position, a sink score.
- Classifies: After prefill (or continuously during decode in the dynamic variant), it applies one or more threshold criteria to determine which token positions are sinks.
- Migrates: It triggers a DMA transfer that copies the key and value vectors for the identified sink tokens from the HBM-resident KV cache into a reserved region of on-chip SRAM.
- Routes: During decode, it intercepts KV read requests. Reads for sink-classified token positions are directed to the SRAM path; everything else goes through HBM. The partial attention outputs from the two paths are merged using numerically stable log-sum-exp accumulation.
The model output is mathematically identical to what you would get reading everything from HBM. The token generation latency is not.
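As a rough software analogue of those four roles, the controller's loop looks something like the sketch below. The class and method names, and the array layout, are assumptions made for this sketch, not the patent's interface:

# Illustrative software analogue of the placement controller; names and data
# structures are assumptions of this sketch, not the patent's API.

class PlacementController:
    def __init__(self, sink_threshold, sram_capacity_tokens):
        self.sink_threshold = sink_threshold
        self.sram_capacity = sram_capacity_tokens
        self.sink_positions = set()
        self.scores = None

    def observe(self, attention_weights):
        # attention_weights: ndarray of shape (layers, heads, queries, keys)
        # collected during prefill; accumulate a per-position sink score.
        self.scores = attention_weights.sum(axis=(0, 1, 2))

    def classify(self):
        # Keep the highest-scoring positions that clear the threshold,
        # capped by how many tokens the reserved SRAM region can hold.
        ranked = sorted(range(len(self.scores)), key=lambda t: -self.scores[t])
        self.sink_positions = {
            t for t in ranked[: self.sram_capacity]
            if self.scores[t] >= self.sink_threshold
        }

    def migrate(self, kv_hbm, kv_sram):
        # Stand-in for the DMA transfer of sink-token K/V vectors into SRAM.
        for t in self.sink_positions:
            kv_sram[t] = kv_hbm[t]

    def route(self, token_position):
        # During decode, KV reads for sink positions take the SRAM path.
        return "sram" if token_position in self.sink_positions else "hbm"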
The primary sink score formula
// Cumulative attention score: how much attention has token t received?
S(t) = Σ A(l, h, i, t)   ∀ l ∈ [1..L], h ∈ [1..H], i > t

// A token is a sink if its score exceeds a configured threshold:
classify_sink(t) ⟺ S(t) ≥ T_sink

// Dynamic variant — exponential moving average across decode steps:
E(t, n) = α · E(t, n−1) + (1−α) · B(t, n)

// Where:
//   α      = decay factor (0.85–0.99), controls how fast sinks can emerge/fade
//   B(t,n) = attention mass received by token t at decode step n
//   E(t,n) ≥ T_promote for N_dwell steps → promote to SRAM
//   E(t,n) < T_evict for N_evict steps  → evict from SRAM
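For readers who prefer running code to notation, here is a small numpy rendering of the same formulas. The variable names mirror the pseudocode; the (layers, heads, queries, keys) array layout is an assumption of this sketch:

import numpy as np

def cumulative_sink_scores(A):
    # A: attention weights with shape (L, H, T, T), where A[l, h, i, t] is how
    # much query position i attends to key position t. S(t) sums over all
    # layers, heads, and query positions i > t, as in the formula above.
    _, _, T, _ = A.shape
    i_idx = np.arange(T)[:, None]
    t_idx = np.arange(T)[None, :]
    mask = (i_idx > t_idx).astype(A.dtype)        # keep only contributions with i > t
    return (A * mask).sum(axis=(0, 1, 2))         # S(t), one score per token position

def classify_sink(S, T_sink):
    return S >= T_sink                            # boolean mask of sink positions

def update_ema(E_prev, B_step, alpha=0.95):
    # Dynamic variant: per-position exponential moving average of the attention
    # mass B(t, n) received at decode step n; alpha in the 0.85-0.99 range.
    return alpha * E_prev + (1.0 - alpha) * B_step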
The decode-time merge
The two-path architecture requires combining partial attention results from the SRAM path (sink tokens) and the HBM path (all other tokens). This is done using a numerically stable merge based on log-sum-exp normalization:
// At each decode step n, for each head h of layer l:
O_sink, lse_sink = attend(Q(l,h,n), K_sink, V_sink)   // SRAM path, ~5 ns
O_bulk, lse_bulk = attend(Q(l,h,n), K_bulk, V_bulk)   // HBM path, ~150 ns

// Stable merge (online softmax, no full recomputation):
lse_max = max(lse_sink, lse_bulk)
O_final = (exp(lse_sink − lse_max) · O_sink + exp(lse_bulk − lse_max) · O_bulk)
          / (exp(lse_sink − lse_max) + exp(lse_bulk − lse_max))
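A small numpy sketch makes it easy to check that the merge is exact. Here attend() stands in for the per-path attention kernels, and the split into "sink" and "bulk" slices is arbitrary, chosen only to exercise the math:

import numpy as np

def attend(q, K, V):
    # Partial attention over one slice of the KV cache: returns the normalized
    # output for that slice plus the log-sum-exp of its raw scores.
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    w = np.exp(scores - m)
    lse = m + np.log(w.sum())
    return (w / w.sum()) @ V, lse

def merge(O_sink, lse_sink, O_bulk, lse_bulk):
    # Numerically stable combination of the two partial results.
    lse_max = max(lse_sink, lse_bulk)
    w_sink, w_bulk = np.exp(lse_sink - lse_max), np.exp(lse_bulk - lse_max)
    return (w_sink * O_sink + w_bulk * O_bulk) / (w_sink + w_bulk)

# Sanity check: splitting the cache and merging reproduces full attention.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
O_s, l_s = attend(q, K[:4], V[:4])      # pretend the first 4 tokens are sinks
O_b, l_b = attend(q, K[4:], V[4:])      # everything else stays on the HBM path
O_full, _ = attend(q, K, V)
assert np.allclose(merge(O_s, l_s, O_b, l_b), O_full)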