Why AI inference is a memory problem first
There is a question I find almost no one asks when talking about making AI faster: where does the data live, and how long does it take to get it? People talk about compute — FLOPs, tensor cores, matrix multiplication throughput. But in modern large-language-model inference, the bottleneck is almost never compute. It is memory bandwidth.
To understand why, you need to understand what the model actually does each time it generates a word. Transformer inference has two phases. In the prefill phase, the model reads your entire prompt — every token — and produces a key vector and a value vector for each token at every layer and every attention head. These key-value pairs are stored in what the industry calls the KV cache. In a 70-billion-parameter model with 80 layers, 8 KV heads, and a head dimension of 128, that is 2 × 80 × 8 × 128 = 163,840 floating-point numbers per token. At 16-bit precision, that is roughly 327 kilobytes per token.
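To make the sizing concrete, the arithmetic fits in a few lines of Python. The constants below are just the example configuration from this paragraph, not values fixed by the invention:

# KV cache sizing for the example configuration above (80 layers, 8 KV heads,
# head dimension 128, 16-bit values); constants are illustrative.

LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
BYTES_PER_VALUE = 2                                    # fp16 / bf16

values_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM    # key + value: 163,840 numbers
bytes_per_token = values_per_token * BYTES_PER_VALUE   # 327,680 bytes, ~327 KB

print(f"{values_per_token:,} values, {bytes_per_token / 1e3:.0f} KB per token")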
In the decode phase, the model generates one output token at a time. At each step, it computes a query vector for the new token and computes the dot product of that query with every stored key — across every layer, every head, every prior token position. Then it fetches the corresponding values, weights them by the softmax attention scores, and produces the layer output. This happens for all 80 layers on every single decode step.
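Stripped of batching, multi-head bookkeeping, and kernel fusion, one decode step for a single head looks roughly like this. This is a simplified numpy sketch for intuition, not an inference engine's actual kernel:

import numpy as np

def decode_step_attention(q, K_cache, V_cache):
    # q: (d,) query for the new token; K_cache, V_cache: (T, d) for T cached tokens.
    scores = K_cache @ q / np.sqrt(q.shape[-1])   # one dot product per stored key
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_cache                      # attention-weighted sum of the values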
At a 32,000-token context — common in modern deployments — and with the full KV cache resident in HBM (High Bandwidth Memory), each decode step reads roughly ten gigabytes of data. HBM bandwidth is impressive by conventional standards: 3–4 TB/s on the latest HBM3 devices. But that bandwidth is shared across all operations, and the access pattern for a KV cache — many small reads scattered across a large address space — is among the most bandwidth-unfriendly workloads that exist. The result: decoding latency is dominated not by computation, but by how long it takes to retrieve cached key-value data from memory.
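Continuing the arithmetic from the sizing sketch above, the memory-bound floor on a single decode step is easy to estimate. The bandwidth figure here is a point inside the quoted 3–4 TB/s range, chosen only for illustration:

# Per-step KV cache traffic and the resulting latency floor, assuming the
# whole cache is streamed from HBM each step (the worst case described above).

CONTEXT_TOKENS = 32_000
BYTES_PER_TOKEN = 327_680            # from the per-token sizing sketch above
HBM_BANDWIDTH = 3.5e12               # ~3.5 TB/s, within the quoted 3-4 TB/s range

bytes_per_step = CONTEXT_TOKENS * BYTES_PER_TOKEN      # ~10.5 GB per decode step
latency_floor = bytes_per_step / HBM_BANDWIDTH         # ~3 ms, before any compute

print(f"{bytes_per_step / 1e9:.1f} GB per step, >= {latency_floor * 1e3:.1f} ms at peak bandwidth")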
"In modern LLM inference, compute is rarely the bottleneck. Memory bandwidth is. And not all memory accesses are created equal."
This is the landscape into which the present invention fits. The question it answers is: given that we must read from memory on every decode step, and given that we have a small amount of extremely fast on-chip SRAM available, which data should we put there?
The phenomenon that makes this possible
In 2023, a group of researchers at MIT and elsewhere published a paper that would later become quite influential in the inference optimization community. They observed that in transformer models using softmax attention, a small number of initial token positions consistently receive disproportionately high attention weights — even when those tokens carry little semantic content. The beginning-of-sequence token. A delimiter. A repeated structural anchor. They called these positions attention sinks.
The mechanism, briefly: softmax attention must sum to one. When no other token in the context is clearly the right key to attend to, the model "dumps" the remaining probability mass somewhere. Initial tokens — which appear in every sequence and are always available — become the default recipient. Over time, through training, the model learns to rely on these positions as structural anchors. What starts as a numerical artifact becomes, in a sense, a learned behavior.
The original discovery was used to build streaming inference systems that could handle infinite-length sequences by always keeping the sink tokens in a fixed window, even as the context scrolled. That work focused on the algorithm. No one had yet asked the hardware question.
The hardware question is this: if we know — from observing the model's own attention during the prefill pass — which token positions will be accessed with very high frequency throughout the entire decode phase, why are we not placing them in the fastest available memory?
The invention, precisely
The core of the patent is a placement controller — a hardware or software component that sits between the transformer inference engine and its memory subsystem. It does four things:
- Observes: During the prefill phase, it collects attention weight information — either by tapping the attention computation pipeline or by receiving aggregated statistics — and accumulates, per token position, a sink score.
- Classifies: After prefill (or continuously during decode in the dynamic variant), it applies one or more threshold criteria to determine which token positions are sinks.
- Migrates: It triggers a DMA transfer that copies the key and value vectors for the identified sink tokens from the HBM-resident KV cache into a reserved region of on-chip SRAM.
- Routes: During decode, it intercepts KV read requests. Reads for sink-classified token positions are directed to the SRAM path; everything else goes through HBM. The partial attention outputs from the two paths are merged using numerically stable log-sum-exp accumulation.
The model output is mathematically identical to what you would get reading everything from HBM. The token generation latency is not.
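As a rough software analogue of those four roles, the controller's loop looks something like the sketch below. The class and method names, and the array layout, are assumptions made for this sketch, not the patent's interface:

# Illustrative software analogue of the placement controller; names and data
# structures are assumptions of this sketch, not the patent's API.

class PlacementController:
    def __init__(self, sink_threshold, sram_capacity_tokens):
        self.sink_threshold = sink_threshold
        self.sram_capacity = sram_capacity_tokens
        self.sink_positions = set()
        self.scores = None

    def observe(self, attention_weights):
        # attention_weights: ndarray of shape (layers, heads, queries, keys)
        # collected during prefill; accumulate a per-position sink score.
        self.scores = attention_weights.sum(axis=(0, 1, 2))

    def classify(self):
        # Keep the highest-scoring positions that clear the threshold,
        # capped by how many tokens the reserved SRAM region can hold.
        ranked = sorted(range(len(self.scores)), key=lambda t: -self.scores[t])
        self.sink_positions = {
            t for t in ranked[: self.sram_capacity]
            if self.scores[t] >= self.sink_threshold
        }

    def migrate(self, kv_hbm, kv_sram):
        # Stand-in for the DMA transfer of sink-token K/V vectors into SRAM.
        for t in self.sink_positions:
            kv_sram[t] = kv_hbm[t]

    def route(self, token_position):
        # During decode, KV reads for sink positions take the SRAM path.
        return "sram" if token_position in self.sink_positions else "hbm"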
The primary sink score formula
// Cumulative attention score: how much attention has token t received?
S(t) = Σ A(l, h, i, t)   ∀ l ∈ [1..L], h ∈ [1..H], i > t

// A token is a sink if its score exceeds a configured threshold:
classify_sink(t) ⟺ S(t) ≥ T_sink

// Dynamic variant — exponential moving average across decode steps:
E(t, n) = α · E(t, n−1) + (1−α) · B(t, n)

// Where:
//   α      = decay factor (0.85–0.99), controls how fast sinks can emerge/fade
//   B(t,n) = attention mass received by token t at decode step n
//   E(t,n) ≥ T_promote for N_dwell steps → promote to SRAM
//   E(t,n) < T_evict for N_evict steps  → evict from SRAM
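For readers who prefer running code to notation, here is a small numpy rendering of the same formulas. The variable names mirror the pseudocode; the (layers, heads, queries, keys) array layout is an assumption of this sketch:

import numpy as np

def cumulative_sink_scores(A):
    # A: attention weights with shape (L, H, T, T), where A[l, h, i, t] is how
    # much query position i attends to key position t. S(t) sums over all
    # layers, heads, and query positions i > t, as in the formula above.
    _, _, T, _ = A.shape
    i_idx = np.arange(T)[:, None]
    t_idx = np.arange(T)[None, :]
    mask = (i_idx > t_idx).astype(A.dtype)        # keep only contributions with i > t
    return (A * mask).sum(axis=(0, 1, 2))         # S(t), one score per token position

def classify_sink(S, T_sink):
    return S >= T_sink                            # boolean mask of sink positions

def update_ema(E_prev, B_step, alpha=0.95):
    # Dynamic variant: per-position exponential moving average of the attention
    # mass B(t, n) received at decode step n; alpha in the 0.85-0.99 range.
    return alpha * E_prev + (1.0 - alpha) * B_step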
The decode-time merge
The two-path architecture requires combining partial attention results from the SRAM path (sink tokens) and the HBM path (all other tokens). This is done using a numerically stable merge based on log-sum-exp normalization:
// At each decode step n, for each head h of layer l:
O_sink, lse_sink = attend(Q(l,h,n), K_sink, V_sink)   // SRAM path, ~5 ns
O_bulk, lse_bulk = attend(Q(l,h,n), K_bulk, V_bulk)   // HBM path, ~150 ns

// Stable merge (online softmax, no full recomputation):
lse_max = max(lse_sink, lse_bulk)
O_final = (exp(lse_sink − lse_max) · O_sink + exp(lse_bulk − lse_max) · O_bulk)
          / (exp(lse_sink − lse_max) + exp(lse_bulk − lse_max))
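A small numpy sketch makes it easy to check that the merge is exact. Here attend() stands in for the per-path attention kernels, and the split into "sink" and "bulk" slices is arbitrary, chosen only to exercise the math:

import numpy as np

def attend(q, K, V):
    # Partial attention over one slice of the KV cache: returns the normalized
    # output for that slice plus the log-sum-exp of its raw scores.
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    w = np.exp(scores - m)
    lse = m + np.log(w.sum())
    return (w / w.sum()) @ V, lse

def merge(O_sink, lse_sink, O_bulk, lse_bulk):
    # Numerically stable combination of the two partial results.
    lse_max = max(lse_sink, lse_bulk)
    w_sink, w_bulk = np.exp(lse_sink - lse_max), np.exp(lse_bulk - lse_max)
    return (w_sink * O_sink + w_bulk * O_bulk) / (w_sink + w_bulk)

# Sanity check: splitting the cache and merging reproduces full attention.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
O_s, l_s = attend(q, K[:4], V[:4])      # pretend the first 4 tokens are sinks
O_b, l_b = attend(q, K[4:], V[4:])      # everything else stays on the HBM path
O_full, _ = attend(q, K, V)
assert np.allclose(merge(O_s, l_s, O_b, l_b), O_full)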