
The Attention Sink Problem: Why Transformer Inference Wastes More Memory Than You Think

Transformers are forced by softmax to attend somewhere — and when there is no semantically meaningful target, they pile attention mass onto a single structural artifact: the first token. That token is useless for prediction. But its KV state must live in memory forever. This is the attention sink problem, and it is quietly inflating your cache budget at every sequence length.

Manish KL · April 2026 · ~17 min read · Systems Essay
Contents
  1. Why this matters now
  2. The softmax mass conservation problem
  3. How attention sinks form
  4. The memory footprint of a sink
  5. Visualizing the waste
  6. Long-context amplification
  7. Where current systems are blind
  8. Policy responses that actually help
  9. Toward sink-aware tier placement
  10. Final thought

Why this matters now

The last two years have been defined by a race toward longer and longer context windows. What started as 4K tokens has marched through 32K, 128K, and now into the millions. The infrastructure response has been to build bigger KV caches, add more HBM, design smarter eviction policies, and reach for tiered memory hierarchies that spill to DRAM and NVMe when the fast tier fills up.

All of that work is necessary. But almost none of it addresses a more fundamental issue that sits upstream of eviction policy and tier placement: a structurally mandated class of KV entries that cannot be evicted regardless of access pattern, because removing them causes model output to degrade in a sharp and nonlinear way.

These entries are not hot because they are semantically relevant. They are hot because softmax forces attention mass onto them as a pressure valve when the model has no better option. They are the attention sinks — and the first token in nearly every transformer sequence is one.

The first token in your sequence is not important. But its KV state must stay in memory for the entire inference run. That is not a policy choice. It is a mathematical consequence of how transformers distribute attention.

The implications for serving infrastructure are deeper than they first appear. This essay traces the mechanism from softmax arithmetic through to KV residency requirements, quantifies the memory overhead at scale, and then asks what a memory policy layer should actually do about it.

The softmax mass conservation problem

To understand sinks, you have to start with what softmax actually does to attention scores.

In the standard attention mechanism, a query vector q is dotted with every key vector k_i in the context, scaled, and then passed through softmax to produce a probability distribution over positions. The output at each position is a weighted sum of value vectors.

attention(Q, K, V) = softmax(QK^T / √d_k) · V

The critical property is that softmax outputs must sum to exactly 1.0 across all positions in the distribution. Every unit of probability mass assigned to one position is mass taken away from the others. This is conservation of attention.

Under ideal conditions this is fine. The model has learned which positions are contextually relevant, and it concentrates mass on them. But consider the case where a token genuinely has no useful contextual precedent — where no prior token carries information that would help predict the current output. In that situation, the model still has to put the attention mass somewhere. Softmax does not allow a "none of the above" option. It will push probability onto whatever positions have the least-negative logit scores, even if those logits are semantically meaningless.
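The constraint is easy to see directly. A minimal sketch (numpy assumed): whether the logits are sharply peaked or semantically meaningless, the output distribution sums to 1 and mass lands somewhere.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Case 1: one clearly relevant position -> mass concentrates on it.
peaked = softmax(np.array([0.1, 0.2, 6.0, 0.3]))

# Case 2: no meaningful target -> mass is still fully distributed.
flat = softmax(np.array([0.1, 0.2, 0.15, 0.3]))

# Both sum to exactly 1: there is no "none of the above" outcome.
assert np.isclose(peaked.sum(), 1.0) and np.isclose(flat.sum(), 1.0)
```

In the second case the mass spreads across whichever positions have the highest logits, relevant or not, which is exactly the pressure that creates a sink.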

That is the structural pressure. And it creates a reliable pattern across transformer architectures: certain positions in the sequence become attentional sinks — positions that absorb excess mass without contributing meaningfully to output quality.

How attention sinks form

The canonical sink is the first token — typically the beginning-of-sequence marker, a system prompt token, or whatever appears at position zero. The reasons are partly architectural and partly numerical.

During pretraining, the model learns that position zero is always visible. It never needs to be attended to for positional or recency reasons; it just sits there, reliably present, accumulating low-but-stable attention from every position that has nowhere better to direct its mass. Over billions of training steps, the value vectors at position zero get shaped to absorb this diffuse load gracefully — their gradient signal is dominated by "receive mass from confused queries." The result is a position whose value vector has been tuned to be a good sink rather than a good contributor to predictions.

The sink pattern has been documented empirically across GPT-2, LLaMA, Mistral, and most major open-weight model families. In every case, when you visualize attention weights across heads and layers, you find a consistent bright column at position zero that cannot be explained by semantic content.

The sink is not a bug — it is a learned structural response. The model has been trained on a softmax constraint that requires mass conservation, and it has learned to route surplus mass onto a stable, always-present token. From the model's perspective, this is working as intended. From the memory system's perspective, it means the first token's KV state must remain resident at all times, because the model has built a dependency on it at every layer.

The first token is the most reliable sink, but it is not the only one. In models with very long context, additional sinks can form at other structurally stable positions — punctuation tokens, separator tokens, and special formatting markers are common candidates. Any token that is always present and always visible becomes a candidate for the sink role.

The memory footprint of a sink

A single KV entry for one token across all layers of a modern large model is not trivial. For a 70B-parameter model with 80 transformer layers, 8 KV heads per layer, and a head dimension of 128, stored in BF16:

per-token KV = 2 (K and V) × 80 layers × 8 heads × 128 dim × 2 bytes
             = 2 × 80 × 8 × 128 × 2
             = 327,680 bytes
             ≈ 320 KB per token

That is 320 KB that must stay in the fast tier — in HBM — for the entire duration of the inference run, because the model will query against it at every single decode step. For a long-context job running for 10,000 generated tokens, that one sink entry will be accessed roughly 10,000 times, but the cost is not just bandwidth. It is residency: 320 KB of HBM locked to a single token whose semantic contribution to output quality is approximately zero.

Now multiply that across batched serving. A serving system running a batch of 64 concurrent requests, each with 128K context, has 64 sinks. That is 64 × 320 KB = 20 MB of guaranteed-resident HBM for tokens that carry no semantic information. Across a fleet of 64 serving instances, that is 1.28 GB of HBM holding worthless KV state.
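The arithmetic above generalizes to any model shape. A small helper, using the essay's 70B configuration as defaults (80 layers, 8 KV heads, head dimension 128, BF16):

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()       # 327,680 bytes = 320 KB
batch_sinks = 64 * per_token           # one sink per request, batch of 64
fleet_sinks = 64 * batch_sinks         # across 64 serving instances

print(per_token // 1024, "KB per token")                # 320 KB
print(batch_sinks // (1024 * 1024), "MB per batch")     # 20 MB
print(fleet_sinks // (1024 * 1024), "MB per fleet")     # 1,280 MB
```

Swapping in a different layer count, KV head count, or quantized dtype width immediately shows how the sink's pinned footprint scales with the architecture.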

The absolute number sounds manageable. But the issue is not just the absolute bytes — it is the interaction with eviction policy. Standard eviction algorithms, from simple LRU through to more sophisticated recency-weighted schemes, see the sink as one of the hottest entries in the cache because it is accessed at every decode step. The eviction policy will never touch it. It will correctly protect it because it is genuinely hot. But hotness and utility are different things, and memory systems currently have no way to distinguish between them.

Visualizing the waste

The diagram below maps the lifecycle of a sink token's KV entry through a standard serving stack. Notice that the sink entry never leaves the hot tier: it is admitted at prefill, protected by eviction policy throughout decode, and only released when the request completes. The system behaves correctly by its own rules — which is precisely the problem.

[Figure: attention sink KV residency lifecycle through prefill, decode, and completion. Memory cost summary per request (70B model, 128K context, BF16): sink KV ~320 KB · residency duration: entire decode · eviction probability: 0 · semantic contribution: near-zero · current policy awareness of this distinction: none.]
Figure 1. Lifecycle of a sink token's KV entry through a standard serving stack. The sink is admitted during prefill, accessed at every decode step (making it appear hot to LRU), and never evicted until request completion — despite carrying zero semantic utility.

Long-context amplification

The sink problem is annoying at short context lengths. At long context lengths, it compounds in ways that interact badly with the tiered memory architectures the industry is building to handle them.

Consider a request with 1M token context. The KV cache for that context is enormous — on the order of several GB for a large model. The serving system has to make hard choices about what stays in HBM versus what spills to DRAM or NVMe. It will use attention-access frequency as the primary signal for what to keep hot, because that is the best signal available.

Attention frequency, however, does not distinguish between "hot because important" and "hot because it is a sink." Both will be protected. But the sink takes up HBM headroom that could have been used for a token whose value content genuinely mattered for the next generated token. When the system needs to evict something to make room, it will evict a slightly less-hot semantic token — causing a cache miss and a fetch from DRAM or NVMe that adds latency to the decode step.

The causal chain is: sink consumes HBM → useful token evicted → fetch on next access → decode latency increases → tail latency degrades. This is not a hypothetical. It is the mechanical consequence of conflating access frequency with utility, and it happens every time the cache fills up under load.

There is a second compounding effect. In models that exhibit multiple sinks — at position zero and at a few other structurally stable positions — the number of locked HBM entries grows. A model with five reliable sinks has five permanently resident token entries per request, each carrying the full ~320 KB cross-layer footprint. For a batch of 128 requests, that is 5 × 128 = 640 pinned entries, roughly 200 MB of HBM that no eviction policy will ever reclaim. The locks compound across the batch.

Where current systems are blind

Most KV cache management systems today operate on one of three approaches: static maximum length limits that discard oldest tokens first, LRU-style recency weighting, or more recent attention-frequency-weighted schemes that try to approximate which tokens the model "cares about."

All three approaches share the same blind spot: they treat access frequency as a proxy for utility. The sink entries score maximally on all three metrics. They are always accessed, always recent in the attention pattern, and appear maximally important by frequency analysis. Any system that uses access frequency as its primary signal will correctly identify sinks as high-priority cache residents. And then it will make exactly the wrong decision, because frequency and utility are not the same thing.

The research community has identified this problem. The StreamingLLM work from MIT and Meta proposed reserving a fixed set of "sink slots" in the KV cache — explicitly setting aside a small number of positions for sink tokens and managing the rest of the cache independently. That approach works well in streaming inference settings where you need to process text beyond the model's native context window. But it does not fully address the residency problem in serving systems that are managing tiered memory hierarchies under real batch load.

What is needed is a richer semantic signal that can distinguish between "hot because sink" and "hot because relevant." That signal exists — it is latent in the attention pattern itself — but it requires the memory scheduler to be attention-aware in a way that current systems are not.

[Figure: attention weight received vs. query position across a 16K context window. Sink token at position 0: uniformly high from every query position, so LRU never evicts it. Useful mid-context token: sparse spikes from semantically related queries only, so it is evictable when inactive.]
Figure 2. Attention weight profiles contrasted. A sink token (left) receives uniformly high attention from every query position — it will always appear maximally hot to any frequency-based eviction policy. A semantically useful mid-context token (right) receives attention only from related query positions and is a legitimate eviction candidate when those positions are not actively decoding. Current eviction policies cannot distinguish these profiles.

Policy responses that actually help

There are several directions that a well-designed memory scheduler can take once it is aware of the sink phenomenon.

Explicit sink slot reservation

The simplest intervention is to give sinks their own reserved tier slot that is tracked separately from the general eviction pool. Instead of allowing the sink to compete for HBM space against semantic tokens — and silently winning every time — the scheduler reserves a small fixed number of slots specifically for sink entries and manages the rest of the cache independently.

This does not reduce the memory footprint of the sink. The 320 KB is still consumed. But it removes the sink from the eviction competition, which means it stops crowding out semantic tokens that have a real chance of being useful to the next decode step. The effective cache utilization for semantically relevant content goes up without adding any memory.
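One way to realize the reservation is to keep two structures side by side: a fixed set of pinned sink slots and an ordinary LRU pool for semantic entries. A minimal sketch, with hypothetical names (`SinkAwareCache`, capacities counted in entries rather than bytes for brevity):

```python
from collections import OrderedDict

class SinkAwareCache:
    """KV cache with a fixed sink reservation zone outside the eviction pool.

    Sink entries are pinned for the life of the request; semantic entries
    compete under plain LRU and can still be evicted normally.
    """
    def __init__(self, sink_slots, semantic_slots):
        self.sink_slots = sink_slots
        self.semantic_slots = semantic_slots
        self.sinks = {}                      # never evicted
        self.semantic = OrderedDict()        # LRU order: oldest first

    def admit(self, token_pos, kv_entry, is_sink=False):
        """Admit an entry; returns the evicted position, or None."""
        if is_sink:
            if len(self.sinks) >= self.sink_slots:
                raise RuntimeError("sink reservation zone full")
            self.sinks[token_pos] = kv_entry
            return None                      # sinks never trigger eviction
        evicted = None
        if len(self.semantic) >= self.semantic_slots:
            evicted, _ = self.semantic.popitem(last=False)  # evict LRU
        self.semantic[token_pos] = kv_entry
        return evicted

    def access(self, token_pos):
        if token_pos in self.sinks:
            return self.sinks[token_pos]     # no recency bookkeeping needed
        self.semantic.move_to_end(token_pos) # refresh LRU position
        return self.semantic[token_pos]
```

Because the sink never enters the semantic pool, its every-step accesses stop distorting the LRU ordering, which is the entire point of the reservation.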

Sink-type labeling in the memory intent IR

A stronger intervention is to surface sink classification as a first-class signal in the interface between the compiler and the memory scheduler. If the compiler can emit a memory intent annotation that says "this KV entry is a structural sink, not a semantic entry," the scheduler can make much better decisions: place it in a lower-cost tier, deprioritize it when choosing between promotion candidates, and never count it against the utilization budget for the semantic portion of the cache.

This requires the compiler to have some model of which positions will behave as sinks. For most current architectures, that classification is learnable from attention statistics gathered during profiling — it does not require changes to the model weights or training procedure.

Attention-entropy-gated residency

A more sophisticated approach uses attention entropy as a residency signal. A genuine sink produces a near-uniform attention-received distribution: mass arrives from nearly every query position, so the distribution is flat and its entropy is high — close to the maximum for the context length. A semantically useful token produces a low-entropy distribution: mass arrives only from semantically related queries, so the distribution is sparse and peaked.

A scheduler that tracks per-entry attention entropy across decode steps can automatically classify entries as sink-like or semantic-like and route them to different eviction tiers accordingly. This is more expensive to compute than simple recency tracking, but the signal is accurate and does not require offline profiling or model-specific tuning.
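A sketch of the classification step, assuming the scheduler has accumulated, for each cache entry, the attention weight it received from every query position (the function names and the 0.9 threshold are illustrative, not a tuned policy):

```python
import math

def received_entropy(weights):
    """Shannon entropy (bits) of the normalized attention-received profile."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)

def classify_entry(weights, uniform_fraction=0.9):
    """Label an entry sink-like if its received profile is close to uniform.

    A sink receives mass from nearly every query, so its profile is flat and
    its entropy approaches the maximum log2(n). A semantic token's profile is
    peaked on a few related queries, so its entropy sits far below that bound.
    """
    max_entropy = math.log2(len(weights))
    if received_entropy(weights) >= uniform_fraction * max_entropy:
        return "sink-like"
    return "semantic-like"

# Flat profile: every query sends similar mass -> sink-like.
print(classify_entry([0.01] * 1024))                       # sink-like
# Peaked profile: only a few related queries attend -> semantic-like.
print(classify_entry([0.5, 0.3, 0.2] + [1e-6] * 1021))     # semantic-like
```

The scheduler would run this on profiles accumulated over a window of decode steps and route the two classes to different eviction tiers.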

Toward sink-aware tier placement

The broader principle behind all of these interventions is the same: access frequency is a necessary but not sufficient signal for KV cache management. A mature memory scheduler needs at minimum two independent signals — frequency (how often is this entry accessed) and utility (how much does this entry contribute to output quality) — and it needs to make tier placement decisions based on the combination of both, not frequency alone.

Sinks are the clearest example of the frequency-utility divergence, but they are not the only one. Tokens from the very beginning of a long system prompt may also be accessed frequently by early-context queries without contributing useful information to the current generation position. Repeated delimiter tokens across a structured document may accumulate high access counts without carrying semantic weight.

The right mental model is not "hot tokens stay in HBM, cold tokens spill to DRAM." The right model is: "semantically load-bearing tokens stay in HBM. Structurally mandated but semantically inert tokens should be tracked separately and given a different residency policy — perhaps reserved slots, perhaps a compressed representation, perhaps selective eviction with graceful degradation rather than hard pinning."

[Figure: proposed sink-aware KV cache tier architecture. HBM is split into a sink reservation zone (fixed slots, no eviction competition) and a semantic KV pool (managed by frequency + utility signals with entropy-gated eviction), with semantic overflow to a DRAM tier (prefetch staging) and an NVMe tier (cold context, demand-paged).]
Figure 3. Proposed sink-aware KV cache tier architecture. The sink reservation zone is a fixed allocation outside the eviction pool — sinks occupy it for the duration of the request but never compete with semantic tokens for HBM headroom. The semantic pool can be managed with normal eviction policy, and its full capacity is available for useful content, with graceful overflow to DRAM and NVMe for cold but semantically valid entries.

Final thought

The attention sink problem is a good example of what happens when a system is built around one signal — access frequency — without asking whether that signal is actually a proxy for the thing you care about.

In most caching problems, hot means important. That assumption is so deeply embedded in cache design that it rarely gets questioned. But transformers have a structural property that breaks the assumption: softmax forces attention mass to go somewhere even when there is no meaningful destination, and the resulting "hot" entries are hot because of mathematical necessity, not semantic content.

A memory scheduler that does not know about sinks will do exactly the right thing by its own rules and produce exactly the wrong outcome. It will protect the sink, evict the useful token, and then wonder why the cache is underperforming at long context lengths.

Fixing this requires the memory system to have a richer model of what it is caching. Access frequency needs to be joined by utility estimation — and utility, in a transformer serving context, is fundamentally an attention-derived concept. The systems that figure out how to bring that signal into the scheduler will build materially better long-context serving infrastructure. Not because they added more memory. Because they learned to tell the difference between hot and important.


Manish KL writes about AI infrastructure, memory systems, and accelerator architecture. Related essays: Speculative Decoding Is a Memory Problem · Long-Context Inference Needs Better Memory Policy, Not Just More Memory · MCOS: A Memory-Centric Operating System for AI

© 2026 Manish KL