Inference is colliding with memory movement, not just compute.
As context windows grow to 128K, 1M, and beyond, LLM inference increasingly bottlenecks on the cost of moving KV tensors — not on FLOPs alone.
Every decode step must load, decompress, and attend over all stored KV state. In a 128K-token context at 32 layers and 32 heads with half-precision 128-dim keys/values, the raw KV cache approaches ~67 GB. Even with 4-bit quantization, that's ~17 GB, and it must be partially traversed every step.
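The arithmetic behind these figures can be checked directly. The sketch below assumes 128,000 tokens, 32 layers, 32 KV heads of dimension 128, and 2 bytes per element, which is the configuration that reproduces the ~67 GB figure. The function name `kv_cache_bytes` is illustrative, and the INT4 line ignores scale and zero-point overhead.

```python
def kv_cache_bytes(tokens, layers, heads, head_dim, bytes_per_elem):
    """Raw KV-cache size: keys and values for every token, layer, and head."""
    return 2 * tokens * layers * heads * head_dim * bytes_per_elem


# Assumed configuration: 128,000 tokens, 32 layers, 32 heads, 128-dim heads.
fp16 = kv_cache_bytes(128_000, 32, 32, 128, 2)
int4 = kv_cache_bytes(128_000, 32, 32, 128, 0.5)  # 4 bits = 0.5 bytes per element

print(f"FP16: {fp16 / 1e9:.1f} GB, INT4: {int4 / 1e9:.1f} GB")
# prints "FP16: 67.1 GB, INT4: 16.8 GB"
```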
The common instinct: compress KV more aggressively. GhostKV asks a different question: can a cold token prove it does not matter before any of its KV is loaded?
Most tokens become semantically dormant.
In long-context transformers, empirical analysis of attention patterns reveals a consistent structure: a small fraction of historical tokens (typically sink tokens at positions 0–5, instruction tokens, and recent tokens) dominate total attention mass. The majority are nearly inactive.
StreamingLLM observed that attention sinks are critical. H2O and SnapKV showed that only a small per-head KV budget is needed for high quality. GhostKV goes further: instead of keeping a fixed budget of the "most important" tokens, it converts dormant tokens into compact witnesses that can prove their own irrelevance — or trigger reconstruction when they can't.
GhostKV introduces attention witnesses.
Instead of storing full key/value tensors for cold tokens, GhostKV converts them into ghost records, compact structures that retain just enough information to answer one question per decode step: could this token matter to the current query?
A ghost record has three components:
Attention Sketch
A small projection (e.g. 8–32 dims) of the key state that allows fast approximate query-similarity scoring without loading full KV. Think locality-sensitive hashing or a random Gaussian projection of K.
Semantic Anchor ID
A reference to a shared semantic cluster or contextual anchor. Multiple nearby tokens can share one anchor, enabling residual-style reconstruction from a compact stored basis.
Residual Fingerprint
A compact uncertainty envelope that bounds the maximum error introduced by approximation. Used to ensure the elimination decision is conservative (safe) rather than greedy.
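As a minimal sketch of what such a record could look like, the code below builds one from a key vector using a random Gaussian projection. Everything here is an assumption for illustration: the names `GhostRecord`, `make_projection`, and `ghostify`, the 16-dim sketch size, and the fact that `eps_res` is passed in as a placeholder rather than derived (the text does not specify how the fingerprint is computed).

```python
import random
from dataclasses import dataclass


@dataclass
class GhostRecord:
    sketch: list     # low-dimensional projection of the key
    anchor_id: int   # index of a shared semantic anchor
    eps_res: float   # residual fingerprint: assumed bound on approximation error


def make_projection(key_dim, sketch_dim, seed=0):
    """Random Gaussian matrix, scaled so dot products are preserved in expectation."""
    rng = random.Random(seed)
    scale = sketch_dim ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(key_dim)]
            for _ in range(sketch_dim)]


def project(matrix, vec):
    """Matrix-vector product: one output value per projection row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def ghostify(key, anchor_id, eps_res, matrix):
    """Replace a full key with its sketch plus anchor and fingerprint metadata."""
    return GhostRecord(project(matrix, key), anchor_id, eps_res)


rng = random.Random(1)
key = [rng.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for a real key vector
record = ghostify(key, anchor_id=7, eps_res=0.05, matrix=make_projection(128, 16))
print(len(record.sketch), record.anchor_id)
```

The same projection matrix would have to be applied to the query at decode time so that sketch-space dot products are comparable.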
From full KV to ghost state
Standard transformers store tensors shaped KV[layer][token][head][dim]. GhostKV introduces a lifecycle with four states. Fresh tokens begin as exact KV and transition through the hierarchy as they age and their attention mass declines.
| State | Representation | Est. size per token | Purpose |
|---|---|---|---|
| Hot | Exact FP16 or BF16 KV in HBM/SRAM | ~4 KB (128d, 32h) | Recent tokens with high reuse probability |
| Warm | INT4/INT8 quantized KV | ~500–1000 B | Tokens likely still participating in attention |
| Ghost | Sketch + anchor ID + residual fingerprint | ~32–64 B | Cold tokens that only need to prove whether they matter |
| Archive | Anchor metadata or compressed fragment on DRAM/CXL/SSD | <16 B in HBM | Rarely-accessed long-term memory |
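The lifecycle in the table can be read as a small state machine. The sketch below is one possible demotion rule, assuming tokens are classified by age and recent attention mass; the thresholds `HOT_WINDOW`, `WARM_MASS`, and `GHOST_MASS` are hypothetical values chosen for illustration, not figures from the design.

```python
from enum import Enum


class State(Enum):
    HOT = 0
    WARM = 1
    GHOST = 2
    ARCHIVE = 3


# Hypothetical thresholds, for illustration only.
HOT_WINDOW = 1024   # tokens younger than this stay exact
WARM_MASS = 1e-3    # minimum recent attention mass to stay quantized
GHOST_MASS = 1e-5   # minimum recent attention mass to keep a ghost record


def next_state(age, attn_mass):
    """Pick a tier from a token's age (in tokens) and its recent attention mass."""
    if age < HOT_WINDOW:
        return State.HOT
    if attn_mass >= WARM_MASS:
        return State.WARM
    if attn_mass >= GHOST_MASS:
        return State.GHOST
    return State.ARCHIVE


print(next_state(10, 0.0), next_state(5000, 1e-4))
```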
Bounded attention elimination
GhostKV tries to prove a token cannot matter before decompression or reconstruction occurs. For query vector Q and ghost token Gᵢ, the runtime computes a conservative upper-bound estimate of the maximum possible attention score:
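AttnUB(Q, Gᵢ) = sketch_sim(Q, Gᵢ.sk) + Gᵢ.ε_res + Gᵢ.σ_anchor
Resurrect if: AttnUB(Q, Gᵢ) ≥ θ_elim
Here sketch_sim is the similarity computed in sketch space, ε_res is the error bound carried by the residual fingerprint, and σ_anchor is the anchor uncertainty term.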
    // Runtime pseudocode: per ghost token, per decode step
    for Gᵢ in ghost_tokens:
        bound = sketch_sim(Q, Gᵢ.sk) + Gᵢ.ε_res + Gᵢ.σ_anchor
        if bound < θ_elim:
            // Provably safe to skip. Never loaded, decompressed, or transferred.
            softmax_denom += exp(ELIM_CONSTANT)  // add negligible bounded constant
            continue
        else:
            resurrect(Gᵢ)  // DMA fetch from DRAM/CXL
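A runnable version of the same loop may make the control flow easier to follow. This is a sketch under stated assumptions: `sketch_sim` is taken to be a plain dot product in sketch space, ghosts are tuples of `(token_id, sketch, eps_res, sigma_anchor)`, and `elim_logit` stands in for `ELIM_CONSTANT`. None of these choices is fixed by the text above.

```python
import math


def sketch_sim(q_sketch, g_sketch):
    """Assumed similarity: dot product between query and ghost sketches."""
    return sum(a * b for a, b in zip(q_sketch, g_sketch))


def filter_ghosts(q_sketch, ghosts, theta_elim, elim_logit):
    """Return (token ids to resurrect, softmax mass credited to skipped tokens)."""
    survivors, skipped_mass = [], 0.0
    for token_id, sketch, eps_res, sigma_anchor in ghosts:
        bound = sketch_sim(q_sketch, sketch) + eps_res + sigma_anchor
        if bound < theta_elim:
            # Skipped: credit a small constant to the softmax denominator.
            skipped_mass += math.exp(elim_logit)
        else:
            survivors.append(token_id)
    return survivors, skipped_mass


ghosts = [
    (0, [0.9, 0.1], 0.05, 0.05),    # bound = 0.9 + 0.1 = 1.0, resurrected
    (1, [-0.5, 0.2], 0.05, 0.05),   # bound = -0.5 + 0.1 = -0.4, eliminated
]
survivors, mass = filter_ghosts([1.0, 0.0], ghosts, theta_elim=0.0, elim_logit=-20.0)
print(survivors, mass)
```

The loop only ever reads the compact record; the full KV for token 1 is never touched.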
Walking through one decode step
Consider a model generating token T+1 in a 32K context. The query Q encodes "What was the final conclusion?" — semantically rich with long-range retrieval intent.
Sketch → Bound → Resurrect → Attend
Traditional KV-cache systems load and decompress first, then discover many tokens were irrelevant. GhostKV inverts this order — the filtering happens near memory, before data moves to the GPU.
Sketch
Compare query sketch against ghost sketches. Cheap dot products near the memory tier.
Bound
Compute safe upper bounds on maximum possible attention contribution per ghost.
Resurrect
Fetch and reconstruct only survivors. Everything else stays cold and unloaded.
Attend
Run exact attention on a much smaller candidate set. The design target is roughly 1–5% of context, a figure that still needs empirical validation.
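The final Attend step can be sketched as ordinary softmax attention over the surviving candidates, with the mass credited to eliminated ghosts added to the denominator. The function `attend` and its `skipped_mass` parameter are illustrative names. This version omits the usual max-subtraction for numerical stability, which a real kernel would need to apply consistently to both the logits and the skipped mass.

```python
import math


def attend(query, keys, values, skipped_mass=0.0):
    """Exact scaled-dot-product attention over a candidate set.

    skipped_mass is the denominator contribution assigned to eliminated ghosts.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = [math.exp(logit) for logit in logits]
    denom = sum(weights) + skipped_mass
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / denom
            for i in range(dim)]


# A zero query gives every candidate the same weight.
out = attend([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
print(out)  # prints [0.5, 0.5]
```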
Token resurrection
Tokens that survive the elimination bound are "resurrected" — their full or near-full KV is reconstructed before being handed to the attention kernel. Several reconstruction paths can be combined:
Anchor + residual
Rebuild from a shared semantic anchor and a compact per-token residual. Lowest fetch cost; quality bounded by anchor fidelity.
Compressed fragment fetch
Pull a higher-fidelity quantized KV block from DRAM, CXL memory, or SSD. Near-exact quality at higher latency, pipelined with compute to hide the cost.
Low-rank reconstruction
Use a learned low-rank basis (e.g. from SVD-style compression during ghosting) for approximate key/value recovery at intermediate quality.
Predictive prefetch
Scan sketches 1–2 decode steps ahead using predicted query vectors — prefetch likely resurrection candidates before the attention kernel stalls on them.
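To make the anchor + residual path concrete, here is a minimal sketch that stores a key as a quantized offset from its anchor and rebuilds it on resurrection. The uniform quantizer with step size `step` is an assumption chosen for simplicity; the text does not say how residuals are encoded.

```python
def quantize_residual(key, anchor, step):
    """Encode the key as integer multiples of `step` away from its anchor."""
    return [round((k - a) / step) for k, a in zip(key, anchor)]


def reconstruct(anchor, residual_q, step):
    """Rebuild an approximate key from the shared anchor and the stored residual."""
    return [a + r * step for a, r in zip(anchor, residual_q)]


anchor = [1.0, -2.0, 0.5]     # shared by several nearby tokens
key = [1.12, -1.77, 0.41]     # the token's true key
step = 0.1

residual = quantize_residual(key, anchor, step)
approx = reconstruct(anchor, residual, step)
max_err = max(abs(k - x) for k, x in zip(key, approx))
print(residual, max_err)
```

With a uniform quantizer the per-dimension error is at most half a step, which is the kind of quantity a residual fingerprint could record as its error bound.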
Latency model: when resurrection is viable
The core performance concern: decode steps are fast and memory-bound. If a resurrected token's DMA fetch isn't hidden, it stalls execution. Resurrection is viable only when the fetch time for the resurrected fraction of tokens fits inside the compute time available to overlap it, where:
- p_resurrect = fraction of tokens resurrected per step
- T_fetch = DMA latency per resurrected token (DRAM ~80ns, CXL ~150ns)
- T_compute = attention kernel time for hot+warm tokens
At decode batch size = 1, compute time is short (~10–50μs on A100), making latency hiding difficult. GhostKV's primary target is batch size > 1, where compute per step is proportionally longer. At batch = 8 with a 2% resurrection rate (~2,560 tokens in a 128K context), a pipelined transfer of 2,560 tokens × 4KB = ~10MB takes ~11μs at 900 GB/s, which is overlappable with attention compute. That figure assumes HBM-class bandwidth; fetches from host DRAM or CXL over a PCIe-class link run at tens of GB/s, so the same transfer takes well over 100μs and leans much more heavily on prefetch. Predictive prefetch adds another buffer: scanning sketches for step T+1 while step T's attention runs.
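The transfer-time estimate is simple enough to reproduce. The sketch below uses the 4 KB per-token and 900 GB/s figures from the paragraph above, and adds a 32 GB/s case as an assumed stand-in for a PCIe 4.0 x16-class link. The function name `transfer_time_us` is illustrative, and both cases assume sustained bandwidth with no per-request latency.

```python
def transfer_time_us(tokens, bytes_per_token, bandwidth_gb_s):
    """Time to stream `tokens` records at a sustained bandwidth, in microseconds."""
    return tokens * bytes_per_token / (bandwidth_gb_s * 1e9) * 1e6


context = 128_000
resurrected = context * 2 // 100            # 2% resurrection rate = 2,560 tokens

t_fast = transfer_time_us(resurrected, 4096, 900)   # figure used in the text
t_pcie = transfer_time_us(resurrected, 4096, 32)    # assumed PCIe-class link

print(f"{resurrected} tokens: {t_fast:.1f} us at 900 GB/s, {t_pcie:.0f} us at 32 GB/s")
```

Comparing either number against the ~10–50μs compute window shows how strongly the viability argument depends on which tier the resurrected KV comes from.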
When and where ghostification happens
Converting a token from warm KV to ghost state requires computing its sketch, assigning an anchor cluster, and writing the residual fingerprint. This must not happen on the GPU's critical decode path. Three viable approaches:
Async CPU thread
A background CPU thread processes aging tokens from a warm-to-ghost queue. Low GPU overhead; adds a small memory copy but keeps GPU decode clean.
Prefill phase
Ghostify all tokens beyond a hot window during the prefill pass, before decode begins. No per-step overhead; requires pre-identifying hot vs cold at prompt time.
Near-memory silicon
Offload sketch computation and anchor assignment to a CXL memory controller or UPMEM PIM unit. This is the ideal long-term path, but it requires custom hardware.
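The async CPU thread option can be prototyped with a plain work queue. The sketch below is an assumption-heavy illustration: `ghostify_worker`, the `None` shutdown sentinel, and the placeholder `make_record` (which just truncates the key) are all hypothetical, and a real system would hand over GPU buffers rather than Python lists.

```python
import queue
import threading


def ghostify_worker(inbox, ghost_index, make_record):
    """Drain the warm-to-ghost queue off the decode path."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: shut down
            inbox.task_done()
            break
        token_id, key = item
        ghost_index[token_id] = make_record(key)
        inbox.task_done()


def make_record(key):
    """Placeholder for sketch + anchor assignment + fingerprint computation."""
    return {"sketch": key[:2]}


inbox = queue.Queue()
ghost_index = {}
worker = threading.Thread(
    target=ghostify_worker, args=(inbox, ghost_index, make_record), daemon=True
)
worker.start()

for token_id in range(3):         # decode loop enqueues aging tokens and moves on
    inbox.put((token_id, [float(token_id)] * 8))
inbox.put(None)
inbox.join()                      # only for the demo; decode would not block here

print(sorted(ghost_index))
```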
Head-aware ghosting
Different attention heads specialize differently. Some heads are local and syntactic — they care only about nearby tokens and decay rapidly after a few hundred positions. Others are retrieval heads that carry long-range semantic dependencies across thousands of tokens.
GhostKV should not ghost all heads equally. Per-head ghost aggressiveness can be determined from profiling or learned during fine-tuning:
| Head behavior | Ghost policy | Sketch size |
|---|---|---|
| Local / positional | Aggressive ghosting after 512–1K tokens | 8–16 dims |
| Retrieval-critical | Slower decay; higher-fidelity sketch | 32–64 dims |
| Dormant / inactive | Convert rapidly to minimal witness state | 4–8 dims |
| Semantic memory | Anchor-preserving reconstruction path | 16–32 dims |
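In code, the table could become a per-head policy lookup. The sketch below assumes each head has already been labeled by an offline profiling pass. The 512-token threshold and the sketch sizes come from the low end of the ranges in the table; the other `ghost_after` values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GhostPolicy:
    ghost_after: int   # token age at which this head's KV may be ghosted
    sketch_dims: int   # sketch size used for this head's ghost records


POLICIES = {
    "local": GhostPolicy(ghost_after=512, sketch_dims=8),
    "retrieval": GhostPolicy(ghost_after=8192, sketch_dims=32),  # hypothetical age
    "dormant": GhostPolicy(ghost_after=128, sketch_dims=4),      # hypothetical age
    "semantic": GhostPolicy(ghost_after=2048, sketch_dims=16),   # hypothetical age
}


def may_ghost(head_kind, token_age):
    """True once a token is old enough to be ghosted for this kind of head."""
    return token_age >= POLICIES[head_kind].ghost_after


print(may_ghost("local", 600), may_ghost("retrieval", 600))
```

The same token can therefore be a ghost in one head and still warm in another, which is what makes the scheme head-aware.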
Memory-side attention filtering
GhostKV becomes especially compelling when the sketch-and-bound computation happens near memory — before data ever crosses the HBM or PCIe bandwidth bottleneck. Critically, this is not purely a future-hardware story: suitable substrates exist today.
Near-memory filter
A controller co-located with DRAM or CXL memory maintains ghost sketches and returns only candidate token IDs to the GPU, not full tensors. UPMEM PIM (Processing-In-Memory) DRAM modules with embedded RISC cores are a near-term prototype target for sketch scoring.
CXL 3.0/3.1 fabric
CXL 3.0/3.1 enables Type-3 memory pooling with device-side compute. A CXL memory expander with sketch-scoring logic could filter ghost candidates before they traverse the fabric — returning token ID lists rather than tensors.
Tier-aware cache
Hot exact KV lives in HBM. Warm quantized KV in DRAM. Ghost records in a tiny HBM-resident index (~6 MB for 100K ghost tokens at 64 B each). Archive state on CXL or SSD.
Semantic memory controller
Memory acts as an active participant in attention — not a passive tensor warehouse. The ghost sketch index is a tiny structure: 100K tokens × 32 dims × 2B = 6.4 MB resident in HBM for sketch scoring on the GPU itself, if near-memory silicon isn't available.
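The index sizes quoted above follow from one multiplication each. The sketch below reproduces them and compares the result against a 40 MB L2 cache, the size on an A100, to show why on-GPU scoring is a plausible fallback. The function name is illustrative.

```python
def index_bytes(tokens, bytes_per_record):
    """Total resident size of a per-token index."""
    return tokens * bytes_per_record


sketch_only = index_bytes(100_000, 32 * 2)   # 32 dims x 2 bytes each
full_record = index_bytes(100_000, 64)       # 64 B ghost record

A100_L2_BYTES = 40 * 1024 * 1024             # assumed comparison point
print(sketch_only / 1e6, full_record / 1e6, sketch_only < A100_L2_BYTES)
```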
How GhostKV differs from prior work
Several important systems address KV-cache efficiency. GhostKV is most closely related to eviction-based and sparse-attention methods, but differs in a fundamental way: it does not discard or evict tokens — it converts them into reconstructable witnesses.
| System | Core idea | What happens to cold tokens | GhostKV difference |
|---|---|---|---|
| StreamingLLM | Keep only sink + recent window | Permanently evicted — cannot resurrect | Ghost tokens stay reconstructable |
| H2O | Heavy Hitter Oracle — keep top-K attention tokens | Low-attention tokens evicted | GhostKV keeps witnesses; can resurrect if query changes |
| SnapKV | Prompt-time selection of important KV positions | Unselected positions dropped at prefill | GhostKV is query-adaptive — "importance" re-evaluated per decode step |
| PyramidKV | Layer-wise budget allocation; fewer KV at higher layers | Layer-specific eviction | GhostKV acts within layers; head-aware, query-conditioned |
| KV Quantization (KIVI, WKVQuant) | Store all KV at lower precision | Stored, decompressed every step | GhostKV skips decompression entirely for eliminated tokens |
Why this matters beyond compression
Compression reduces capacity pressure. GhostKV reduces movement pressure. These are different problems with different bottlenecks.
| Problem | GhostKV attack vector |
|---|---|
| HBM bandwidth saturation | Avoid loading cold KV for tokens that fail sketch bounds |
| Decompression stalls | Eliminated tokens are never decompressed at all |
| Long-context latency scaling | Ghost state grows the cache without growing bandwidth cost proportionally |
| Multi-user cache contention | Ghost records are tiny; less HBM pollution per user's cold history |
| Persistent agent memory | Long multi-turn histories stored as reconstructable semantic state, not raw tensors |
| Heterogeneous memory tiers | Tiered residency: sketches in HBM, reconstruction sources in DRAM/CXL/SSD |
Complexity reduction
The bandwidth savings can be expressed compactly. Let N = total context tokens, d = KV dimension, q = quantization factor, K = tokens retained by an eviction policy, H = hot tokens, R = resurrected tokens per step. The decode memory traffic scales as:
| System | Decode memory traffic | What moves |
|---|---|---|
| Standard attention | O(N · d) | All KV tensors every step |
| Quantized KV (INT4) | O(N · d/q) | All KV, lower precision — still loads everything |
| Eviction (H2O, SnapKV) | O(K · d) | Top-K retained tokens — discards the rest permanently |
| GhostKV | O(H · d + R · d) | Hot tokens exact + resurrected survivors only |
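The key is that R ≪ N. If 85% of tokens are cold and 2% of those are resurrected per step, then R ≈ 0.85 × 0.02 × N = 0.017N. GhostKV's traffic approaches O(H · d) in the best case: linear in the hot window, not the full context. This count excludes the sketch scan itself, which still touches one small record per ghost token unless scoring runs near memory.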
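Plugging the example numbers into the table's expressions gives a feel for the scale. The sketch below assumes N = 128,000, d = 128, q = 4, and treats every non-cold token as hot (it ignores the warm tier and the sketch scan), so it is a best-case illustration rather than a prediction.

```python
N = 128_000          # total context tokens
d = 128              # KV dimension (elements per token, per head)
q = 4                # quantization factor, e.g. FP16 -> INT4

cold = N * 85 // 100         # 85% of tokens are cold
H = N - cold                 # assumption: all remaining tokens are hot
R = cold * 2 // 100          # 2% of cold tokens resurrected per step

traffic = {
    "standard": N * d,
    "quantized": N * d // q,
    "ghostkv": (H + R) * d,
}
ratio = traffic["ghostkv"] / traffic["standard"]
print(R, traffic, round(ratio, 3))
```

Under these assumptions GhostKV moves about a sixth of the elements that standard attention does, with the hot window rather than the resurrected set dominating the total.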
Limitations & honest failure modes
A strong concept note acknowledges where it could be wrong. GhostKV has four principal threats:
⚠ Sketch collision rate
Unrelated tokens may project similarly in sketch space, triggering false resurrections that load unnecessary KV. The JL lemma bounds the distortion of random projections only with high probability, so at high token volumes (100K+) even low per-token error rates accumulate. What's the empirical false-positive rate on real attention distributions?
⚠ Anchor drift
Semantic anchors are assigned at ghostification time. If the model's attention geometry shifts mid-generation (common in multi-topic documents), old anchors become stale — increasing σ_anchor and therefore the uncertainty term. This pushes more tokens toward resurrection (conservative but costly) or, if thresholds aren't recalibrated, toward incorrect elimination. Anchor refresh strategies are an open problem.
⚠ Hardware assumptions
The near-memory filtering argument requires compute co-located with DRAM or CXL. On vanilla A100s without UPMEM PIM or CXL 3.x expanders, sketch scoring runs on the GPU itself. This is still viable (the 6.4 MB sketch index fits in L2 cache), but the bandwidth savings are lower because sketch data still crosses HBM, just at much lower volume than full KV.
⚠ Worst-case quality degradation
The elimination bound is conservative in theory, but in practice depends on how well the sketch captures the key distribution. If a critical token has an atypically low sketch norm — e.g. after layer normalization or head-specific scaling — its sketch_sim will underestimate its true attention score, and it may be incorrectly eliminated. Worst-case perplexity delta under aggressive ghosting needs empirical measurement across diverse tasks.
Simulation plan
GhostKV is pre-experimental. Here is the exact sequence of experiments needed to validate or falsify its core claims — ordered from cheapest to most involved.
Sketch quality audit (~1–2 days, no training)
Goal: Measure how well random projections preserve key-query similarity in practice.
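A first pass at this audit needs nothing more than a projection and a comparison loop. The sketch below is a harness shape only: it uses synthetic Gaussian vectors as a stand-in for real keys and queries (an assumption that real attention distributions will not satisfy), and the names `audit`, `sketch_dim`, and `pairs` are illustrative.

```python
import random


def audit(key_dim=128, sketch_dim=16, pairs=200, seed=0):
    """Compare exact dot products with their sketch-space estimates."""
    rng = random.Random(seed)
    scale = sketch_dim ** -0.5
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(key_dim)]
            for _ in range(sketch_dim)]

    def sketch(vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in proj]

    errors = []
    for _ in range(pairs):
        q = [rng.gauss(0.0, 1.0) for _ in range(key_dim)]  # synthetic query
        k = [rng.gauss(0.0, 1.0) for _ in range(key_dim)]  # synthetic key
        exact = sum(a * b for a, b in zip(q, k))
        approx = sum(a * b for a, b in zip(sketch(q), sketch(k)))
        errors.append(abs(exact - approx))

    errors.sort()
    return {
        "mean": sum(errors) / len(errors),
        "p95": errors[95 * len(errors) // 100 - 1],
        "max": errors[-1],
    }


report = audit(pairs=50)
print(report)
```

Replacing the synthetic vectors with keys and queries dumped from a real model, and sweeping `sketch_dim`, would turn this into the audit described above. The `max` column is the one that matters for a conservative bound.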
θ-vs-quality curve (~3–5 days)
Goal: Map the elimination threshold tradeoff space empirically. Metrics to record:
- Perplexity delta vs exact KV (primary quality signal)
- Elimination rate (% of ghost tokens skipped per step)
- Resurrection rate (% of ghost tokens fetched per step)
- LongBench task score delta (F1, Rouge-L, accuracy)
Latency benchmark (~1 week, requires implementation)
Goal: Verify that sketch scoring + selective resurrection is faster than loading all KV, across batch sizes. Metrics to record:
- p50 / p95 decode step latency (ms)
- HBM bandwidth utilization (GB/s, via Nsight Compute or Nsight Systems, with NVTX ranges marking decode steps)
- Throughput (tokens/sec) vs FlashAttention-2 baseline
Research questions worth pursuing
GhostKV is a pre-experimental concept. These are the hard questions that determine whether it works in practice:
Q1 — How tight can elimination bounds become?
The bound must be conservative (never wrongly eliminate) while tight enough to be useful. Loose bounds eliminate fewer tokens; tight bounds risk quality degradation. What sketch dimensionality and uncertainty model achieves the best tradeoff?
Q2 — Should sketches be learned or random?
Random Gaussian projections come with probabilistic distortion guarantees (JL lemma) but are agnostic to the model's attention geometry. Model-aware sketches trained jointly with the transformer could tighten bounds substantially. What's the training overhead?
Q3 — Which layers and heads form durable semantic anchors?
Anchor stability is central to reconstruction quality. If anchors drift as context grows, reconstruction error accumulates. Understanding per-layer, per-head anchor stability determines where aggressive ghosting is safe.
Q4 — Can resurrection DMA reliably overlap with decode compute?
The performance case depends on hiding resurrection latency behind prefill or attention compute. Is this feasible on current GPU/CPU memory hierarchies? What software pipelining is required?
Q5 — How does GhostKV interact with speculative decoding?
Speculative decoding generates multiple candidate tokens in parallel. Ghost elimination thresholds may need to account for the full set of candidate queries, not just the greedy next token.
Q6 — What minimal near-memory silicon is sufficient?
For the hardware co-design path: what is the minimum compute required at the memory controller to perform sketch scoring? A few multiply-accumulates per ghost record — is this feasible in a CXL or HBM PHY layer?
The main criticism — and the answer
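The main criticism: GhostKV still scores a sketch for every ghost token on every decode step, so the filtering pass is itself O(N). This is the right criticism, and it has three answers: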
Sketches are tiny
64 MB for 1M ghost tokens at 64B each vs ~4 GB for 1M full FP16 tokens at ~4KB each. Even if sketch scoring isn't free, the bandwidth cost is ~62.5× lower than loading full KV. The absolute number matters more than the asymptotic one.
Hierarchical sketch trees
At million-token scale, ghost records can be organized into a hierarchical index — anchor-level sketches first, then per-token sketches only for survivors of the coarse filter. O(log N) traversal instead of O(N) for the sketch scoring phase.
Near-memory filtering
With UPMEM PIM or CXL-side compute, sketch scoring happens at the memory controller, so the sketch data never crosses the GPU's memory bus at all. The GPU receives only a list of token IDs to resurrect.
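The hierarchical idea can be illustrated with a two-level filter: score one sketch per anchor first, and descend to per-token sketches only for clusters that survive. This is a sketch under assumptions. Each cluster is taken to carry a precomputed `radius` that upper-bounds how far any member's bound can exceed the anchor's score, and a real O(log N) index would need more than two levels.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hierarchical_filter(q_sketch, clusters, theta_elim):
    """clusters: list of (anchor_sketch, radius, members);
    members: list of (token_id, sketch, eps). Returns (survivors, sketches scored)."""
    survivors, scored = [], 0
    for anchor_sketch, radius, members in clusters:
        scored += 1
        # Coarse bound: if even the most favorable member cannot reach the
        # threshold, skip the whole cluster without reading per-token sketches.
        if dot(q_sketch, anchor_sketch) + radius < theta_elim:
            continue
        for token_id, sketch, eps in members:
            scored += 1
            if dot(q_sketch, sketch) + eps >= theta_elim:
                survivors.append(token_id)
    return survivors, scored


clusters = [
    ([0.8, 0.0], 0.3, [(0, [0.9, 0.1], 0.05), (1, [0.6, 0.0], 0.05)]),
    ([-0.9, 0.0], 0.3, [(2, [-0.8, 0.1], 0.05), (3, [-1.0, 0.0], 0.05)]),
]
survivors, scored = hierarchical_filter([1.0, 0.0], clusters, theta_elim=0.7)
print(survivors, scored)
```

In this toy case four sketches are scored instead of six; the saving grows with the number of clusters that fail the coarse test.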
A secondary criticism is that the system adds engineering complexity — a lifecycle manager, a sketch index, async ghostification, and a resurrection DMA pipeline — on top of an already complex inference stack. This is fair. GhostKV is not a one-line patch. The bet is that the bandwidth savings at long-context scale justify the complexity, the same way FlashAttention's tile scheduling justified its complexity over naive attention.
A different mental model for AI memory
GhostKV suggests that future inference memory systems may stop treating all stored state as raw tensors waiting to be loaded. Instead, cold memory becomes probabilistic, reconstructable, and query-conditioned:
- The exact tensor is no longer always the primary object in the cache.
- Cold tokens store enough to make an elimination decision — not enough to participate in attention directly.
- Reconstruction is expensive and rare — elimination is cheap and common.
- Memory behaves like a smart index, not a passive buffer.
Whether this pans out empirically depends on the tightness of elimination bounds, anchor stability, and whether resurrection latency can be hidden. These look like tractable engineering problems, though none has been measured yet. The conceptual shift, from "compress and load all" to "prove irrelevance before loading", seems worth exploring.
Related areas
GhostKV sits at the intersection of several active research threads. Prior work it builds on or connects to:
- KV-cache compression: KIVI, WKVQuant, and KVSharer explore quantization and sharing of KV tensors.
- Sparse/selective attention: H2O, SnapKV, PyramidKV, and ScissorHands develop eviction policies for KV budgets.
- Approximate attention: Reformer (LSH attention), Longformer, BigBird, and Scatterbrain use structured sparsity.
- Approximate nearest-neighbor search: LSH, HNSW, and FAISS form the basis for sketch-based similarity estimation.
- Memory-side compute: PIM (Processing-in-Memory), CXL smart memory, and near-data processing are relevant hardware directions.
- Semantic compression: Memorizing Transformers and MemGPT treat long-horizon memory as a retrieval problem.
The distinctive claim in GhostKV: combining sketch-based elimination, conservative upper bounds, and lazy reconstruction into a single lifecycle that avoids memory movement for cold tokens rather than just reducing their stored size.