Proposal
Core Invention Claim
The novelty is not KV compression. It is query-time, bounded elimination of reconstructable KV witnesses before memory movement occurs.
Prior work compresses, evicts, or sparsifies stored KV. GhostKV converts cold tokens into witnesses that prove their own irrelevance at query time — and only then decides whether to move any data at all.

GhostKV: Attention Without Storing Full KV

A reconstruction-first architecture where cold KV-cache entries become tiny semantic witnesses — and exact KV is materialized only when a query proves the token could matter.

We outline a simulation plan using Llama-3-8B on 64K–128K contexts to validate elimination rates and measure perplexity drift under varied ghosting aggressiveness.

📄 Concept note 🔬 Pre-experimental 🔓 Open for collaboration 🇮🇳 Patent pending · India · 202641062451 · 2026-05-17
[Figure: One decode step (32K context, Q attends 8 tokens). Tokens shown: T31,900 and T31,850 are hot and attended exactly; T28,000 is a ghost that is resurrected; T19,000, T11,200, and T4,400 are ghosts that are eliminated. Legend: hot / exact, resurrected, ghost sketch (never loaded).]
Cold tokens are not merely compressed: they're eliminated before decompression.
The problem

Inference is colliding with memory movement, not just compute.

As context windows grow to 128K, 1M, and beyond, LLM inference increasingly bottlenecks on the cost of moving KV tensors — not on FLOPs alone.

Every decode step must load, decompress, and attend over all stored KV state. For a 128K-token context in a model with 32 layers, 32 heads per layer, and 128-dim keys and values in half precision, the raw KV cache is ~67 GB (2 × 128K × 32 × 32 × 128 × 2 B). Even with 4-bit quantization that is still ~17 GB, and it has to be traversed on every step.

~85% of tokens become low-attention after sliding out of the local window. (Based on H2O / SnapKV attention-mass analysis on LongBench tasks.)
~67 GB of raw FP16 KV cache for a 128K context on a 32-layer, 32-head, 128-dim model. (Computed: 2 × 128K × 32 × 32 × 128 × 2 B.)
>50% of decode wall-time on memory-bound hardware is attributable to KV movement. (Estimated from roofline analysis; pending Llama-3 profiling.)
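The ~67 GB and ~17 GB figures can be checked with a few lines of arithmetic. This is a minimal sketch under two assumptions: "128K" is read as 128,000 tokens, and the model has 32 heads per layer (the head count under which the 2 × tokens × layers × heads × dim × bytes formula gives ~67 GB). The function name `kv_cache_bytes` and its parameters are illustrative only.

```python
def kv_cache_bytes(tokens, layers, heads, head_dim, bits_per_elem):
    """Bytes for keys plus values across all layers (no metadata or padding)."""
    elems = 2 * tokens * layers * heads * head_dim  # factor 2: one K and one V
    return elems * bits_per_elem // 8

# Assumed configuration: 128,000 tokens, 32 layers, 32 heads, 128-dim heads.
fp16 = kv_cache_bytes(128_000, 32, 32, 128, bits_per_elem=16)
int4 = kv_cache_bytes(128_000, 32, 32, 128, bits_per_elem=4)
print(f"FP16: {fp16 / 1e9:.1f} GB, INT4: {int4 / 1e9:.1f} GB")
# FP16: 67.1 GB, INT4: 16.8 GB
```

Changing `heads` or `tokens` shows how quickly the raw cache grows with model width and context length.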

The common instinct: compress KV more aggressively. GhostKV asks a different question:

Do we need to materialize full KV for most tokens at all?
Bandwidth comparison
[Figure: HBM bandwidth consumed per decode step at 128K context, in relative units. Bar 1, "Load all KV (exact)": 100%, split into hot (5%), warm (15%), and cold KV that is loaded but mostly irrelevant (80%). Bar 2, "Load sketches + resurrect": ~8%, made up of hot tokens, tiny ghost sketches, and resurrected tokens (2%); the remainder is not loaded.]
Illustrative. Assumes ~85% cold tokens, 2% resurrection rate, ghost sketches at ~1% of full KV size. The bandwidth story is the story.
GhostKV is not a replacement for KV quantization. INT4/INT8 quantization reduces the byte cost of every stored token. GhostKV reduces the number of tokens that ever get loaded. They are complementary: quantize warm tokens, ghost cold tokens. The biggest gains come from combining both. "Why not just use INT4?" misses that an INT4 token still has to be fetched and dequantized; GhostKV eliminates that fetch entirely for tokens whose bound falls below the elimination threshold.
Core observation

Most tokens become semantically dormant.

In long-context transformers, empirical analysis of attention patterns reveals a consistent structure: a small fraction of historical tokens — typically sink tokens at position 0–5, instruction tokens, and recent tokens — dominate total attention mass. The majority are nearly inactive.

StreamingLLM observed that attention sinks are critical. H2O and SnapKV showed that only a small per-head KV budget is needed for high quality. GhostKV goes further: instead of keeping a fixed budget of the "most important" tokens, it converts dormant tokens into compact witnesses that can prove their own irrelevance — or trigger reconstruction when they can't.

The mechanism

GhostKV introduces attention witnesses.

Instead of storing full key/value tensors for cold tokens, GhostKV converts them into ghost records: compact structures that retain just enough information to answer one question per decode step:

Could reconstructing this exact KV possibly change the attention result?

A ghost record has three components:

01

Attention Sketch

A small projection (e.g. 8–32 dims) of the key state that allows fast approximate query-similarity scoring without loading full KV. Think locality-sensitive hashing or a random Gaussian projection of K.

02

Semantic Anchor ID

A reference to a shared semantic cluster or contextual anchor. Multiple nearby tokens can share one anchor, enabling residual-style reconstruction from a compact stored basis.

03

Residual Fingerprint

A compact uncertainty envelope that bounds the maximum error introduced by approximation. Used to ensure the elimination decision is conservative (safe) rather than greedy.
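To make the three components concrete, here is a minimal sketch of a ghost record and of one way to build it. It assumes the random Gaussian projection option mentioned above for the attention sketch and collapses the residual fingerprint to a single scalar bound. The names `GhostRecord`, `make_projection`, and `ghostify` are hypothetical, and how the anchor ID and the bound would actually be computed is left open.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class GhostRecord:
    sketch: Tuple[float, ...]  # attention sketch: low-dim projection of the key
    anchor_id: int             # index of the shared semantic anchor
    eps_res: float             # residual fingerprint, reduced here to one error bound

def make_projection(d_key: int, d_sketch: int, seed: int = 0) -> List[List[float]]:
    """Random Gaussian projection matrix, scaled by 1/sqrt(d_sketch)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_sketch)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_key)]
            for _ in range(d_sketch)]

def project(proj, vec):
    """Multiply the projection matrix by a vector."""
    return tuple(sum(w * x for w, x in zip(row, vec)) for row in proj)

def ghostify(key, proj, anchor_id, eps_res):
    """Replace a full key with its compact witness (assumed inputs)."""
    return GhostRecord(project(proj, key), anchor_id, eps_res)

proj = make_projection(d_key=128, d_sketch=16)
record = ghostify([0.1] * 128, proj, anchor_id=7, eps_res=0.12)
print(len(record.sketch), record.anchor_id)  # 16 7
```

The same projection matrix must be applied to the query at decode time so that sketch dot products approximate the original key-query dot products.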

Data model

From full KV to ghost state

Standard transformers store tensors shaped KV[layer][token][head][dim]. GhostKV introduces a lifecycle with four states. Fresh tokens begin as exact KV and transition through the hierarchy as they age and their attention mass declines.

State | Representation | Est. size per token | Purpose
Hot | Exact FP16 or BF16 KV in HBM/SRAM | ~4 KB (128d, 32h) | Recent tokens with high reuse probability
Warm | INT4/INT8 quantized KV | ~500–1000 B | Tokens likely still participating in attention
Ghost | Sketch + anchor ID + residual fingerprint | ~32–64 B | Cold tokens that only need to prove whether they matter
Archive | Anchor metadata or compressed fragment on DRAM/CXL/SSD | <16 B in HBM | Rarely-accessed long-term memory
Ghost compression ratio: A ghost record at 32–64 bytes vs a full FP16 KV at ~4 KB is a ~64–128× size reduction per token. At 100K cold tokens, that's ~400 MB of HBM pressure replaced by ~6 MB of witness records.
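The ratio and footprint numbers follow directly from the per-token estimates. The sketch below takes the "~4 KB" full-KV estimate as 4,096 bytes and the ghost record as 32 to 64 bytes; both constants are the note's own estimates rather than measurements, and the helper name is illustrative.

```python
FULL_KV_BYTES = 4096     # "~4 KB" per full FP16 token, taken as 4,096 B
GHOST_BYTES = (32, 64)   # low and high ends of the ghost record estimate

def footprint_mb(n_tokens, bytes_per_token):
    """Footprint in MB (1 MB = 1e6 bytes) for n_tokens records."""
    return n_tokens * bytes_per_token / 1e6

ratios = [FULL_KV_BYTES // b for b in GHOST_BYTES]
print(ratios)                                   # [128, 64]
print(footprint_mb(100_000, FULL_KV_BYTES))     # 409.6
print(footprint_mb(100_000, GHOST_BYTES[1]))    # 6.4
```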
Key innovation

Bounded attention elimination

GhostKV tries to prove a token cannot matter before decompression or reconstruction occurs. For query vector Q and ghost token Gᵢ, the runtime computes a conservative upper-bound estimate of the maximum possible attention score:

Formal bound
AttnUB(Q, Gᵢ) = sketch_sim(Q, Gᵢ.sk) + ε_res + σ_anchor
sketch_sim(Q, Gᵢ.sk): dot product of the query projection against the ghost's sketch vector (~32 dims). O(d_sketch) per token.
ε_res: residual uncertainty, a bound on the approximation error introduced by sketch dimensionality reduction. For random projections this is a JL-style bound, which holds with high probability rather than in the strict worst case.
σ_anchor: semantic variance, the spread of the anchor cluster this token belongs to; wider clusters → larger safety margin.
Eliminate if: AttnUB(Q, Gᵢ) < θ_elim
Resurrect if: AttnUB(Q, Gᵢ) ≥ θ_elim
# Runtime pseudocode, per ghost token per decode step
for Gᵢ in ghost_tokens:
  bound = sketch_sim(Q, Gᵢ.sk) + Gᵢ.ε_res + Gᵢ.σ_anchor
  if bound < θ_elim:
    # Bound is below threshold: skip. Never loaded, decompressed, or transferred.
    softmax_denom += exp(bound)  # conservative upper bound on the skipped term
    continue
  else:
    resurrect(Gᵢ)  # DMA fetch from DRAM/CXL
The crucial shift: the system avoids memory movement, not merely storage. A compressed-KV system still loads and decompresses before scoring. GhostKV scores first, at the ghost tier, and only moves data for tokens that survive elimination.
Softmax denominator note. Standard softmax requires summing exp(score) over all tokens. When tokens are eliminated, their contribution to the denominator is approximated as exp(AttnUB(Q, Gᵢ)) — a conservative upper bound on their actual contribution. This slightly inflates the denominator (safer than omitting it), and for tokens with AttnUB ≪ θ, the contribution is negligible. A tighter treatment would track the eliminated mass explicitly and bound the resulting normalization error.
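The normalization error can be bounded in a few lines. The derivation below is a sketch under two assumptions: AttnUB is expressed in the same units as the attention logits, and it really does upper-bound each eliminated token's true score. The symbols S (hot plus resurrected tokens), E (eliminated tokens), Z and w are introduced here for illustration.

```latex
\begin{aligned}
Z_S &= \sum_{k \in S} e^{s_k}, \qquad
Z_E = \sum_{i \in E} e^{s_i}, \qquad
\hat{Z}_E = \sum_{i \in E} e^{\mathrm{AttnUB}(Q, G_i)} \;\ge\; Z_E, \\[4pt]
w_j &= \frac{e^{s_j}}{Z_S + Z_E}, \qquad
\hat{w}_j = \frac{e^{s_j}}{Z_S + \hat{Z}_E} \;\le\; w_j \qquad (j \in S), \\[4pt]
\frac{w_j}{\hat{w}_j} &= \frac{Z_S + \hat{Z}_E}{Z_S + Z_E}
\;\le\; 1 + \frac{\hat{Z}_E}{Z_S}
\;<\; 1 + \frac{|E|\, e^{\theta_{\mathrm{elim}}}}{Z_S}.
\end{aligned}
```

Under these assumptions each surviving weight is underestimated by at most the last factor, so the error is small only when the exact mass Z_S is large compared with |E| · exp(θ_elim).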
Concrete example

Walking through one decode step

Consider a model generating token T+1 in a 32K context. The query Q encodes "What was the final conclusion?" — semantically rich with long-range retrieval intent.

Token | Sketch score + uncertainty | Verdict
T31,990 | Exact KV (hot), always attend | ✓ attend
T31,950 | Exact KV (hot), always attend | ✓ attend
T0–T4 | Sink tokens, always kept exact | ✓ attend
T28,100 | Ghost. sketch_sim=0.71, uncertainty=0.12 → bound=0.83 > θ=0.4 | ↑ resurrect
T19,000 | Ghost. sketch_sim=0.08, uncertainty=0.18 → bound=0.26 < θ=0.4 | ✕ eliminated
T11,500 | Ghost. sketch_sim=0.03, uncertainty=0.14 → bound=0.17 < θ=0.4 | ✕ eliminated
T5,200 | Ghost. sketch_sim=0.05, uncertainty=0.20 → bound=0.25 < θ=0.4 | ✕ eliminated
T28,100 had high sketch similarity because it encoded "conclusion" semantically. Tokens T11K–T19K were irrelevant filler. All eliminated tokens: never fetched, never decompressed. Only T28,100 triggers a resurrection DMA from DRAM.
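The ghost-token verdicts in this walkthrough can be reproduced mechanically. The snippet below is a small sketch that treats "uncertainty" as the combined ε_res + σ_anchor margin, as the table does; the `verdict` helper is illustrative.

```python
def verdict(sketch_sim, uncertainty, theta):
    """Return the decision and the upper bound it was based on."""
    bound = sketch_sim + uncertainty
    return ("resurrect" if bound >= theta else "eliminate"), bound

THETA = 0.4
ghosts = {
    "T28,100": (0.71, 0.12),
    "T19,000": (0.08, 0.18),
    "T11,500": (0.03, 0.14),
    "T5,200": (0.05, 0.20),
}
for token, (sim, unc) in ghosts.items():
    decision, bound = verdict(sim, unc, THETA)
    print(f"{token}: bound={bound:.2f} -> {decision}")
```

Only T28,100 clears the threshold; the other three are skipped without any KV fetch.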
The pipeline

Sketch → Bound → Resurrect → Attend

Traditional KV-cache systems load and decompress first, then discover many tokens were irrelevant. GhostKV inverts this order — the filtering happens near memory, before data moves to the GPU.

1

Sketch

Compare query sketch against ghost sketches. Cheap dot products near the memory tier.

2

Bound

Compute safe upper bounds on maximum possible attention contribution per ghost.

3

Resurrect

Fetch and reconstruct only survivors. Everything else stays cold and unloaded.

4

Attend

Run exact attention on a much smaller candidate set — typically 1–5% of context.
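The four stages can be strung together in a toy, single-head decode step. This is a minimal sketch, not the proposed runtime: `fetch_kv` is a hypothetical callback standing in for the resurrection fetch, the per-ghost `margin` lumps ε_res and σ_anchor together, and the eliminated tokens contribute only an upper-bounded term to the softmax denominator.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode_step(q, q_sketch, hot, ghosts, fetch_kv, theta_elim):
    """hot: list of (key, value); ghosts: list of (token_id, sketch, margin)."""
    # Stages 1-2 (sketch, bound): score every ghost without touching its KV.
    survivors, eliminated_mass = [], 0.0
    for token_id, sketch, margin in ghosts:
        bound = dot(q_sketch, sketch) + margin
        if bound < theta_elim:
            eliminated_mass += math.exp(bound)  # upper bound on the skipped term
        else:
            survivors.append(token_id)
    # Stage 3 (resurrect): fetch full KV only for survivors.
    candidates = list(hot) + [fetch_kv(t) for t in survivors]
    # Stage 4 (attend): exact softmax over candidates, shifted for stability.
    scores = [dot(q, key) for key, _ in candidates]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    denom = sum(exps) + eliminated_mass * math.exp(-shift)
    dim = len(candidates[0][1])
    out = [sum(e * value[i] for e, (_, value) in zip(exps, candidates)) / denom
           for i in range(dim)]
    return out, survivors

# Toy data: one hot token, two ghosts; token 7 should be resurrected.
store = {7: ([1.0, 0.0], [10.0, 0.0]), 9: ([0.0, 1.0], [0.0, 5.0])}
fetched = []

def fetch_kv(token_id):
    fetched.append(token_id)  # record which tokens actually moved
    return store[token_id]

hot = [([0.5, 0.5], [1.0, 1.0])]
ghosts = [(7, [0.9], 0.1), (9, [0.0], 0.1)]
out, survivors = decode_step([1.0, 0.0], [1.0], hot, ghosts, fetch_kv, theta_elim=0.4)
print(survivors, fetched)  # [7] [7]
```

Token 9 is never passed to `fetch_kv`, which is the point of the ordering: the decision is made from the sketch alone.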

Reconstruction

Token resurrection

Tokens that survive the elimination bound are "resurrected" — their full or near-full KV is reconstructed before being handed to the attention kernel. Several reconstruction paths can be combined:

Anchor + residual

Rebuild from a shared semantic anchor and a compact per-token residual. Lowest fetch cost; quality bounded by anchor fidelity.

Compressed fragment fetch

Pull a higher-fidelity quantized KV block from DRAM, CXL memory, or SSD. Exact quality, higher latency — pipelined with compute to hide cost.

Low-rank reconstruction

Use a learned low-rank basis (e.g. from SVD-style compression during ghosting) for approximate key/value recovery at intermediate quality.

Predictive prefetch

Scan sketches 1–2 decode steps ahead using predicted query vectors — prefetch likely resurrection candidates before the attention kernel stalls on them.

Latency model: when resurrection is viable

The core performance concern: decode steps are fast and memory-bound. If a resurrected token's DMA fetch isn't hidden, it stalls execution. The viability condition is:

Viability condition
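p_resurrect × N × T_fetch < T_compute
where p_resurrect = fraction of tokens resurrected per step,
N = number of tokens in the context,
T_fetch = effective (pipelined) DMA time per resurrected token; unpipelined access latency is ~80 ns for DRAM and ~150 ns for CXL,
T_compute = attention kernel time for hot+warm tokens.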

At decode batch size = 1, compute time is short (~10–50 μs on A100), making latency hiding difficult. GhostKV's primary target is batch size > 1, where compute per step is proportionally longer. At batch = 8 with a 2% resurrection rate (~2,560 tokens in a 128K context), the resurrected data is 2,560 tokens × 4 KB ≈ 10 MB. At 900 GB/s that transfer takes ~11 μs, which can overlap with attention compute. Note that 900 GB/s is an HBM-class rate; tokens resurrected from host DRAM or CXL are limited by the slower host link, so the transfer takes proportionally longer and depends more heavily on prefetch. Predictive prefetch adds another buffer: scanning sketches for step T+1 while step T's attention runs.

Honest limitation: at batch size = 1 on current hardware, resurrection stalls are a real risk without predictive prefetch or near-memory filtering. A pure software implementation on vanilla A100s benefits most when the resurrection rate stays below ~0.5% of context per step.
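The latency argument is easy to explore numerically. The sketch below models a resurrection as a bandwidth-limited bulk transfer and compares it with the compute window. It takes "~4 KB" as 4,000 B so the total matches the ~10 MB quoted above; the 32 GB/s figure is an assumed host-link bandwidth (roughly PCIe 4.0 x16) added for contrast and does not come from the note. Function names are illustrative.

```python
def transfer_time_us(n_tokens, bytes_per_token, bandwidth_gb_s):
    """Microseconds to move n_tokens records at a sustained bandwidth (1 GB = 1e9 B)."""
    return n_tokens * bytes_per_token / (bandwidth_gb_s * 1e9) * 1e6

def resurrection_hidden(n_context, p_resurrect, bytes_per_token,
                        bandwidth_gb_s, t_compute_us):
    """True if the resurrected tokens can be fetched within the compute window."""
    n_resurrected = round(n_context * p_resurrect)
    return transfer_time_us(n_resurrected, bytes_per_token, bandwidth_gb_s) < t_compute_us

# 2% of a 128,000-token context at ~4 KB per token.
print(round(transfer_time_us(2_560, 4_000, 900), 1))  # 11.4  (HBM-class rate)
print(round(transfer_time_us(2_560, 4_000, 32), 1))   # 320.0 (assumed host link)
print(resurrection_hidden(128_000, 0.02, 4_000, 900, t_compute_us=50))  # True
print(resurrection_hidden(128_000, 0.02, 4_000, 32, t_compute_us=50))   # False
```

Under the assumed host-link rate, the same 2% resurrection rate no longer fits in a 50 μs window. This matches the caveat above that lower resurrection rates or prefetch are needed on current hardware.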

When and where ghostification happens

Converting a token from warm KV to ghost state requires computing its sketch, assigning an anchor cluster, and writing the residual fingerprint. This must not happen on the GPU's critical decode path. Three viable approaches:

Async CPU thread

A background CPU thread processes aging tokens from a warm-to-ghost queue. Low GPU overhead; adds a small memory copy but keeps GPU decode clean.

Prefill phase

Ghostify all tokens beyond a hot window during the prefill pass, before decode begins. No per-step overhead; requires pre-identifying hot vs cold at prompt time.

Near-memory silicon

Offload sketch computation and anchor assignment to a CXL memory controller or Upmem PIM unit — the ideal long-term path but requires custom hardware.
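The async CPU thread option maps naturally onto a producer-consumer queue. The sketch below uses Python's standard `queue` and `threading` modules to show the shape of it. The `ghostify` body is a placeholder (it just truncates the key), and the queue, index, and worker names are hypothetical.

```python
import queue
import threading

def ghostify(token_id, key):
    # Placeholder: a real system would compute a sketch, anchor ID, and residual bound.
    return {"token_id": token_id, "sketch": tuple(key[:2])}

def ghost_worker(warm_queue, ghost_index, lock):
    """Drain the warm-to-ghost queue off the decode path."""
    while True:
        item = warm_queue.get()
        if item is None:            # sentinel: shut the worker down
            warm_queue.task_done()
            break
        token_id, key = item
        record = ghostify(token_id, key)
        with lock:                  # the decode thread may read the index concurrently
            ghost_index[token_id] = record
        warm_queue.task_done()

warm_queue = queue.Queue()
ghost_index = {}
lock = threading.Lock()
worker = threading.Thread(target=ghost_worker,
                          args=(warm_queue, ghost_index, lock), daemon=True)
worker.start()

# The decode loop would enqueue tokens as they age out of the warm tier.
for token_id in range(4):
    warm_queue.put((token_id, [0.1 * token_id] * 8))
warm_queue.put(None)
warm_queue.join()
worker.join()
print(sorted(ghost_index))  # [0, 1, 2, 3]
```

The decode loop only pays for a `put`; the sketching work happens on the background thread.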

Head specialization

Head-aware ghosting

Different attention heads specialize differently. Some heads are local and syntactic — they care only about nearby tokens and decay rapidly after a few hundred positions. Others are retrieval heads that carry long-range semantic dependencies across thousands of tokens.

GhostKV should not ghost all heads equally. Per-head ghost aggressiveness can be determined from profiling or learned during fine-tuning:

Head behavior | Ghost policy | Sketch size
Local / positional | Aggressive ghosting after 512–1K tokens | 8–16 dims
Retrieval-critical | Slower decay; higher-fidelity sketch | 32–64 dims
Dormant / inactive | Convert rapidly to minimal witness state | 4–8 dims
Semantic memory | Anchor-preserving reconstruction path | 16–32 dims
Hardware co-design

Memory-side attention filtering

GhostKV becomes especially compelling when the sketch-and-bound computation happens near memory — before data ever crosses the HBM or PCIe bandwidth bottleneck. Critically, this is not purely a future-hardware story: suitable substrates exist today.

Near-memory filter

A controller co-located with DRAM or CXL memory maintains ghost sketches and returns only candidate token IDs to the GPU — not full tensors. Upmem PIM (Processing-In-Memory) DRAM modules with embedded RISC cores are a near-term prototype target for sketch scoring.

CXL 3.0/3.1 fabric

CXL 3.0/3.1 enables Type-3 memory pooling with device-side compute. A CXL memory expander with sketch-scoring logic could filter ghost candidates before they traverse the fabric — returning token ID lists rather than tensors.

Tier-aware cache

Hot exact KV lives in HBM. Warm quantized KV in DRAM. Ghost records in a tiny HBM-resident index (~6 MB for 100K ghost tokens at 64 B each). Archive state on CXL or SSD.

Semantic memory controller

Memory acts as an active participant in attention — not a passive tensor warehouse. The ghost sketch index is a tiny structure: 100K tokens × 32 dims × 2B = 6.4 MB resident in HBM for sketch scoring on the GPU itself, if near-memory silicon isn't available.

The GPU no longer scans massive cold KV tensors. Memory itself performs semantic candidate elimination.
Differentiation

How GhostKV differs from prior work

Several important systems address KV-cache efficiency. GhostKV is most closely related to eviction-based and sparse-attention methods, but differs in a fundamental way: it does not discard or evict tokens — it converts them into reconstructable witnesses.

System | Core idea | What happens to cold tokens | GhostKV difference
StreamingLLM | Keep only sink + recent window | Permanently evicted; cannot resurrect | Ghost tokens stay reconstructable
H2O | Heavy-Hitter Oracle: keep top-K attention tokens | Low-attention tokens evicted | GhostKV keeps witnesses; can resurrect if query changes
SnapKV | Prompt-time selection of important KV positions | Unselected positions dropped at prefill | GhostKV is query-adaptive: "importance" is re-evaluated per decode step
PyramidKV | Layer-wise budget allocation; fewer KV at higher layers | Layer-specific eviction | GhostKV acts within layers; head-aware, query-conditioned
KV quantization (KIVI, WKVQuant) | Store all KV at lower precision | Stored, decompressed every step | GhostKV skips decompression entirely for eliminated tokens
The key claim: eviction-based methods permanently discard information based on historical importance. GhostKV preserves reconstructability and makes elimination decisions at query time — a token irrelevant to query Q may be essential for Q+1.
Impact

Why this matters beyond compression

Compression reduces capacity pressure. GhostKV reduces movement pressure. These are different problems with different bottlenecks.

Problem | GhostKV attack vector
HBM bandwidth saturation | Avoid loading cold KV for tokens that fail sketch bounds
Decompression stalls | Eliminated tokens are never decompressed at all
Long-context latency scaling | Ghost state grows the cache without growing bandwidth cost proportionally
Multi-user cache contention | Ghost records are tiny; less HBM pollution per user's cold history
Persistent agent memory | Long multi-turn histories stored as reconstructable semantic state, not raw tensors
Heterogeneous memory tiers | Tiered residency: sketches in HBM, reconstruction sources in DRAM/CXL/SSD
Asymptotic framing

Complexity reduction

The bandwidth savings can be expressed compactly. Let N = total context tokens, d = KV dimension, q = quantization factor, H = hot tokens, R = resurrected tokens per step. The decode memory traffic scales as:

System | Decode memory traffic | What moves
Standard attention | O(N · d) | All KV tensors every step
Quantized KV (INT4) | O(N · d/q) | All KV at lower precision; still loads everything
Eviction (H2O, SnapKV) | O(K · d) | Top-K retained tokens; the rest is discarded permanently
GhostKV | O(H · d + R · d) | Hot tokens exact + resurrected survivors only

Where R ≪ N is the key. If 85% of tokens are cold and 2% of those are resurrected per step, then R ≈ 0.017N. GhostKV's traffic approaches O(H · d) in the best case — linear in the hot window, not the full context.

The difference from eviction: eviction methods also achieve O(K · d), but K is a fixed budget chosen at prefill time. GhostKV's R is query-adaptive — it shrinks when the query is specific and grows when it's broad. And unlike eviction, the N − H − R tokens are still reconstructable: they haven't been discarded, just deferred.
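The scaling comparison can be turned into a small traffic model. The sketch below counts elements moved per decode step, up to constant factors, for the four systems. The optional `ghosts * d_sketch` term is an addition here, not one of the note's stated expressions: it accounts for reading the sketches themselves and defaults to zero. The budget of 512 is an assumed example value.

```python
def decode_traffic(n, d, hot, resurrected, q=4, k_budget=512, ghosts=0, d_sketch=0):
    """Elements moved per decode step for each system, up to constant factors."""
    return {
        "standard": n * d,
        "quantized": n * d / q,
        "eviction": k_budget * d,
        # Last term (sketch reads) is an added assumption; 0 by default.
        "ghostkv": hot * d + resurrected * d + ghosts * d_sketch,
    }

# 128,000 tokens, 85% cold, 2% of cold tokens resurrected: R = 0.017 * N.
n, d = 128_000, 128
hot, resurrected = 19_200, 2_176
traffic = decode_traffic(n, d, hot, resurrected)
print(round(traffic["ghostkv"] / traffic["standard"], 3))  # 0.167

# Same setting, now charging for a 32-dim sketch read per ghost token.
with_sketches = decode_traffic(n, d, hot, resurrected,
                               ghosts=n - hot, d_sketch=32)
print(round(with_sketches["ghostkv"] / with_sketches["standard"], 3))
```

With the sketch term included, the hot window still dominates in this example, but the sketch reads are no longer negligible once N grows much larger than H.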
Is GhostKV approximate attention? Only at the elimination stage. GhostKV computes exact attention over the surviving candidate set (hot tokens + resurrected tokens). The approximation lies in what is left out: eliminated tokens contribute no value vectors, and their share of the softmax denominator is replaced by an upper bound. The sketch bound may conservatively include some low-relevance tokens (false resurrections), but it is designed not to exclude tokens that would have meaningfully contributed, so the output should differ from full attention only by the attention mass of the eliminated tokens.
Threats to validity

Limitations & honest failure modes

A strong concept note acknowledges where it could be wrong. GhostKV has four principal threats:

⚠ Sketch collision rate

Unrelated tokens may project similarly in sketch space, triggering false resurrections that load unnecessary KV. The JL lemma bounds collision probability for random projections, but at high token volumes (100K+), even low collision rates accumulate. What's the empirical false-positive rate on real attention distributions?

⚠ Anchor drift

Semantic anchors are assigned at ghostification time. If the model's attention geometry shifts mid-generation (common in multi-topic documents), old anchors become stale — increasing σ_anchor and therefore the uncertainty term. This pushes more tokens toward resurrection (conservative but costly) or, if thresholds aren't recalibrated, toward incorrect elimination. Anchor refresh strategies are an open problem.

⚠ Hardware assumptions

The near-memory filtering argument requires compute co-located with DRAM or CXL. On vanilla A100s without Upmem PIM or CXL 3.x expanders, sketch scoring runs on the GPU itself. This is still viable (6.4 MB sketch index fits in L2 cache), but the bandwidth savings are lower because sketch data still crosses HBM, just at much lower volume than full KV.

⚠ Worst-case quality degradation

The elimination bound is conservative in theory, but in practice depends on how well the sketch captures the key distribution. If a critical token has an atypically low sketch norm — e.g. after layer normalization or head-specific scaling — its sketch_sim will underestimate its true attention score, and it may be incorrectly eliminated. Worst-case perplexity delta under aggressive ghosting needs empirical measurement across diverse tasks.

Validation roadmap

Simulation plan

GhostKV is pre-experimental. Here is the exact sequence of experiments needed to validate or falsify its core claims — ordered from cheapest to most involved.

1

Sketch quality audit ~1–2 days, no training

Goal: Measure how well random projections preserve key-query similarity in practice.

Model: Llama-3-8B or Mistral-7B-Instruct
Data: GovReport (long documents), SCROLLS/QMSum
Method: Run forward pass, extract all K tensors, compute 32-dim random Gaussian projections, measure cosine-sim rank preservation vs exact dot product
Metric: False-elimination rate, i.e. the fraction of tokens with sketch_sim < θ but true attn score > θ
Pass criteria: false elimination rate <0.5% at θ that eliminates >70% of cold tokens.
2

θ-vs-quality curve ~3–5 days

Goal: Map the elimination threshold tradeoff space empirically.

Model: Llama-3-8B
Datasets: LongBench (multi-task), Needle-in-a-Haystack (retrieval stress test), GovReport (summarization)
Compare against: full exact KV · INT4 quantized KV (KIVI) · SnapKV (budget=512) · H2O (heavy hitter, budget=512)
Sweep: θ ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} with sketch dim ∈ {16, 32, 64}
Metrics:
  • Perplexity delta vs exact KV (primary quality signal)
  • Elimination rate (% of ghost tokens skipped per step)
  • Resurrection rate (% of ghost tokens fetched per step)
  • LongBench task score delta (F1, Rouge-L, accuracy)
3

Latency benchmark ~1 week, requires implementation

Goal: Verify that sketch scoring + selective resurrection is faster than loading all KV, across batch sizes.

Hardware: A100-80GB (baseline), H100 SXM (stretch goal)
Implementation: Async CPU ghostification thread + GPU sketch scoring (sketch index resident in L2/HBM) + selective DMA resurrection
Batch sizes: 1, 4, 8, 16 (resurrection latency hiding improves with larger batch)
Context lengths: 32K, 64K, 128K
Metrics:
  • p50 / p95 decode step latency (ms)
  • HBM bandwidth utilization (GB/s, via Nsight profiling with NVTX ranges)
  • Throughput (tokens/sec) vs FlashAttention-2 baseline
Pass criteria: p95 decode latency ≤ full-KV baseline at batch ≥ 4, with <1pp perplexity delta at θ achieving ≥60% cold elimination.
Why this order matters: Step 1 takes a day or two and either validates the sketch approach or kills it early. Step 2 maps the tradeoff space before any real implementation. Step 3 is where real engineering effort begins, and it is only worth it if steps 1 and 2 look promising.
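Step 1's measurement loop can be prototyped on synthetic data before any model is involved. The sketch below uses random unit vectors as stand-ins for real key and query tensors, so its printed numbers are not results and say nothing about real attention distributions. The `audit` helper, its `margin` parameter, and the dimensions are assumptions for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def gaussian_projection(d_in, d_out, rng):
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)] for _ in range(d_out)]

def project(proj, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in proj]

def audit(keys, query, proj, theta, margin=0.0):
    """Return (elimination_rate, false_elimination_rate) for one query."""
    q_sketch = project(proj, query)
    eliminated = false_elims = 0
    for key in keys:
        bound = dot(q_sketch, project(proj, key)) + margin
        if bound < theta:
            eliminated += 1
            if dot(query, key) > theta:  # true score says the token mattered
                false_elims += 1
    return eliminated / len(keys), false_elims / len(keys)

# Synthetic stand-ins for extracted K tensors and one query.
rng = random.Random(0)
d, d_sketch = 64, 32
keys = [unit([rng.gauss(0, 1) for _ in range(d)]) for _ in range(500)]
query = unit([rng.gauss(0, 1) for _ in range(d)])
proj = gaussian_projection(d, d_sketch, rng)

elim, false_elim = audit(keys, query, proj, theta=0.4)
print(f"eliminated={elim:.1%}, false eliminations={false_elim:.1%}")
```

For the real audit, `keys` and `query` would come from a forward pass, and the sweep over θ and `margin` would be compared against the pass criterion in step 1.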
Open problems

Research questions worth pursuing

GhostKV is a pre-experimental concept. These are the hard questions that determine whether it works in practice:

Q1 — How tight can elimination bounds become?

The bound must be conservative (never wrongly eliminate) while tight enough to be useful. Loose bounds eliminate fewer tokens; tight bounds risk quality degradation. What sketch dimensionality and uncertainty model achieves the best tradeoff?

Q2 — Should sketches be learned or random?

Random Gaussian projections come with high-probability distortion guarantees (JL lemma) but are agnostic to the model's attention geometry. Model-aware sketches trained jointly with the transformer could tighten bounds substantially. What's the training overhead?

Q3 — Which layers and heads form durable semantic anchors?

Anchor stability is central to reconstruction quality. If anchors drift as context grows, reconstruction error accumulates. Understanding per-layer, per-head anchor stability determines where aggressive ghosting is safe.

Q4 — Can resurrection DMA reliably overlap with decode compute?

The performance case depends on hiding resurrection latency behind prefill or attention compute. Is this feasible on current GPU/CPU memory hierarchies? What software pipelining is required?

Q5 — How does GhostKV interact with speculative decoding?

Speculative decoding generates multiple candidate tokens in parallel. Ghost elimination thresholds may need to account for the full set of candidate queries, not just the greedy next token.

Q6 — What minimal near-memory silicon is sufficient?

For the hardware co-design path: what is the minimum compute required at the memory controller to perform sketch scoring? A few multiply-accumulates per ghost record — is this feasible in a CXL or HBM PHY layer?

Anticipated objections

The main criticism — and the answer

"Sketch scoring itself will become bandwidth-bound at million-token scale." If you have 1M ghost tokens and must score each one against the query sketch every decode step, you're reading 1M × 64B = 64 MB of sketch data per step — which at 900 GB/s HBM takes ~71μs. That's not free.

This is the right criticism, and it has three answers:

01

Sketches are tiny

64 MB for 1M ghost tokens at 64 B each vs ~4 GB for 1M full FP16 tokens at ~4 KB each. Even if sketch scoring isn't free, the bandwidth cost is ~62.5× lower than loading full KV. The absolute number matters more than the asymptotic one.

02

Hierarchical sketch trees

At million-token scale, ghost records can be organized into a hierarchical index: anchor-level sketches first, then per-token sketches only for survivors of the coarse filter. When the coarse filter prunes most anchors, the sketch scoring phase becomes sublinear in N (O(log N) in the best case for a balanced tree) instead of O(N).

03

Near-memory filtering

With Upmem PIM or CXL-side compute, sketch scoring happens at the memory controller — the sketch data never crosses the GPU's memory bus at all. The GPU receives only a list of token IDs to resurrect.
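The hierarchical idea in the second answer can be sketched as a two-level filter. The construction below is one possible design, not the note's specification: each anchor stores the centroid of its members' sketches plus a radius covering them. By the Cauchy-Schwarz inequality, q·s ≤ q·c + |q|·radius for every member, so an anchor whose coarse bound is below θ can be skipped without scoring any of its tokens. All names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def make_cluster(members):
    """members: list of (token_id, sketch, margin). Adds centroid and radius."""
    sketches = [s for _, s, _ in members]
    dim = len(sketches[0])
    centroid = [sum(s[i] for s in sketches) / len(sketches) for i in range(dim)]
    radius = max(norm([x - c for x, c in zip(s, centroid)]) for s in sketches)
    return {"centroid": centroid, "radius": radius, "members": members}

def two_level_filter(q_sketch, clusters, theta):
    """Return (surviving token ids, number of per-token sketches scored)."""
    survivors, scored = [], 0
    q_norm = norm(q_sketch)
    for cluster in clusters:
        max_margin = max(m for _, _, m in cluster["members"])
        # Coarse bound: no member's per-token bound can exceed this value.
        coarse = dot(q_sketch, cluster["centroid"]) + q_norm * cluster["radius"] + max_margin
        if coarse < theta:
            continue  # whole anchor eliminated without per-token scoring
        for token_id, sketch, margin in cluster["members"]:
            scored += 1
            if dot(q_sketch, sketch) + margin >= theta:
                survivors.append(token_id)
    return survivors, scored

near = make_cluster([(1, [0.9, 0.1], 0.05), (2, [0.8, 0.2], 0.05)])
far = make_cluster([(3, [-0.9, 0.0], 0.05), (4, [-0.8, -0.1], 0.05)])
survivors, scored = two_level_filter([1.0, 0.0], [near, far], theta=0.4)
print(survivors, scored)  # [1, 2] 2
```

Because the coarse bound is never smaller than any member's own bound, the two-level filter returns the same survivors as a flat scan while reading fewer per-token sketches.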

A secondary criticism is that the system adds engineering complexity — a lifecycle manager, a sketch index, async ghostification, and a resurrection DMA pipeline — on top of an already complex inference stack. This is fair. GhostKV is not a one-line patch. The bet is that the bandwidth savings at long-context scale justify the complexity, the same way FlashAttention's tile scheduling justified its complexity over naive attention.

Closing framing

A different mental model for AI memory

GhostKV suggests that future inference memory systems may stop treating all stored state as raw tensors waiting to be loaded. Instead, cold memory becomes probabilistic, reconstructable, and query-conditioned:

  • The exact tensor is no longer always the primary object in the cache.
  • Cold tokens store enough to make an elimination decision — not enough to participate in attention directly.
  • Reconstruction is expensive and rare — elimination is cheap and common.
  • Memory behaves like a smart index, not a passive buffer.
GhostKV transforms cold KV-cache entries into semantic attention witnesses — reconstructing exact KV only when a query proves a token could meaningfully matter.

Whether this pans out empirically depends on the tightness of elimination bounds, anchor stability, and whether resurrection latency can be hidden. These are solvable engineering problems. The conceptual shift — from "compress and load all" to "prove irrelevance before loading" — seems worth exploring.

In the extreme limit, future transformers may never persist exact KV at all — only reconstructable semantic state trajectories.
Context

Related areas

GhostKV sits at the intersection of several active research threads. Prior work it builds on or connects to:

  • KV-cache compression: KIVI, WKVQuant, and KVSharer explore quantization and sharing of KV tensors.
  • Sparse/selective attention: H2O, SnapKV, PyramidKV, and ScissorHands develop eviction policies for KV budgets.
  • Approximate attention: Reformer (LSH attention), Longformer, BigBird, and Scatterbrain use structured sparsity.
  • Approximate nearest-neighbor search: LSH, HNSW, and FAISS form the basis for sketch-based similarity estimation.
  • Memory-side compute: PIM (Processing-in-Memory), CXL smart memory, and near-data processing are relevant hardware directions.
  • Semantic compression: Memorizing Transformers and MemGPT treat long-horizon memory as a retrieval problem.

Why GhostKV is not ANN retrieval. A natural confusion: GhostKV uses sketch-based similarity scoring — doesn't that make it approximate nearest-neighbor search over KV entries? The distinction matters. ANN retrieval (FAISS, HNSW) finds semantically similar embeddings globally across a corpus, optimizing for recall over a large index. GhostKV performs bounded elimination inside the transformer attention pipeline itself — per layer, per head, per decode step — preserving autoregressive semantics, the per-head attention structure, and the softmax normalization contract. It does not retrieve; it eliminates. The query is not a search query; it is the transformer's attention query vector. The "index" is not a vector database; it is the KV cache for a single sequence. These are different problems with different correctness requirements.

The distinctive claim in GhostKV: combining sketch-based elimination, conservative upper bounds, and lazy reconstruction into a single lifecycle that avoids memory movement for cold tokens rather than just reducing their stored size.