Intent-aware KV execution for agentic long-context inference — what if the runtime tells the attention kernel which KV regions matter, instead of pretending every block is equally useful?
The missing interface is not just between model and kernel. It is between runtime intent and KV execution. When an agentic orchestrator builds a long context out of system prompts, retrieved documents, tool outputs, scratchpads, and recent conversation turns, it knows which regions are critical, which are optional, and which are filler. Today, the attention layer receives none of that signal. It processes every KV block with the same machinery, the same precision, and the same priority.
The Intent Attention Kernel project explores what changes when that structural information is exposed to the execution layer. It is a CPU-first research prototype — no GPU speedups are claimed, no production kernel is provided. But the ideas it prototypes point toward a different architecture for KV execution in the long-context regime.
Agentic workloads construct structurally heterogeneous context windows. A typical agentic prompt might contain a system prompt, summary memory, multiple retrieved documents with varying relevance, tool output, a scratchpad region, and recent conversation turns. These regions have different attention characteristics, different precision requirements, and different cache behavior.
Not all blocks deserve equal treatment. System prompts are always critical. Recent context is usually high priority. Retrieved documents may be relevant or irrelevant. Old tool output may be optional. Scratchpads may be compressible or skippable. Generic dense attention treats all of them as one undifferentiated KV stream.
Traditional attention asks: how do we compute attention faster? Intent Attention Kernel asks: what if we do not compute certain regions at all? The project represents each semantic region as a block with an explicit attention policy — ALWAYS, ATTEND (conditional on score), SKIP, RECENT (sliding window), or GLOBAL. The runtime can then select, score, quantize, and schedule KV blocks before they reach the attention kernel.
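```python
# Describe each context region as a named token range with a policy and an optional score.
layout = BlockLayout([
    SemanticBlock("system_prompt", 0, 512, BlockPolicy.ALWAYS),
    SemanticBlock("retrieved_doc_0", 512, 4096, BlockPolicy.ATTEND, score=0.82),
    SemanticBlock("retrieved_doc_1", 4096, 8192, BlockPolicy.SKIP, score=0.21),
    SemanticBlock("recent_context", 8192, 12288, BlockPolicy.RECENT),
])

# Attend only over the selected regions; return_debug also returns debug information.
out, debug = semantic_block_attention(q, k, v, layout, return_debug=True)
```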
The core idea is minimal: add a metadata surface between the runtime and the attention kernel that describes what each block of KV data is, how important it is, and what policy should apply to it. The project enumerates five block policies: ALWAYS, ATTEND (conditional on score), SKIP, RECENT (sliding window), and GLOBAL.
The key design principle: do not compute and then mask. Expose structure early enough to avoid the work entirely. A selected-block attention path gathers only the K/V tokens corresponding to non-SKIP blocks and computes attention only over those regions. This is what the CPU reference implementation demonstrates.
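The difference between gathering first and masking afterwards can be made concrete with a small sketch. This is not the project's PyTorch reference: it is a pure-Python, single-query, single-head illustration, and the helper names (`gather_then_attend`, `compute_then_mask`) and the `(start, end, policy)` tuple layout are assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def gather_then_attend(q, keys, values, blocks):
    """Gather tokens of non-SKIP blocks first, then attend only over them."""
    kept = [t for start, end, policy in blocks if policy != "SKIP"
            for t in range(start, end)]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, keys[t]) for t in kept]  # len(kept) dot products
    return weighted_sum(softmax(scores), [values[t] for t in kept]), kept

def compute_then_mask(q, keys, values, kept):
    """Baseline: score every token, then mask the skipped ones to -inf."""
    scale = 1.0 / math.sqrt(len(q))
    keep = set(kept)
    scores = [scale * dot(q, k) for k in keys]  # len(keys) dot products
    scores = [s if t in keep else -math.inf for t, s in enumerate(scores)]
    return weighted_sum(softmax(scores), values)

# Six tokens, head_dim 2; the middle block is skipped (illustrative data).
blocks = [(0, 2, "ALWAYS"), (2, 4, "SKIP"), (4, 6, "RECENT")]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.5, 0.5], [0.2, -0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
q = [0.3, 0.7]

out_gathered, kept = gather_then_attend(q, keys, values, blocks)
out_masked = compute_then_mask(q, keys, values, kept)
```

Both paths produce the same output, but the gathered path never computes scores for the skipped tokens. That avoided work is the point of exposing block structure before the kernel runs.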
The project explores five complementary mechanisms for exposing runtime intent to the KV execution layer:
The runtime marks regions of context with policies. The CPU reference gathers selected K/V tokens and computes attention only over selected regions. This is the core interface between runtime intent and the attention layer.
A PyTorch CPU oracle checks correctness for non-causal selected-block attention by comparing selected-KV output against dense attention over the same selected K/V tensors.
A lightweight heuristic scores candidate blocks using query-to-block cosine similarity. It is not a trained router; it is a prototype control-plane signal that stands in for the kind of score a future runtime could supply.
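A minimal sketch of such a heuristic follows. The source only says the score is a query-to-block cosine similarity; summarising each block by the mean of its key vectors, the `THRESHOLD` value, and every function name here are assumptions for illustration.

```python
import math

def cosine(a, b):
    d = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return d / (na * nb) if na > 0 and nb > 0 else 0.0

def block_mean_key(keys, start, end):
    """Assumed block summary: the mean key vector over tokens [start, end)."""
    n = end - start
    dim = len(keys[0])
    return [sum(keys[t][d] for t in range(start, end)) / n for d in range(dim)]

def score_blocks(query, keys, blocks):
    """Return {block_name: cosine(query, block summary)}."""
    return {name: cosine(query, block_mean_key(keys, start, end))
            for name, start, end in blocks}

# Two tiny candidate blocks; doc_a points roughly along the query, doc_b does not.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
blocks = [("doc_a", 0, 2), ("doc_b", 2, 4)]
query = [1.0, 0.1]

scores = score_blocks(query, keys, blocks)

THRESHOLD = 0.5  # hypothetical cut-off, not a value taken from the project
policies = {name: ("ATTEND" if s >= THRESHOLD else "SKIP")
            for name, s in scores.items()}
```

The resulting scores feed the same role as the `score=` field on a semantic block: they decide whether an ATTEND-style region is gathered at all.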
Not every KV block deserves the same precision. Critical regions stay FP16/FP8; lower-score or colder blocks use INT8, INT4, INT4_RESIDUAL, or SKIP. No accuracy or perplexity preservation is claimed.
Adjacent decode steps may reuse similar KV regions, so a prefetcher predicts the pages likely to be needed at the next step. The current benchmark simulates hit rate only. Prefetch must never affect correctness.
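A hit-rate simulation of this kind can be sketched in a few lines. The predictor below ("the next step touches the pages the current step touched") and the page trace are assumptions chosen for illustration, not the project's prefetcher.

```python
def simulate_prefetch(page_trace):
    """page_trace: one set of page ids per decode step.

    Assumed predictor: the next step will touch exactly the pages the
    current step touched. Returns the fraction of accessed pages that
    had been predicted.
    """
    hits = 0
    total = 0
    for prev, cur in zip(page_trace, page_trace[1:]):
        predicted = set(prev)
        hits += len(cur & predicted)
        total += len(cur)  # every page in `cur` is still read, predicted or not
    return hits / total if total else 0.0

# Illustrative trace: pages drift slowly, then shift at the last step.
trace = [{0, 1, 2}, {0, 1, 3}, {0, 1, 3}, {0, 4, 5}]
hit_rate = simulate_prefetch(trace)
```

Note that the prediction never changes which pages a step reads. A miss only means a page was not staged early, which is how prefetch stays a latency hint rather than a correctness input.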
Uniform KV quantization applies the same precision across all tokens regardless of their role in the context. IntentQuant-KV treats precision as an execution policy: critical blocks stay high precision, cold blocks get compressed, skipped blocks contribute zero bytes.
| Precision | Where It Applies |
|---|---|
| FP16 | Critical blocks — system prompts, global memory, recent context |
| FP8 | Important selected blocks — high-scoring attended documents |
| INT8 | Relevant but non-critical blocks — moderate-score attended regions |
| INT4_RESIDUAL | Medium-score blocks — useful but lower confidence; residual correction |
| INT4 | Cold or low-score blocks — old tool outputs, low-relevance documents |
| SKIP | Skipped blocks — zero KV bytes contributed |
```python
quantizer = IntentQuantizer(memory_pressure=0.7)
policies = quantizer.assign_layout_precision(layout)
summary = quantizer.summary_table(layout, heads=32, head_dim=128)
```
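The byte accounting behind a summary like this can be sketched as follows. Everything here is an illustrative assumption rather than the project's `IntentQuantizer`: the score thresholds, the bit width charged for INT4_RESIDUAL, and the function names are hypothetical, and memory pressure is ignored for simplicity.

```python
# Bits per stored element for each tier. The INT4_RESIDUAL figure is a
# placeholder assumption (4-bit base plus some residual overhead).
BITS = {"FP16": 16, "FP8": 8, "INT8": 8, "INT4_RESIDUAL": 6, "INT4": 4, "SKIP": 0}

def assign_precision(policy, score=None):
    """Hypothetical mapping from block policy and score to a precision tier."""
    if policy == "SKIP":
        return "SKIP"
    if policy in ("ALWAYS", "GLOBAL", "RECENT"):
        return "FP16"
    # ATTEND blocks: thresholds below are made up for this sketch.
    if score >= 0.8:
        return "FP8"
    if score >= 0.6:
        return "INT8"
    if score >= 0.4:
        return "INT4_RESIDUAL"
    return "INT4"

def kv_bytes(tokens, heads, head_dim, bits):
    """Bytes for K and V of one layer: 2 tensors * tokens * heads * head_dim."""
    return 2 * tokens * heads * head_dim * bits // 8

# (name, start, end, policy, score), mirroring the layout shown earlier.
layout = [
    ("system_prompt", 0, 512, "ALWAYS", None),
    ("retrieved_doc_0", 512, 4096, "ATTEND", 0.82),
    ("retrieved_doc_1", 4096, 8192, "SKIP", 0.21),
    ("recent_context", 8192, 12288, "RECENT", None),
]
heads, head_dim = 32, 128

tiers = {name: assign_precision(policy, score)
         for name, _s, _e, policy, score in layout}
dense_bytes = kv_bytes(12288, heads, head_dim, BITS["FP16"])
intent_bytes = sum(kv_bytes(end - start, heads, head_dim, BITS[tiers[name]])
                   for name, start, end, _p, _sc in layout)
fraction_kept = intent_bytes / dense_bytes
```

Under these assumed thresholds the layout keeps roughly 52% of the dense FP16 KV bytes per layer. This is a byte model only and says nothing about measured bandwidth or model quality.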
No production quantization kernel exists in this project. No model-quality or perplexity validation has been performed. The quantizer uses fake quantization only. Dequant overhead may dominate any bandwidth savings. Real benefit depends on bandwidth pressure, page reuse, hardware support, and attention fusion — none of which are measured here.
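For readers unfamiliar with the term, fake quantization means rounding values to a low-precision grid and immediately converting them back to floats, so reconstruction error can be measured without a real packed-integer kernel. The sketch below uses symmetric per-block scaling, which is a common scheme assumed here and not necessarily what the project does.

```python
def fake_quantize(values, bits):
    """Symmetric quantize-then-dequantize; returns floats, not packed ints."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / qmax                    # one scale for the whole block
    out = []
    for v in values:
        q = round(v / scale)               # nearest grid point
        q = max(-qmax, min(qmax, q))       # clamp to the representable range
        out.append(q * scale)              # dequantize back to float
    return out

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

block = [0.0, 0.1, -0.25, 0.5, -1.0, 0.75]  # illustrative K or V values
err_int8 = max_abs_error(block, fake_quantize(block, 8))
err_int4 = max_abs_error(block, fake_quantize(block, 4))
```

With round-to-nearest, the per-element error is bounded by half the scale, so the 4-bit error is much larger than the 8-bit error for the same block. Reconstruction metrics of this kind are what a fake-quantization benchmark can report; they are not a substitute for perplexity validation.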
The architecture separates CPU-first simulation from a future GPU kernel path, with metadata-driven selection at the center. The CPU layer provides a PyTorch reference for dense and selected-block attention, an analytical cost model for FLOP and KV-byte savings, a synthetic trace generator for deterministic agentic layouts, six benchmark scripts, and a pytest suite with 74+ tests.
The future GPU kernel path is sketched but not implemented: Triton selected-block attention that iterates only over selected pages, per-page dequantization for INT8 mixed-precision attention, a paged KV block table with physical page indirection, query-position-aware masking for correct causal selected-block attention, and prefetch or staging hints for latency hiding.
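To make "a paged KV block table with physical page indirection" concrete, here is a small CPU-side sketch. It is not the unimplemented Triton path; the page size, the free-list allocation, and the function names are assumptions for illustration.

```python
PAGE_SIZE = 4  # tokens per physical page; tiny on purpose

def build_block_table(blocks, page_size, free_pages):
    """Map each logical block to the physical pages that hold its KV tokens."""
    table = {}
    for name, start, end, _policy in blocks:
        n_pages = -(-(end - start) // page_size)  # ceiling division
        table[name] = [free_pages.pop() for _ in range(n_pages)]
    return table

def selected_pages(blocks, table):
    """Physical pages a selected-block kernel would visit, in logical order."""
    return [page
            for name, _start, _end, policy in blocks
            if policy != "SKIP"
            for page in table[name]]

blocks = [
    ("system_prompt", 0, 8, "ALWAYS"),
    ("retrieved_doc", 8, 20, "SKIP"),
    ("recent_context", 20, 24, "RECENT"),
]
free_pages = list(range(10))  # physical page ids handed out from the end
block_table = build_block_table(blocks, PAGE_SIZE, free_pages)
visit_order = selected_pages(blocks, block_table)
```

The kernel-facing artifact is `visit_order`: a list of physical page ids that need not be contiguous or ordered in memory. Pages belonging to skipped blocks never appear in it, so they are never loaded.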
Agentic Runtime → Semantic Blocks → Policy Selection → Dynamic Scoring → Paged KV Table → IntentQuant-KV → Prefetch Prediction → Selected-KV Attention → Future GPU Kernel
How different approaches compare in terms of what they know, what work they avoid, and where the limits lie:
| Approach | What It Knows | Work Avoided | Limitation |
|---|---|---|---|
| Dense attention | Flat token stream | None | Treats all context equally |
| Masked attention | Token/block mask | Often limited | Late-binding; mask applied post-QK |
| Selected-block attention | Block bounds + policies | CPU gather over selected K/V | Needs causal / query-position support |
| Intent-aware KV execution | Policy + score + quant + prefetch | Analytical / simulated today | Hardware validation; complex interface |
Non-causal selected-block attention is implemented and verified in the CPU reference. Causal selected-block attention is intentionally not implemented: selected KV tokens retain their original context positions, so a correct causal implementation requires, at minimum, query-position-aware masking against those original positions.
Calling semantic_block_attention(..., causal=True) currently raises NotImplementedError. This is the hardest open technical problem in the project.
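The difficulty is easy to see in a toy example. Once skipped blocks are removed, the index of a key in the gathered tensor no longer equals its position in the context, so a causal mask built from compacted indices is wrong. The sketch below is an illustration of that pitfall under assumed names and data, not a proposed implementation for the project.

```python
def kept_positions(blocks):
    """Original context positions of tokens that survive selection."""
    return [t for start, end, policy in blocks if policy != "SKIP"
            for t in range(start, end)]

def position_aware_mask(query_positions, key_positions):
    """mask[i][j] is True when query i may attend gathered key j."""
    return [[kp <= qp for kp in key_positions] for qp in query_positions]

def naive_compacted_mask(query_positions, n_keys):
    """Incorrect: compares original query positions with compacted key indices."""
    return [[j <= qp for j in range(n_keys)] for qp in query_positions]

blocks = [(0, 2, "ALWAYS"), (2, 4, "SKIP"), (4, 6, "RECENT")]
key_pos = kept_positions(blocks)         # original positions of gathered keys
queries = [1, 3, 5]                      # original query positions

correct = position_aware_mask(queries, key_pos)
naive = naive_compacted_mask(queries, len(key_pos))
```

For the query at original position 3, the position-aware mask allows only the keys at positions 0 and 1, while the naive mask also admits the keys at positions 4 and 5, which lie in the future. Carrying original positions alongside the gathered K/V is what a correct causal path has to do.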
Every benchmark in the repository is explicit about what it does not prove. This is by design: the project is a CPU-first prototype, and conflating CPU measurement with GPU prediction would be misleading.
| Benchmark | What It Validates | What It Does Not Prove |
|---|---|---|
| bench_cost_model | Analytical FLOP and KV-byte savings | Real GPU speedup |
| bench_cpu_reference | Correctness and sanity | Fused kernel performance |
| bench_intent_quant | Byte model and reconstruction metrics | Model quality or perplexity |
| bench_prefetch | Simulated hit rate | Real latency hiding |
| bench_dynamic_scoring | Synthetic scoring behavior | Routing quality |
| bench_kv_quant | KV byte savings model | Real memory bandwidth reduction |
The project is organized into seven phases, each scoped to produce a concrete artifact before the next begins.
The project maintains explicit, documented boundaries around what it does and does not assert. These are worth reproducing because they define the scope honestly:
CPU Ratio (dense_time / semantic_time) is explicitly documented as not a GPU speedup claim. It is affected by PyTorch dispatch overhead, gather overhead, cache behavior, tensor size, and small-batch effects. The analytical KV and FLOP savings models document the theoretical upper bound of work avoidance, not measured GPU performance.
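An analytical model of this kind fits in a few lines. The FLOP count below uses the common approximation of 4 * q_len * kv_len * head_dim per head (two multiply-add passes, one for QK^T and one for the weighted sum over V); the function names and the example timings are assumptions, and the real `bench_cost_model` script may count differently.

```python
def attention_flops(q_len, kv_len, heads, head_dim):
    """Approximate FLOPs for QK^T plus the weighted sum over V."""
    return 4 * q_len * kv_len * heads * head_dim

def savings(dense, selected):
    """Fraction of dense work avoided; an upper bound, not a measured speedup."""
    return 1.0 - selected / dense

def cpu_ratio(dense_time, semantic_time):
    """dense_time / semantic_time, as defined in the text above."""
    return dense_time / semantic_time

# One decode step (q_len=1) over the 12288-token layout shown earlier,
# where the 4096-token SKIP block is dropped.
heads, head_dim = 32, 128
dense_flops = attention_flops(1, 12288, heads, head_dim)
selected_flops = attention_flops(1, 12288 - 4096, heads, head_dim)
flop_savings = savings(dense_flops, selected_flops)

# Placeholder timings purely to show the ratio's definition.
example_ratio = cpu_ratio(dense_time=2.0, semantic_time=1.6)
```

For this layout the model reports one third of the attention FLOPs avoided. A measured CPU ratio can land well below or above what that fraction suggests because of the dispatch, gather, and cache effects listed above.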
```bash
pip install -e ".[dev]"
python -m py_compile src/intent_attention/*.py
pytest -q
python benchmarks/bench_cost_model.py
python benchmarks/bench_cpu_reference.py
python benchmarks/bench_kv_quant.py
python benchmarks/bench_prefetch.py
python benchmarks/bench_dynamic_scoring.py
python benchmarks/bench_intent_quant.py
```