MAN\SH AI / Writings

AI Infrastructure & Kernel Design · ~3,200 words

Research Prototype — CPU-First KV Execution

Intent Attention Kernel

Intent-aware KV execution for agentic long-context inference — what if the runtime tells the attention kernel which KV regions matter, instead of pretending every block is equally useful?

GitHub: manishklach/intent-attention-kernel · MIT License · CPU-first research prototype

Long-context inference is increasingly dominated by KV-cache capacity, memory bandwidth, page movement, and structured context management, not just attention compute. The traditional question has been "how do we compute attention faster?" This project asks a different one: what if the runtime tells the execution layer which KV regions matter?

The missing interface is not just between model and kernel. It is between runtime intent and KV execution. When an agentic orchestrator builds a long context out of system prompts, retrieved documents, tool outputs, scratchpads, and recent conversation turns, it knows which regions are critical, which are optional, and which are filler. Today, the attention layer receives none of that signal. It processes every KV block with the same machinery, the same precision, and the same priority.

The Intent Attention Kernel project explores what changes when that structural information is exposed to the execution layer. It is a CPU-first research prototype — no GPU speedups are claimed, no production kernel is provided. But the ideas it prototypes point toward a different architecture for KV execution in the long-context regime.

Semantic KV Blocks · Selected-Block Attention · IntentQuant-KV · Speculative Prefetch · Agentic Context

Context is not flat

Agentic workloads construct structurally heterogeneous context windows. A typical agentic prompt might contain a system prompt, summary memory, multiple retrieved documents with varying relevance, tool output, a scratchpad region, and recent conversation turns. These regions have different attention characteristics, different precision requirements, and different cache behavior.

The structure that attention ignores

Not all blocks deserve equal treatment. System prompts are always critical. Recent context is usually high priority. Retrieved documents may be relevant or irrelevant. Old tool output may be optional. Scratchpads may be compressible or skippable. Generic dense attention treats all of them as one undifferentiated KV stream.

Traditional attention asks: how do we compute attention faster? Intent Attention Kernel asks: what if we do not compute certain regions at all? The project represents each semantic region as a block with an explicit attention policy — ALWAYS, ATTEND (conditional on score), SKIP, RECENT (sliding window), or GLOBAL. The runtime can then select, score, quantize, and schedule KV blocks before they reach the attention kernel.

layout = BlockLayout([
    SemanticBlock("system_prompt",    0,     512,   BlockPolicy.ALWAYS),
    SemanticBlock("retrieved_doc_0", 512,   4096,  BlockPolicy.ATTEND, score=0.82),
    SemanticBlock("retrieved_doc_1", 4096,  8192,  BlockPolicy.SKIP,    score=0.21),
    SemanticBlock("recent_context",  8192,  12288, BlockPolicy.RECENT),
])

out, debug = semantic_block_attention(q, k, v, layout, return_debug=True)

Semantic KV block metadata

The core idea is minimal: add a metadata surface between the runtime and the attention kernel that describes what each block of KV data is, how important it is, and what policy should apply to it. The project names five block policies (ALWAYS, ATTEND, SKIP, RECENT, GLOBAL); the four described here are:

ALWAYS
Always attended. System prompts, global context, immutable instructions.
ATTEND
Conditional attendance based on a runtime relevance score threshold.
SKIP
Not attended at all. Scratchpad filler, stale tool output.
RECENT
Always selected via sliding window / recent turns logic.

The key design principle: do not compute and then mask. Expose structure early enough to avoid the work entirely. A selected-block attention path gathers only the K/V tokens of the blocks that survive selection (SKIP blocks are dropped, and ATTEND blocks are kept only when their score clears the threshold) and computes attention over those regions alone. This is what the CPU reference implementation demonstrates.
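
The gather-then-attend idea can be sketched in plain Python for a single query and a single head. This is a minimal illustration, not the repository's implementation: the class and function names below, the 0.5 score threshold, and the tiny 8-token layout are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from enum import Enum


class BlockPolicy(Enum):
    ALWAYS = "always"
    ATTEND = "attend"
    SKIP = "skip"
    RECENT = "recent"


@dataclass
class SemanticBlock:
    name: str
    start: int            # inclusive token index
    end: int              # exclusive token index
    policy: BlockPolicy
    score: float = 0.0


def selected_indices(blocks, threshold=0.5):
    """Token indices of blocks that survive selection, in original order."""
    keep = []
    for b in blocks:
        if b.policy is BlockPolicy.SKIP:
            continue
        # Hypothetical rule: ATTEND blocks need a score at or above the threshold.
        if b.policy is BlockPolicy.ATTEND and b.score < threshold:
            continue
        keep.extend(range(b.start, b.end))
    return keep


def attention(q, keys, values):
    """Single-query, single-head softmax attention over Python lists."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def selected_block_attention(q, keys, values, blocks, threshold=0.5):
    """Gather selected K/V rows first, then attend only over them."""
    idx = selected_indices(blocks, threshold)
    return attention(q, [keys[i] for i in idx], [values[i] for i in idx]), idx


blocks = [
    SemanticBlock("system_prompt", 0, 2, BlockPolicy.ALWAYS),
    SemanticBlock("retrieved_doc_0", 2, 4, BlockPolicy.ATTEND, score=0.82),
    SemanticBlock("retrieved_doc_1", 4, 6, BlockPolicy.SKIP, score=0.21),
    SemanticBlock("recent_context", 6, 8, BlockPolicy.RECENT),
]
keys = [[float(i % 3), float(i % 2)] for i in range(8)]
values = [[float(i), 1.0] for i in range(8)]
q = [1.0, 0.5]

out, idx = selected_block_attention(q, keys, values, blocks)
print(idx)  # gathered token indices; tokens 4 and 5 (the SKIP block) are absent
```

The skipped block never enters the logits computation, which is the difference from a mask applied after the query-key product.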


Five pillars of the project

The project explores five complementary mechanisms for exposing runtime intent to the KV execution layer:

Pillar 1

Semantic KV Block Selection

The runtime marks regions of context with policies. The CPU reference gathers selected K/V tokens and computes attention only over selected regions. This is the core interface between runtime intent and the attention layer.

Pillar 2

Selected-Block Attention Reference

A PyTorch CPU oracle checks correctness for non-causal selected-block attention by comparing selected-KV output against dense attention over the same selected K/V tensors.

Pillar 3

Dynamic Block Scoring

A lightweight heuristic scores candidate blocks using query-to-block cosine similarity. It is not a trained router; it is a prototype control-plane signal of the kind a future runtime could consume.
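
One way such a heuristic could look is sketched below. The source only says the score is a query-to-block cosine similarity; representing a block by the mean of its key vectors, the function names, and the 0.5 selection threshold are assumptions for illustration.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # treat a zero vector as carrying no relevance signal
    return dot / (na * nb)


def score_blocks(query, keys, block_bounds):
    """Return {block_name: cosine(query, mean key of that block)}."""
    return {
        name: cosine(query, mean_vector(keys[start:end]))
        for name, (start, end) in block_bounds.items()
    }


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bounds = {"doc_a": (0, 2), "doc_b": (2, 4)}   # hypothetical block names and bounds
scores = score_blocks([1.0, 0.0], keys, bounds)
selected = [name for name, s in scores.items() if s >= 0.5]
print(selected)
```

A score like this costs one dot product per block rather than one per token, which is why it can plausibly run ahead of the attention call as a control-plane step.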

Pillar 4

IntentQuant-KV

Not every KV block deserves the same precision. Critical regions stay FP16/FP8; lower-score or colder blocks use INT8, INT4, INT4_RESIDUAL, or SKIP. No accuracy or perplexity preservation is claimed.

Pillar 5

Speculative KV Prefetch

Adjacent decode steps may reuse similar KV regions. A prefetcher predicts likely next-step pages. Current benchmark simulates hit rate only. Prefetch must never affect correctness.
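
A hit-rate simulation of this kind can be very small. The sketch below assumes the simplest possible predictor (prefetch the pages touched on the previous decode step, capped by a budget); the trace, the budget, and the function name are made up for the example and are not taken from the repository's benchmark.

```python
def simulate_prefetch(trace, budget):
    """Estimate prefetch hit rate over a trace of decode steps.

    trace:  list of sets, each holding the page ids a decode step touched.
    budget: maximum number of pages the predictor may prefetch per step.
    """
    hits = 0
    total = 0
    prev = set()
    for pages in trace:
        # Predict that the next step reuses the previous step's pages.
        # Sorting keeps the choice deterministic when the budget truncates.
        predicted = set(sorted(prev)[:budget])
        hits += len(predicted & pages)
        total += len(pages)
        prev = pages
    return hits / total if total else 0.0


trace = [{0, 1, 2}, {0, 1, 3}, {0, 3, 4}, {0, 3, 4}]
hit_rate = simulate_prefetch(trace, budget=3)
print(f"simulated hit rate: {hit_rate:.3f}")
```

The simulation only counts overlaps. A miss here stands for a page that would still be fetched on demand, so the prediction can change latency but never the attention result.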


IntentQuant-KV: precision as a policy decision

Uniform KV quantization applies the same precision across all tokens regardless of their role in the context. IntentQuant-KV treats precision as an execution policy: critical blocks stay high precision, cold blocks get compressed, skipped blocks contribute zero bytes.

| Precision | Where It Applies |
| --- | --- |
| FP16 | Critical blocks: system prompts, global memory, recent context |
| FP8 | Important selected blocks: high-scoring attended documents |
| INT8 | Relevant but non-critical blocks: moderate-score attended regions |
| INT4_RESIDUAL | Medium-score blocks: useful but lower confidence; residual correction |
| INT4 | Cold or low-score blocks: old tool outputs, low-relevance documents |
| SKIP | Skipped blocks: zero KV bytes contributed |

quantizer = IntentQuantizer(memory_pressure=0.7)
policies = quantizer.assign_layout_precision(layout)
summary = quantizer.summary_table(layout, heads=32, head_dim=128)
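
To make the byte accounting concrete, here is a rough standalone sketch of a precision-assignment rule and a per-layer KV byte model. It does not reproduce IntentQuantizer: the score thresholds (0.8 and 0.5) and the function names are assumptions, INT4_RESIDUAL is left out because its storage overhead is not specified here, and quantization scales and zero-points are ignored.

```python
# Bytes per stored element for each precision tier considered in this sketch.
BYTES_PER_ELEMENT = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5, "SKIP": 0.0}


def assign_precision(policy, score):
    """Hypothetical mapping from (policy, score) to a precision tier."""
    if policy == "SKIP":
        return "SKIP"
    if policy in ("ALWAYS", "RECENT"):
        return "FP16"          # critical regions stay at high precision
    if score >= 0.8:
        return "FP8"
    if score >= 0.5:
        return "INT8"
    return "INT4"


def kv_bytes(tokens, heads, head_dim, precision):
    """KV bytes for one layer; the factor 2 covers both K and V."""
    return 2 * tokens * heads * head_dim * BYTES_PER_ELEMENT[precision]


# (name, token count, policy, score), using the token ranges of the layout shown earlier.
blocks = [
    ("system_prompt", 512, "ALWAYS", 1.0),
    ("retrieved_doc_0", 3584, "ATTEND", 0.82),
    ("retrieved_doc_1", 4096, "SKIP", 0.21),
    ("recent_context", 4096, "RECENT", 1.0),
]
heads, head_dim = 32, 128

mixed = sum(kv_bytes(t, heads, head_dim, assign_precision(p, s)) for _, t, p, s in blocks)
uniform = sum(kv_bytes(t, heads, head_dim, "FP16") for _, t, _, _ in blocks)
print(f"mixed / uniform FP16 bytes: {mixed / uniform:.3f}")
```

The ratio is a storage model only. Whether it translates into bandwidth savings depends on dequantization cost and hardware support, which this sketch does not model.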

Honest boundaries

No production quantization kernel exists in this project. No model-quality or perplexity validation has been performed. The quantizer uses fake quantization only. Dequant overhead may dominate any bandwidth savings. Real benefit depends on bandwidth pressure, page reuse, hardware support, and attention fusion — none of which are measured here.
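
For readers unfamiliar with the term, fake quantization means rounding values to a low-precision grid and immediately converting them back to floats, so the error can be measured without any packed integer storage or kernel support. The sketch below uses symmetric per-tensor scaling, which is an assumption; the repository's quantizer may use a different scheme.

```python
def fake_quantize(values, bits):
    """Symmetric per-tensor quantize-then-dequantize; the result is still floats."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)               # nothing to scale
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared reconstruction error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


x = [0.03 * i - 0.9 for i in range(61)]   # evenly spaced samples in about [-0.9, 0.9]
err8 = mse(x, fake_quantize(x, 8))
err4 = mse(x, fake_quantize(x, 4))
print(f"INT8 MSE: {err8:.2e}, INT4 MSE: {err4:.2e}")
```

Reconstruction error of this kind is what a byte-model benchmark can report. It says nothing about downstream model quality, which is the boundary the section above draws.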


System architecture

The architecture separates CPU-first simulation from a future GPU kernel path, with metadata-driven selection at the center. The CPU layer provides a PyTorch reference for dense and selected-block attention, an analytical cost model for FLOP and KV-byte savings, a synthetic trace generator for deterministic agentic layouts, six benchmark scripts, and a pytest suite with 74+ tests.

The future GPU kernel path is sketched but not implemented: Triton selected-block attention that iterates only over selected pages, per-page dequantization for INT8 mixed-precision attention, a paged KV block table with physical page indirection, query-position-aware masking for correct causal selected-block attention, and prefetch or staging hints for latency hiding.
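
The paged block table with physical page indirection is the piece of that list easiest to picture in code. The sketch below maps a logical token range onto physical pages, including partial first and last pages. The page size, the table contents, and the function name are invented for illustration; nothing here is taken from the repository.

```python
def pages_for_block(start, end, page_size, block_table):
    """Map the logical token range [start, end) to (physical_page, lo, hi) slices.

    lo and hi are token offsets inside the page, so a block that begins or
    ends mid-page yields a partial slice instead of a whole page.
    """
    slices = []
    pos = start
    while pos < end:
        logical_page = pos // page_size
        lo = pos % page_size
        hi = min(page_size, lo + (end - pos))
        slices.append((block_table[logical_page], lo, hi))
        pos += hi - lo
    return slices


# Logical page -> physical page. The mapping is arbitrary on purpose:
# physical pages need not be contiguous or ordered.
block_table = {0: 7, 1: 2, 2: 9, 3: 4}

slices = pages_for_block(start=12, end=40, page_size=16, block_table=block_table)
print(slices)
```

A selected-block kernel would iterate over slices like these for selected blocks only, so pages belonging to skipped blocks are never read. Emitting slices in logical order also preserves the KV ordering that later causal handling depends on.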

Data flow

Agentic Runtime → Semantic Blocks → Policy Selection → Dynamic Scoring → Paged KV Table → IntentQuant-KV → Prefetch Prediction → Selected-KV Attention → Future GPU Kernel


Dense vs. masked vs. intent-aware

How different approaches compare in terms of what they know, what work they avoid, and where the limits lie:

| Approach | What It Knows | Work Avoided | Limitation |
| --- | --- | --- | --- |
| Dense attention | Flat token stream | None | Treats all context equally |
| Masked attention | Token/block mask | Often limited | Late-binding; mask applied post-QK |
| Selected-block attention | Block bounds + policies | CPU gather over selected K/V | Needs causal / query-position support |
| Intent-aware KV execution | Policy + score + quant + prefetch | Analytical / simulated today | Hardware validation; complex interface |

Causal attention limitation

Non-causal selected-block attention is implemented and verified in the CPU reference. Causal selected-block attention is intentionally not implemented: selected KV tokens retain their original context positions, so a correct causal implementation requires query-position-aware masking, with explicit query positions or per-query causal bounds (the subject of Phase 2 in the roadmap).

Calling semantic_block_attention(..., causal=True) currently raises NotImplementedError. This is the hardest open technical problem in the project.
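
To show why original positions matter, the sketch below carries each gathered token's original context position alongside the K/V rows and hides any token that lies after the query's position. It is an illustration of the requirement, not the project's planned design; the function name and the single-query, single-head setup are assumptions.

```python
import math


def causal_selected_attention(q, q_pos, keys, values, kv_positions):
    """Attention for one query at original position q_pos over gathered K/V.

    kv_positions[i] is the original context position of gathered row i.
    The row index inside the gathered tensor is NOT a valid position,
    because gathering compacts away the skipped blocks.
    """
    scale = 1.0 / math.sqrt(len(q))
    visible = [i for i, p in enumerate(kv_positions) if p <= q_pos]
    if not visible:
        raise ValueError("no visible KV tokens for this query position")
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in visible]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, visible)) / total
            for d in range(dim)]


# Gathered rows come from original positions 0, 1, 6, 7 (positions 2..5 were skipped).
kv_positions = [0, 1, 6, 7]
keys = [[0.0, 0.0]] * 4
values = [[1.0], [2.0], [3.0], [100.0]]
q = [0.0, 0.0]                      # zero query gives uniform weights over visible rows

out_at_6 = causal_selected_attention(q, 6, keys, values, kv_positions)
out_at_7 = causal_selected_attention(q, 7, keys, values, kv_positions)
print(out_at_6, out_at_7)
```

A query at position 6 must not see the token from position 7 even though that token sits at row 3 of the gathered tensor. A standard lower-triangular mask built from row indices would get this wrong, which is why per-query positions or bounds are needed.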


What the benchmarks actually validate

Every benchmark in the repository is explicit about what it does not prove. This is by design: the project is a CPU-first prototype, and conflating CPU measurement with GPU prediction would be misleading.

| Benchmark | What It Validates | What It Does Not Prove |
| --- | --- | --- |
| bench_cost_model | Analytical FLOP and KV-byte savings | Real GPU speedup |
| bench_cpu_reference | Correctness and sanity | Fused kernel performance |
| bench_intent_quant | Byte model and reconstruction metrics | Model quality or perplexity |
| bench_prefetch | Simulated hit rate | Real latency hiding |
| bench_dynamic_scoring | Synthetic scoring behavior | Routing quality |
| bench_kv_quant | KV byte savings model | Real memory bandwidth reduction |

Roadmap

The project is organized into seven phases, each scoped to produce a concrete artifact before the next begins:

Phase 1 CPU-first correctness — metadata surface, block selection, dense baseline, selected-block reference, cost model, synthetic traces, benchmarks, test suite. ✓ Complete.
Phase 2 Query-position-aware causal attention — correct causal selected-block attention with explicit query positions or per-query causal bounds.
Phase 3 Paged KV block table — support for partial-page token bounds and logical KV ordering preservation.
Phase 4 Triton selected-block attention kernel — iterate only over physical pages from the block table; skip unused pages entirely.
Phase 5 IntentQuant-KV dequant-in-attention fusion — fuse per-page dequantization with the attention kernel to reduce memory round-trips.
Phase 6–7 Prefetch / staging on real NVIDIA hardware, then integration experiments with vLLM and HuggingFace Transformers.

What is not claimed

The project maintains explicit, documented boundaries around what it does and does not assert. These are worth reproducing because they define the scope honestly:

Not claimed
No GPU speedups are claimed. No production-ready Triton or CUDA kernel is claimed. No real NVIDIA hardware validation has been performed.
Not claimed
Quantization has not been validated for model accuracy or perplexity. No superiority over KIVI, KVQuant, or TurboQuant is claimed.
Not claimed
No production quantization kernel is provided. No model quality guarantee is made.
Not claimed
Prefetch has not been validated for real latency improvement. Dynamic scoring is a heuristic, not a trained routing model.

CPU Ratio (dense_time / semantic_time) is explicitly documented as not a GPU speedup claim. It is affected by PyTorch dispatch overhead, gather overhead, cache behavior, tensor size, and small-batch effects. The analytical KV and FLOP savings models document the theoretical upper bound of work avoidance, not measured GPU performance.
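
The upper-bound character of the analytical model is easy to see in a few lines. The sketch below uses the standard approximations for attention cost (about 4 · q_len · kv_len · head_dim FLOPs per head, and one read of K and V per decode step); the function names are hypothetical, and the token counts reuse the 12,288-token example layout with its 4,096-token SKIP block removed.

```python
def attention_flops(q_len, kv_len, heads, head_dim):
    """Approximate attention FLOPs for one layer.

    QK^T and the weighted sum over V each cost about
    2 * q_len * kv_len * head_dim FLOPs per head.
    """
    return 4 * q_len * kv_len * heads * head_dim


def kv_read_bytes(kv_len, heads, head_dim, bytes_per_element=2):
    """Bytes read to stream K and V once for one layer (FP16 by default)."""
    return 2 * kv_len * heads * head_dim * bytes_per_element


def savings(dense, selected):
    """Fraction of work avoided relative to the dense baseline."""
    return 1.0 - selected / dense


total_tokens, selected_tokens = 12288, 8192
heads, head_dim = 32, 128

flop_saving = savings(attention_flops(1, total_tokens, heads, head_dim),
                      attention_flops(1, selected_tokens, heads, head_dim))
byte_saving = savings(kv_read_bytes(total_tokens, heads, head_dim),
                      kv_read_bytes(selected_tokens, heads, head_dim))
print(f"FLOP saving: {flop_saving:.3f}, KV-byte saving: {byte_saving:.3f}")
```

Both quantities scale linearly with the number of KV tokens, so skipping a third of the context removes a third of the modeled work. Gather cost, dispatch overhead, and cache effects are absent from the model, which is why measured CPU ratios can differ from it.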


Quick start

pip install -e ".[dev]"
python -m py_compile src/intent_attention/*.py
pytest -q

python benchmarks/bench_cost_model.py
python benchmarks/bench_cpu_reference.py
python benchmarks/bench_kv_quant.py
python benchmarks/bench_prefetch.py
python benchmarks/bench_dynamic_scoring.py
python benchmarks/bench_intent_quant.py

Explore the repository

github.com/manishklach/intent-attention-kernel · MIT License