Provisional Application filed under the Patents Act, 1970 · Docket No. 65779 · App. No. 202641056309
The Problem Nobody Is Talking About
The AI industry is obsessed with compute. More GPUs, faster chips, higher FLOPS. But the real bottleneck in production LLM inference — the thing that silently caps your throughput and balloons your cost — is not compute. It's memory orchestration.
Here's the arithmetic nobody wants to do publicly. A 70-billion parameter model running in FP16 needs to store its KV-cache — the accumulated key and value attention vectors from every token in the context — for as long as the request is active. The cache grows linearly with context length:

KV_bytes = 2 (K and V) × n_layers × n_heads × d_head × bytes_per_element × context_tokens
Plug in a real model: 80 layers, 64 heads, head dimension 128, FP16. At a 128,000-token context — increasingly standard for agents, long-document tasks, and multi-turn sessions — that's 320 GB per active request. An H100 has 80 GB of HBM. You can do the math.
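That arithmetic is easy to verify. A quick check in Python, using the figures above (128K taken as 131,072 tokens):

```python
# KV-cache size for one request: 2 tensors (K and V) per layer,
# one d_head-element vector per head per token, FP16 = 2 bytes.
n_layers, n_heads, d_head = 80, 64, 128
bytes_per_elem = 2                 # FP16
context_tokens = 128 * 1024        # 128K-token context

bytes_per_token = 2 * n_layers * n_heads * d_head * bytes_per_elem
total_bytes = bytes_per_token * context_tokens

print(bytes_per_token)             # 2621440 bytes per token (about 2.5 MiB)
print(total_bytes / 2**30)         # 320.0 GiB, against 80 GiB of H100 HBM
```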
The existing solutions — vLLM's PagedAttention, TRT-LLM's chunked prefill, simple NVMe offload — are all software patches on a hardware problem. They share a fatal flaw: the operating system is semantically blind. The kernel's LRU page replacement policy has no idea what a "decode step" is. It doesn't know that a block of KV vectors accessed 300 decode steps ago will almost certainly be needed again when the model's attention sweeps back toward the beginning of the context. It evicts based on recency of memory access, not recency of transformer-semantic relevance.
The result: unnecessary evictions, cold cache misses on the hot path, and GPU compute pipelines stalling to wait for PCIe DMA transfers that should have been prefetched 100 milliseconds ago.
Existing KV offload is like asking a filing clerk to manage a nuclear reactor: they'll move papers when asked, but they have no idea which papers are critical or when you'll need them next.
The Key Insight: Semantic Memory
The insight behind the KV-CPU is simple to state and hard to execute: the hardware should understand the transformer decode loop.
Autoregressive decoding has a structure that LRU/LFU policies completely ignore. Every time the model generates a new token, it is at decode step t. Every KV block in the cache has a known last-access step. The distance between the current step and a block's last access step — the step proximity — is the single most predictive signal for whether that block will be needed in the next 50–200 steps. No software LRU policy has this signal. A hardware device that receives the current decode step via a single MMIO write does.
This is the fundamental shift the KV-CPU proposes: move KV-cache policy from a reactive software loop running on the host CPU to a predictive hardware policy engine that operates in silicon at nanosecond latency, informed by transformer inference semantics that the OS has historically never received.
Architecture: A Closed-Loop System
The KV-CPU is not a memory expander with some compute bolted on. It is a closed-loop KV block lifecycle system: a single CXL 3.0 device where four co-integrated hardware components form a continuous feedback loop that runs every decode step without any host CPU involvement.
The loop works as follows. The GPU writes the current decode step t to a hardware register (a single 64-bit MMIO write, ~50 ns). This triggers the HEPC to scan all tracked KV blocks, recompute their priority scores, schedule low-priority blocks for async eviction to T2/T3, and proactively stage high-priority blocks into T1. Meanwhile, the GPU is already starting its next decode step — it never waits. The NMCE handles attention scoring for any blocks that land in T1, returning only scalar scores to the GPU across PCIe instead of raw key vectors. The loop closes when those score operations update the RTBD's access metadata, which feeds the next HEPC priority scan.
The central inventive concept: It is not any single component but the coupling — NMCE output semantically informs HEPC priority; HEPC drives RTBD tier placement; RTBD lookup enables NMCE compute — that makes this architecture novel and hard to design around.
The four pillars are each claimed independently in the patent, so any one of them can be pursued even if another faces prior art:
NMCE — Near-Memory Compute Engine
Computes attention score dot products from key vectors resident in on-device LPDDR5X. Only scalar scores traverse PCIe. 25.6×–128× traffic reduction.
HEPC — Hardware Eviction & Prefetch Controller
Fixed-function FSM computing P(Bᵢ) = w_r·R + w_f·F + w_s·S + w_d·D in silicon. Decode-step-aware eviction triggered at ~50 ns via MMIO.
RTBD — Request-Tagged Block Directory
Hardware SRAM CAM with per-request isolation and hardware reference-counted prefix sharing. Saves C × prefix_KV_GB for C concurrent requests.
Kernel Driver — Hardware Control Plane
Linux kernel driver as hardware control plane. Every madvise call, io_uring opcode, and ioctl terminates in a hardware register write to HEPC or RTBD silicon.
Pillar I: The NMCE — Rethinking Where Attention Happens
Today, when an inference server needs to compute attention scores for a KV block that has been offloaded from GPU HBM to T1 DRAM, it does the following: fetch the entire key block across PCIe (8,192 bytes for D=128, B=32, FP16), compute the dot products on the GPU, and discard the key data. The only thing the GPU actually needed from that transfer was a vector of 32 scalar scores — 64 bytes.
The NMCE eliminates the key vector transfer entirely. It is a fixed-function dot-product array embedded in the KV-CPU logic die, physically adjacent to the LPDDR5X memory controller. The GPU sends a query vector Q_h (256 bytes) and a memory address range. The NMCE fetches the key block from T1 over the on-package interconnect — never touching PCIe — computes the scores, and sends back 64 bytes of scalars.
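The traffic saving falls straight out of those sizes. A quick check, using the D=128, B=32, FP16 figures quoted above:

```python
D, B, fp16 = 128, 32, 2        # head dim, block size, bytes per FP16 element

key_block = D * B * fp16       # what crosses PCIe today: 8,192 bytes
query_up  = D * fp16           # NMCE request: 256-byte query vector Q_h
scores_dn = B * fp16           # NMCE reply: 64 bytes of scalar scores

reduction = key_block / (query_up + scores_dn)
print(reduction)               # 25.6x less PCIe traffic per block scored
```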
```c
/* Signal step t=128 → triggers HEPC scan + NMCE pipeline setup */
struct kv_cpu_step_info step = { .step = 128 };
ioctl(fd, KV_CPU_STEP_ADVANCE, &step);

/* Mark prefix KV range as permanently hot */
struct kv_cpu_block_info hot = {
    .va  = (uint64_t)prefix_kv_ptr,
    .len = prefix_kv_len,
};
ioctl(fd, KV_CPU_MARK_HOT, &hot);

/* Hint that a specific block will be needed at step 256 */
struct kv_cpu_block_info pf = {
    .va          = (uint64_t)future_block_ptr,
    .len         = block_size,
    .target_step = 256,
};
ioctl(fd, KV_CPU_PREFETCH, &pf);
```
The NMCE is not a general-purpose PIM (Processing-In-Memory) unit. It does not run GEMM or generic vector operations. Its arithmetic circuit is specifically designed for scaled dot-product attention — query × key with 1/√D scaling and a piecewise-linear exp() approximation. This specificity is both a feature (efficiency) and a patent moat: a competitor adding generic ALUs to their memory chips does not replicate the NMCE without explicitly targeting transformer attention semantics.
Pillar II: The HEPC — Making Eviction Transformer-Aware
The HEPC is the policy brain of the KV-CPU. At its core is a composite priority scoring formula computed in fixed-function combinational logic:

P(Bᵢ) = w_r·R(Bᵢ) + w_f·F(Bᵢ) + w_s·S(Bᵢ) + w_d·D(Bᵢ)
Each component captures a different dimension of KV block importance:
| Component | Signal | Why it matters for transformers |
|---|---|---|
| R(Bᵢ) — Recency | Saturating counter, reset on access, decays per step | Recently used blocks likely to be used again soon — same intuition as LRU but in decode steps, not wall-clock time |
| F(Bᵢ) — Frequency | Saturating counter, incremented on each access | Identifies working set — system prompt blocks touched every step saturate quickly |
| S(Bᵢ) — Step Proximity | max(0, W − (t_current − t_last_access)) | The novel component. No LRU/LFU/ARC policy has this signal. Requires knowing the current decode step from the GPU — impossible for any software-only eviction policy. |
| D(Bᵢ) — Prefix Dependency | Binary flag from RTBD is_prefix field, weighted maximally | Shared prefix blocks must survive eviction while any request references them — enforced in hardware, sub-100 ns |
Once the GPU writes to the step-advance register, the HEPC runs its three-phase cycle entirely in hardware — scan all RTBD entries, evict low-priority blocks via async DMA to T2/T3, prefetch high-priority blocks from T2/T3 to T1. The eviction and prefetch DMA engines run on separate channels so they never serialise. The GPU is already computing the next token. Nobody waits.
Why software can't replicate this: A software eviction policy running on the host CPU incurs a minimum of 200–600 µs per decision (kernel scheduling + syscall overhead). The HEPC responds to a step-advance signal in ~50 ns — a 4,000× to 12,000× latency reduction. At 50 decode steps per second with 65,000 tracked KV blocks, software would spend nearly all its time on eviction decisions. The HEPC does it for free.
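Those ratios are just the quoted latencies divided out:

```python
sw_lo, sw_hi = 200e-6, 600e-6   # software eviction decision, seconds
hw = 50e-9                      # HEPC response to a step-advance MMIO write

print(round(sw_lo / hw), round(sw_hi / hw))   # 4000 12000
```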
Pillar III: The RTBD — Hardware-Enforced Multi-Tenant Isolation
Production inference deployments are multi-tenant by default. Dozens or hundreds of concurrent requests share the same GPU, the same KV-CPU, and — critically — often the same system prompt prefix. Without hardware isolation, KV blocks from different requests can collide, corrupt each other's cache, or compete unfairly for eviction priority.
The RTBD is a fully-associative Content-Addressable Memory (CAM) in SRAM on the KV-CPU logic die, supporting up to 65,536 tracked KV blocks simultaneously. Each entry is a 240-bit descriptor covering: request_id (16b), layer_idx (8b), head_idx (8b), token range (64b), current tier location (2b: GPU/T1/T2/T3), physical address (64b), priority score (16b), prefix flag (1b), reference count (8b), access step (32b), and dirty bit (1b).
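As a sanity check on the entry layout (field widths as listed above; treating the unnamed remainder of the 240-bit entry as reserved bits is an assumption, not something the spec states):

```python
# RTBD descriptor fields: (name, width in bits), per the list above
fields = [
    ("request_id", 16), ("layer_idx", 8), ("head_idx", 8),
    ("token_range", 64), ("tier", 2), ("phys_addr", 64),
    ("priority_score", 16), ("is_prefix", 1), ("ref_count", 8),
    ("last_access_step", 32), ("dirty", 1),
]
named_bits = sum(w for _, w in fields)
print(named_bits)   # 220 named bits; the 240-bit entry leaves 20 unnamed
```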
The prefix sharing mechanism is the most commercially interesting feature. In a system with 100 concurrent requests all using the same 4,096-token system prompt, a naïve system stores 100 copies of ~10.7 GB of prefix KV vectors — 1.07 TB of T1 for prefix alone. The RTBD's hardware reference counting stores exactly one copy, tagged with request_id=0x0000 (shared) and is_prefix=1. The reference counter prevents the HEPC from evicting the block while any request holds a reference. 100 requests, 10.7 GB — a 100× T1 capacity saving on the single most memory-intensive data structure in production LLM serving.
```c
/* When a new request joins sharing a common prefix */
struct kv_cpu_block_info share = {
    .va  = (uint64_t)prefix_kv_block_pa,
    .len = prefix_block_size,
};
ioctl(fd, KV_CPU_SHARE_PREFIX, &share);
/* → Hardware increments ref_count; HEPC cannot evict this block */

/* When the request terminates */
ioctl(fd, KV_CPU_EVICT, &share);
/* → Hardware decrements ref_count; eviction eligible only when ref_count == 0 */
```
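The reference-counting semantics can be modeled in a few lines. This is a behavioral sketch, not the RTL; the sizes reuse the 100-request, 4,096-token-prefix example above:

```python
class PrefixBlock:
    """Toy model of RTBD reference counting for one shared prefix block."""
    def __init__(self):
        self.ref_count = 0
        self.is_prefix = True

    def share(self):          # KV_CPU_SHARE_PREFIX: a request joins
        self.ref_count += 1

    def release(self):        # KV_CPU_EVICT on a shared block: a request leaves
        self.ref_count -= 1

    def evictable(self):      # the HEPC hardware guard
        return not (self.is_prefix and self.ref_count > 0)

blk = PrefixBlock()
for _ in range(100):          # 100 concurrent requests share the prefix
    blk.share()
assert not blk.evictable()    # HEPC cannot touch the block
for _ in range(100):          # all requests terminate
    blk.release()
assert blk.evictable()        # only now is it eviction-eligible

# One prefix copy: 4,096 tokens of KV for the 80-layer model above
prefix_gib = 4096 * 2 * 80 * 64 * 128 * 2 / 2**30
print(100 * prefix_gib, "vs", prefix_gib)   # 1000.0 vs 10.0 (GiB of T1)
```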
Pillar IV: The Kernel Driver — Hardware Control Plane
The Linux kernel driver (kv_cpu.ko) is Pillar IV — and it is the piece that transforms the KV-CPU from an interesting chip idea into a deployable system. It is also the piece that people most frequently misunderstand.
The driver is not a software invention. It is a hardware control plane. Here is the critical distinction: every API the driver exposes — every ioctl, every madvise call, every io_uring opcode — terminates in a write to a hardware register in the KV-CPU silicon. The HEPC priority scoring cannot function correctly without receiving the decode step via the step-advance register write. The RTBD eviction guard for prefix blocks cannot activate without the RTBD_SHARE register command. Remove the driver and the hardware loop cannot close. Remove the hardware and the driver is meaningless. They are co-dependent.
The driver architecture in the repository under drivers/misc/kv_cpu/ implements:
- PCI probe/remove lifecycle — standard `pci_driver` framework with BAR0 MMIO mapping, MSI-X interrupt handling, and mock mode for testing without hardware
- Character device `/dev/kvcpu0` — primary UAPI entry point for ioctl commands: `KV_CPU_STEP_ADVANCE`, `KV_CPU_MARK_HOT`, `KV_CPU_EVICT`, `KV_CPU_PREFETCH`, `KV_CPU_SHARE_PREFIX`
- HMAT NUMA tier registration — registers T1 LPDDR5X as a distinct NUMA node via `add_memory_driver_managed()` and `alloc_memory_type()`, making it addressable via `mbind()` and `numactl`
- madvise extensions — `MADV_KV_HOT` (25), `MADV_KV_EVICT` (26), `MADV_KV_PREFETCH` (27); pending kernel RFC, ioctl fallback available
- io_uring opcodes — `IORING_OP_KV_STAGE` and `IORING_OP_KV_EVICT` for zero-copy async tier migration via registered fixed buffers
```bash
# Load the driver (mock mode — no hardware required)
sudo insmod kv_cpu.ko mock=1

# Signal decode step 128 → triggers HEPC hardware scan
./tools/kvctl step 128

# Mark a KV block range as hot (protect from eviction)
./tools/kvctl hot 0x7f001000 4096

# Schedule immediate eviction
./tools/kvctl evict 0x7f005000 4096

# Hint that a block will be needed at step 256
./tools/kvctl prefetch 0x7f009000 4096 256

# Mark a prefix range as shared across requests
./tools/kvctl share 0x7f010000 10737418240
```
The driver is structured for upstreaming. The include/uapi/linux/kv_cpu.h header defines the stable ABI contract between userspace LLM runtimes and the kernel, following Linux UAPI conventions. The source lives under drivers/misc/kv_cpu/ — the correct location for a new device class pending its own drivers/kvcpu/ subsystem.
Low-latency step-advance path: For the absolute hot path — GPU-side processes that need sub-100 ns step-advance without any syscall overhead — the driver exposes a BAR0 mmap window. A userspace process can memory-map the HEPC control registers and write the step-advance register directly. On real hardware this is a single PCIe TLP write with end-to-end latency under 100 ns.
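A minimal sketch of that pattern is below. Two loud assumptions: the register offset (0x0) is illustrative, and an anonymous mapping stands in for the real BAR0 window so the sketch runs without hardware. On a real device you would instead map the BAR0 region exposed via `/dev/kvcpu0` and issue the same 8-byte store:

```python
import mmap
import struct

REG_STEP_ADVANCE = 0x0   # assumed register offset, for illustration only

# Anonymous mapping as a stand-in for the BAR0 MMIO window.
# On real hardware: mmap /dev/kvcpu0 (MAP_SHARED) and write the same 8 bytes,
# which the device sees as a single posted PCIe TLP write.
bar0 = mmap.mmap(-1, 4096)

# Hot path: one 64-bit store per decode step, no syscall involved.
struct.pack_into("<Q", bar0, REG_STEP_ADVANCE, 128)   # decode step t = 128

(step,) = struct.unpack_from("<Q", bar0, REG_STEP_ADVANCE)
print(step)   # 128
bar0.close()
```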
Hardware Design: From Concept to Silicon Specification
The repository includes not just the kernel driver but a full hardware specification covering RTL, MMIO address maps, packaging, and thermal design. This is unusual for an open-source patent-pending project — and intentional. The goal is to provide enough detail that a chip company, foundry partner, or university research group could begin an implementation with this repo as the starting point.
The SystemVerilog RTL files in hardware/ implement the two most complex components:
kv_cpu_hepc_scorer.sv — The HEPC priority engine. A purely combinational design: step_diff = current_step - last_access_step, then S_component = (step_diff < 255) ? (255 - step_diff) : 0, then the composite score p_score = (w_r*R) + (w_f*F) + (w_s*S) + (w_d*D_flag). The DMA trigger logic includes the guard: if (is_prefix && ref_count > 0) trigger_evict_dma = 0. This guard is critical — it is the hardware implementation of the RTBD's eviction protection for shared prefix blocks.
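A Python reference model of that combinational path helps make the guard concrete. The weights and the eviction threshold (64) are illustrative assumptions; the RTL holds them in configuration registers:

```python
def hepc_score(current_step, last_access_step, R, F, is_prefix, ref_count,
               w_r=1, w_f=1, w_s=2, w_d=4):
    """Behavioral model of kv_cpu_hepc_scorer.sv (weights/threshold illustrative)."""
    step_diff = current_step - last_access_step
    S = (255 - step_diff) if step_diff < 255 else 0   # step-proximity component
    D_flag = 1 if is_prefix else 0
    p_score = w_r * R + w_f * F + w_s * S + w_d * D_flag
    # The hardware guard: a shared prefix block is never evicted while referenced
    trigger_evict_dma = (p_score < 64) and not (is_prefix and ref_count > 0)
    return p_score, trigger_evict_dma

# Stale, unshared block: low score, eligible for eviction
print(hepc_score(1000, 400, R=0, F=1, is_prefix=False, ref_count=0))  # (1, True)
# Equally stale *prefix* block still referenced by 3 requests: guard holds
print(hepc_score(1000, 400, R=0, F=1, is_prefix=True, ref_count=3))   # (5, False)
```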
kv_cpu_nmce_dpu.sv — The NMCE dot-product unit array. A 128-wide systolic MAC array parameterized over head dimension D and block size B. Computes score[i] = approx_exp(dot_product(Q, K[i]) * scale_factor) for all B key vectors in a single pipeline pass. The approx_exp() function is a piecewise-linear approximation of sufficient accuracy (±0.1% relative error) for attention score ranking.
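That accuracy target is plausible with a modest table. A piecewise-linear interpolation of exp() over a clamped score range reaches roughly 0.01% relative error with 256 segments; the segment count and input range below are assumptions for illustration, not the taped-out design:

```python
import math

LO, HI, SEGS = -8.0, 0.0, 256                # assumed post-scaling input range
STEP = (HI - LO) / SEGS
TABLE = [math.exp(LO + i * STEP) for i in range(SEGS + 1)]   # knot values

def approx_exp(x):
    """Piecewise-linear exp(): table lookup plus one multiply-add."""
    x = min(max(x, LO), HI)                  # saturate outside the table
    i = min(int((x - LO) / STEP), SEGS - 1)
    frac = (x - (LO + i * STEP)) / STEP
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])

# Relative error stays well inside the 0.1% ranking budget
worst = max(abs(approx_exp(x) - math.exp(x)) / math.exp(x)
            for x in [LO + k * (HI - LO) / 10000 for k in range(10001)])
print(f"max relative error: {worst:.2e}")
```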
The silicon specification targets:
| Parameter | Specification |
|---|---|
| Process node | TSMC N5 or N7 FinFET |
| NMCE DPU array | 128× FP16/BF16 MAC units · 256 GFLOPS on-device |
| RTBD CAM | 65,536 entries · 240 bits per entry · SRAM |
| T1 memory | 128–512 GB LPDDR5X · 4× 32-bit channels · 256 GB/s |
| Host interface | CXL 3.0 / PCIe 5.0 (x16 · 64 GB/s) |
| TDP | 20–28 W active · 8–12 W idle · 3 W standby |
| Core voltage | 0.75 V Vdd |
| Form factor | PCIe HHHL or CXL EDSFF E3.S |
| Step-advance latency | ~50 ns (MMIO write → HEPC scan start) |
The MMIO address space uses a clean dual-BAR layout: BAR0 for Control/Status registers (HEPC lifecycle, DMA management, NMCE descriptor submission at 0x0000–0x2FFF), and BAR2 for the RTBD TAG STORE — a 2 MB direct-access SRAM window containing all 65,536 hardware CAM entries at 32-byte alignment.
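The BAR2 sizing is internally consistent, assuming each 240-bit entry is padded into its 32-byte slot:

```python
entries, slot_bytes = 65536, 32      # 240-bit entry padded to a 32-byte slot
tag_store = entries * slot_bytes
print(tag_store / 2**20)             # 2.0 MiB, the BAR2 direct-access window
```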
Performance Impact: The Numbers
| Metric | Without KV-CPU | With KV-CPU | Improvement |
|---|---|---|---|
| Max context length (70B, 1× H100) | ~20k tokens (HBM OOM) | ~256k tokens (T1-bound) | 12.8× expansion |
| PCIe traffic per KV block scored | 8,192 bytes | 320 bytes | 25.6× reduction |
| Multi-head amortised traffic reduction | — | Up to 128× | via QVB |
| KV block score latency (T1) | 200–600 µs (SW + DMA) | ~0.25 µs (NMCE on-device) | 800–2,400× |
| Eviction decision latency | 200–600 µs (SW syscall) | ~50 ns (HEPC MMIO) | 4,000–12,000× |
| Prefix KV memory @ C=100 requests | 100 × 10.7 GB = 1.07 TB | 1 × 10.7 GB = 10.7 GB | 100× (C-fold) |
| GPU pipeline stalls from KV miss | ~1 per T1 miss (200–600 µs) | ~0 (async prefetch) | Eliminated |
The 12.8× context expansion alone changes the economics of LLM deployment. Today, serving a 128k context request at production scale means multiple H100s just for KV headroom. With a KV-CPU, a single H100 + a single KV-CPU card handles 256k context at meaningful batch concurrency — a cost structure that changes what applications are viable to build.
Patent Status and IP Strategy
The invention is protected by a provisional patent application filed with the Government of India Patent Office under the Patents Act, 1970.
| Field | Detail |
|---|---|
| Application Number | 202641056309 |
| Docket Number | 65779 |
| Reference Number | TEMP/E1/61503/2026-CHE |
| CBR Number | 37184 |
| Filing Office | Chennai Patent Office (CHO) |
| Filing Type | Provisional (Form 2A, Section 9(1)) |
| IPC Classifications | G06F 12/0811 · G06N 3/063 · G06F 13/16 · H01L 25/065 · G06F 9/4401 |
| PCT Window | 12 months from priority date (international filing option) |
The claim architecture uses a four-pillar strategy specifically designed to survive independent prosecution: Claims 1–4 (Pillar I, NMCE), Claims 5–8 (Pillar II, HEPC), Claims 9–12 (Pillar III, RTBD), Claims 13–15 (Pillar IV, Kernel Driver), Claims 16–17 (System). Each pillar has an independent claim that reads on a minimal embodiment — an NMCE-only device, an HEPC+RTBD-only device, or the full system. Even if any individual pillar faces prior art, the others stand.
The key novelty axes that distinguish the invention from prior art:
- vs HBM-PIM (Samsung, SK Hynix): HBM-PIM is a generic GEMV unit GPU-die-attached. It is stateless — no eviction, no prefetch, no request tagging. The NMCE is coupled to the KV lifecycle; it only makes sense in the context of the HEPC and RTBD feedback loop. Different device class, different inventive concept.
- vs CXL Type 3 memory expanders (Samsung CMM-H, Micron CZ120): Purely passive. Zero compute. No policy. No awareness of transformer decode semantics. Adding compute to a CXL expander does not constitute the invention without the closed-loop coupling.
- vs software KV eviction (vLLM, SGLang): Software cannot observe decode-step timing at hardware speed. The S component of the HEPC scoring function is categorically inaccessible to software running at syscall latency.
Open Source: The Complete Repository
Everything is public. The kernel driver, the RTL, the hardware specification, the MMIO blueprint, the packaging and thermal specs, and the UAPI header are all on GitHub.
github.com/manishklach/kv-cpu-driver

```bash
git clone https://github.com/manishklach/kv-cpu-driver.git
cd kv-cpu-driver
make
sudo insmod kv_cpu.ko mock=1

# Test the control plane
gcc -o tools/kvctl tools/kvctl.c -I include/uapi
./tools/kvctl step 1
./tools/kvctl hot 0x7f001000 65536
./tools/kvctl prefetch 0x7f100000 65536 256

# Check device
ls /dev/kvcpu*
dmesg | grep kv_cpu
```
The repository structure:
- `drivers/misc/kv_cpu/` — kernel driver (main, MMIO abstraction, ioctl dispatcher)
- `include/uapi/linux/kv_cpu.h` — stable UAPI ABI header
- `hardware/` — SystemVerilog RTL, MMIO spec, master spec, detailed HTML specs, thermal/packaging
- `tools/kvctl.c` — reference userspace CLI tool
- `docs/` — architecture overview, Linux integration map, defensibility analysis
What Comes Next
The repo establishes the control-plane architecture and hardware specification. The roadmap from here to a real deployment has four parallel tracks:
Track 1: Kernel RFC
The madvise extension hook — a small patch to mm/madvise.c that allows device drivers to register handlers for custom behavior codes — needs to go to the linux-mm mailing list. This is the most important upstream contribution because it is the mechanism by which the KV-CPU becomes a first-class Linux memory citizen rather than an ioctl-only device. The target is 6.13 or 6.14.
Track 2: vLLM Integration
A custom BlockAllocator subclass for vLLM that calls KV_CPU_SHARE_PREFIX on prefix registration and KV_CPU_STEP_ADVANCE on each decode iteration. This is a pure Python change on the runtime side — the KV-CPU device driver handles everything else. The benchmark target is reproducing the projected throughput-vs-context-length numbers above with empirical measurements on real hardware.
Track 3: Hardware Emulation
A QEMU CXL device that emulates the KV-CPU register interface, enabling the full driver to be exercised in CI without physical silicon. This unblocks continuous integration for the kernel driver and makes the mock mode testable at the hardware interface level rather than just the software API level.
Track 4: Silicon Partner
The architecture is TSMC-node-agnostic and the RTL is written to be synthesizable. A memory company (Micron, Samsung, SK Hynix) looking to differentiate their next-generation CXL product beyond passive capacity, or a startup looking to build focused AI infrastructure silicon, has a well-specified starting point here. The provisional patent window is 12 months from filing — international PCT filing is the next step for global protection.
The inference market is moving fast enough that a chip designed today and taped out in 18 months lands in a world where 256k context is table stakes and 1M context is the frontier. The KV-CPU is designed for that world.
If you're working on inference infrastructure, memory systems research, or CXL device development — the issues tracker is open, PRs are welcome, and the architecture is yours to build on. The patent protects the specific implementation; the problem it solves is big enough that the whole industry benefits from the ideas being public.