Deep Technical Post · May 2026

The Memory Chip LLM Inference Servers Don't Know They Need

Introducing the KV-CPU: a purpose-built CXL 3.0 device that implements a closed-loop hardware KV-cache lifecycle system — eliminating the GPU pipeline stalls and PCIe bandwidth walls that silently throttle every large-scale inference deployment today.

Patent Pending (India) · ~18 min read · Open Source: github.com/manishklach/kv-cpu-driver
Patent Pending — Government of India
Provisional Application filed under the Patents Act, 1970 · Docket No. 65779 · App. No. 202641056309
Reference: TEMP/E1/61503/2026-CHE · CBR: 37184 · Filed: CHO Patent Office

The Problem Nobody Is Talking About

The AI industry is obsessed with compute. More GPUs, faster chips, higher FLOPS. But the real bottleneck in production LLM inference — the thing that silently caps your throughput and balloons your cost — is not compute. It's memory orchestration.

Here's the arithmetic nobody wants to do publicly. A 70-billion parameter model running in FP16 needs to store its KV-cache — the accumulated key and value attention vectors from every token in the context — for as long as the request is active. The cache grows linearly with context length:

KV_memory(L) = 2 × N_layers × H_heads × D_head × L × sizeof(dtype)

Plug in a real model: 80 layers, 64 heads, head dimension 128, FP16. At a 128,000-token context — increasingly standard for agents, long-document tasks, and multi-turn sessions — that's 320 GB per active request. An H100 has 80 GB of HBM. You can do the math.
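
It is worth doing that math once, explicitly. A minimal standalone check (the model shape is the one stated above; nothing here depends on the KV-CPU itself):

KV-cache arithmetic — back-of-the-envelope check
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    /* 70B-class model as described above: 80 layers, 64 heads, D_head = 128, FP16. */
    const uint64_t n_layers = 80, n_heads = 64, d_head = 128;
    const uint64_t dtype_bytes = 2;          /* FP16 */
    const uint64_t context = 128000;         /* tokens */

    /* Per token: keys + values across every layer and head. */
    uint64_t per_token = 2 * n_layers * n_heads * d_head * dtype_bytes;
    uint64_t total = per_token * context;

    printf("%" PRIu64 " bytes/token (~2.5 MiB), %.1f GiB per request\n",
           per_token, total / (1024.0 * 1024.0 * 1024.0));
    /* Prints 2621440 bytes/token and ~312.5 GiB, i.e. roughly 320 GB per request. */
    return 0;
}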

320 GB · KV-cache per request at 128k context (70B FP16)
80 GB · Total HBM on an H100 — less than weights + one request
200–600 µs · Software KV eviction latency (syscall path)
~50 ns · KV-CPU hardware eviction trigger (MMIO write)

The existing solutions — vLLM's PagedAttention, TRT-LLM's chunked prefill, simple NVMe offload — are all software patches on a hardware problem. They share a fatal flaw: the operating system is semantically blind. The kernel's LRU page replacement policy has no idea what a "decode step" is. It doesn't know that a block of KV vectors accessed 300 decode steps ago will almost certainly be needed again when the model's attention sweeps back toward the beginning of the context. It evicts based on recency of memory access, not recency of transformer-semantic relevance.

The result: unnecessary evictions, cold cache misses on the hot path, and GPU compute pipelines stalling to wait for PCIe DMA transfers that should have been prefetched 100 milliseconds ago.

Existing KV offload is like asking a filing clerk to manage a nuclear reactor: they'll move papers when asked, but they have no idea which papers are critical or when you'll need them next.

The Key Insight: Semantic Memory

The insight behind the KV-CPU is simple to state and hard to execute: the hardware should understand the transformer decode loop.

Autoregressive decoding has a structure that LRU/LFU policies completely ignore. Every time the model generates a new token, it is at decode step t. Every KV block in the cache has a known last-access step. The distance between the current step and a block's last access step — the step proximity — is the single most predictive signal for whether that block will be needed in the next 50–200 steps. No software LRU policy has this signal. A hardware device that receives the current decode step via a single MMIO write does.

This is the fundamental shift the KV-CPU proposes: move KV-cache policy from a reactive software loop running on the host CPU to a predictive hardware policy engine that operates in silicon at nanosecond latency, informed by transformer inference semantics that the OS has historically never received.

Figure 1 — The KV-CPU four-tier memory hierarchy. T1 is the only tier with both memory capacity and on-device compute.

Architecture: A Closed-Loop System

The KV-CPU is not a memory expander with some compute bolted on. It is a closed-loop KV block lifecycle system: a single CXL 3.0 device on which four co-integrated components (the NMCE, Near-Memory Compute Engine; the HEPC, Hardware Eviction & Prefetch Controller; the RTBD, Request-Tagged Block Directory; and the kernel-driver control plane) form a continuous feedback loop that runs every decode step without any host CPU involvement.

Figure 2 — The closed-loop feedback architecture. NMCE output informs HEPC priority; HEPC drives RTBD tier placement; RTBD enables NMCE block location. One loop per decode step.

The loop works as follows. The GPU writes the current decode step t to a hardware register (a single 64-bit MMIO write, ~50 ns). This triggers the HEPC to scan all tracked KV blocks, recompute their priority scores, schedule low-priority blocks for async eviction to T2/T3, and proactively stage high-priority blocks into T1. Meanwhile, the GPU is already starting its next decode step — it never waits. The NMCE handles attention scoring for any blocks that land in T1, returning only scalar scores to the GPU across PCIe instead of raw key vectors. The loop closes when those score operations update the RTBD's access metadata, which feeds the next HEPC priority scan.
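
From the serving runtime's side, one iteration of that loop is a single register write followed by the normal kernel launch. Below is an illustrative sketch of the per-step hook, assuming an open descriptor on the KV-CPU character device and the KV_CPU_STEP_ADVANCE ioctl shown later in this post (the sub-100 ns mmap path is covered under Pillar IV):

Per-decode-step hook — runtime-side sketch
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kv_cpu.h>      /* UAPI header from the repository */

/* Called once per decode iteration, before the GPU decode kernel is launched.
 * fd is an open descriptor on the KV-CPU character device (e.g. /dev/kvcpu0). */
static inline void kv_cpu_on_decode_step(int fd, uint64_t t)
{
    struct kv_cpu_step_info step = { .step = t };
    ioctl(fd, KV_CPU_STEP_ADVANCE, &step);
    /* The HEPC scan, evictions, and prefetches now run asynchronously on the
     * device; the caller launches the decode kernel immediately and never waits. */
}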

The central inventive concept: It is not any single component but the coupling — NMCE output semantically informs HEPC priority; HEPC drives RTBD tier placement; RTBD lookup enables NMCE compute — that makes this architecture novel and hard to design around.

The four pillars are claimed independently in the patent, so each can be prosecuted on its own even if any individual pillar faces prior art:

Pillar I · NMCE — Near-Memory Compute Engine
Computes attention score dot products from key vectors resident in on-device LPDDR5X. Only scalar scores traverse PCIe. 25.6×–128× traffic reduction.

Pillar II · HEPC — Hardware Eviction & Prefetch Controller
Fixed-function FSM computing P(Bᵢ) = w_r·R + w_f·F + w_s·S + w_d·D in silicon. Decode-step-aware eviction triggered at ~50 ns via MMIO.

Pillar III · RTBD — Request-Tagged Block Directory
Hardware SRAM CAM with per-request isolation and hardware reference-counted prefix sharing. Saves C × prefix_KV_GB for C concurrent requests.

Pillar IV · Kernel Driver — Hardware Control Plane
A Linux kernel driver acting as the hardware control plane. Every madvise call, io_uring opcode, and ioctl terminates in a hardware register write to HEPC or RTBD silicon.

Pillar I: The NMCE — Rethinking Where Attention Happens

Today, when an inference server needs to compute attention scores for a KV block that has been offloaded from GPU HBM to T1 DRAM, it does the following: fetch the entire key block across PCIe (8,192 bytes for D=128, B=32, FP16), compute the dot products on the GPU, and discard the key data. The only thing the GPU actually needed from that transfer was a vector of 32 scalar scores — 64 bytes.

The NMCE eliminates the key vector transfer entirely. It is a fixed-function dot-product array embedded in the KV-CPU logic die, physically adjacent to the LPDDR5X memory controller. The GPU sends a query vector Q_h (256 bytes) and a memory address range. The NMCE fetches the key block from T1 over the on-package interconnect — never touching PCIe — computes the scores, and sends back 64 bytes of scalars.
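
That 25.6× figure falls directly out of these numbers: the 8,192-byte key transfer is replaced by a 256-byte query upload plus 64 bytes of returned scores, and 8,192 / (256 + 64) = 25.6. In the amortised case plotted on the right of Figure 3 below, where the Query Vector Buffer absorbs the query upload cost, only the 64 bytes of scores remain per scored block and the ratio approaches 8,192 / 64 = 128.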

Figure 3 — Left: Per-block traffic reduction (25.6× for single head). Right: Amortised reduction via the Query Vector Buffer as head count increases, approaching 128×.
KV-CPU UAPI — how an LLM runtime signals decode steps and block hints from userspace
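/* Assumes: the structs and KV_CPU_* ioctl numbers come from include/uapi/linux/kv_cpu.h,
 * and fd is an open descriptor on the KV-CPU character device (e.g. /dev/kvcpu0). */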
/* Signal step t=128 → triggers HEPC scan + NMCE pipeline setup */
struct kv_cpu_step_info step = { .step = 128 };
ioctl(fd, KV_CPU_STEP_ADVANCE, &step);

/* Mark prefix KV range as permanently hot */
struct kv_cpu_block_info hot = {
    .va  = (uint64_t)prefix_kv_ptr,
    .len = prefix_kv_len,
};
ioctl(fd, KV_CPU_MARK_HOT, &hot);

/* Hint that a specific block will be needed at step 256 */
struct kv_cpu_block_info pf = {
    .va          = (uint64_t)future_block_ptr,
    .len         = block_size,
    .target_step = 256,
};
ioctl(fd, KV_CPU_PREFETCH, &pf);

The NMCE is not a general-purpose PIM (Processing-In-Memory) unit. It does not run GEMM or generic vector operations. Its arithmetic circuit is specifically designed for scaled dot-product attention — query × key with 1/√D scaling and a piecewise-linear exp() approximation. This specificity is both a feature (efficiency) and a patent moat: a competitor adding generic ALUs to their memory chips does not replicate the NMCE without explicitly targeting transformer attention semantics.
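
To make that specificity concrete, here is a host-side reference model of the operation the NMCE performs for one (query, key-block) pair. It is a sketch under stated assumptions: FP16 is modeled as float, and expf() stands in for the hardware's piecewise-linear approximation, whose coefficients are not reproduced here.

NMCE scoring — software reference model (illustrative)
#include <math.h>

/* Score B keys of dimension D against one query: the softmax numerators only.
 * The hardware returns exactly these B scalars over PCIe; the key block never leaves T1. */
static void nmce_score_block(const float *q, const float *k_block,
                             int D, int B, float *score_out)
{
    const float scale = 1.0f / sqrtf((float)D);
    for (int i = 0; i < B; i++) {
        float dot = 0.0f;
        for (int d = 0; d < D; d++)
            dot += q[d] * k_block[(long)i * D + d];   /* query × key dot product */
        score_out[i] = expf(dot * scale);             /* hardware: approx_exp(), ±0.1% */
    }
}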

Pillar II: The HEPC — Making Eviction Transformer-Aware

The HEPC is the policy brain of the KV-CPU. At its core is a composite priority scoring formula computed in fixed-function combinational logic:

P(Bᵢ) = w_r · R(Bᵢ) + w_f · F(Bᵢ) + w_s · S(Bᵢ) + w_d · D(Bᵢ)   (mod 2¹⁶)

Each component captures a different dimension of KV block importance:

R(Bᵢ) — Recency. Signal: saturating counter, reset on access, decaying each step. Recently used blocks are likely to be used again soon — same intuition as LRU, but measured in decode steps, not wall-clock time.

F(Bᵢ) — Frequency. Signal: saturating counter, incremented on each access. Identifies the working set — system prompt blocks touched every step saturate quickly.

S(Bᵢ) — Step Proximity. Signal: max(0, W − (t_current − t_last_access)). The novel component: no LRU/LFU/ARC policy has this signal, because it requires knowing the current decode step from the GPU — impossible for any software-only eviction policy.

D(Bᵢ) — Prefix Dependency. Signal: binary flag from the RTBD is_prefix field, weighted maximally. Shared prefix blocks must survive eviction while any request references them — enforced in hardware, sub-100 ns.
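
A software model of the same scoring rule and eviction guard makes the components above concrete. This is a reference sketch only: the weights, the window W, and the threshold are illustrative parameters, and the real engine evaluates this in combinational logic for every RTBD entry (see the RTL section below).

HEPC priority scoring — software reference model (illustrative)
#include <stdbool.h>
#include <stdint.h>

/* Per-block state as tracked by the RTBD; field names follow the post, not the RTL. */
struct kv_block_state {
    uint16_t recency;       /* R: saturating counter, reset on access, decays per step */
    uint16_t frequency;     /* F: saturating counter, incremented on each access       */
    uint32_t last_access;   /* decode step of last access                              */
    bool     is_prefix;     /* D: shared-prefix flag                                   */
    uint8_t  ref_count;     /* RTBD reference count                                    */
};

/* P(B) = w_r·R + w_f·F + w_s·S + w_d·D, truncated to 16 bits (mod 2^16). */
static uint16_t hepc_priority(const struct kv_block_state *b, uint32_t t,
                              uint16_t w_r, uint16_t w_f, uint16_t w_s, uint16_t w_d,
                              uint32_t window)
{
    uint32_t age = t - b->last_access;
    uint32_t s   = (age < window) ? (window - age) : 0;          /* step proximity   */
    uint32_t p   = (uint32_t)w_r * b->recency
                 + (uint32_t)w_f * b->frequency
                 + (uint32_t)w_s * s
                 + (uint32_t)w_d * (b->is_prefix ? 1u : 0u);     /* prefix dependency */
    return (uint16_t)p;
}

/* Eviction guard from the RTL: a referenced prefix block is never evictable. */
static bool hepc_evictable(const struct kv_block_state *b, uint16_t p, uint16_t threshold)
{
    if (b->is_prefix && b->ref_count > 0)
        return false;
    return p < threshold;
}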
Figure 4 — R, F, S, and composite P(Bᵢ) across 120 decode steps for a block last accessed at t=0, 15, 30. The HEPC autonomously evicts when P falls below threshold — no CPU intervention.

Once the GPU writes to the step-advance register, the HEPC runs its three-phase cycle entirely in hardware — scan all RTBD entries, evict low-priority blocks via async DMA to T2/T3, prefetch high-priority blocks from T2/T3 to T1. The eviction and prefetch DMA engines run on separate channels so they never serialise. The GPU is already computing the next token. Nobody waits.

Why software can't replicate this: a software eviction policy running on the host CPU incurs a minimum of 200–600 µs per decision (kernel scheduling plus syscall overhead). The HEPC responds to a step-advance signal in ~50 ns — a 4,000× to 12,000× latency reduction. At 50 decode steps per second with 65,000 tracked KV blocks, a software policy would face on the order of 3 million priority evaluations per second on the syscall path and would spend nearly all its time on eviction decisions. The HEPC does it for free.

Pillar III: The RTBD — Hardware-Enforced Multi-Tenant Isolation

Production inference deployments are multi-tenant by default. Dozens or hundreds of concurrent requests share the same GPU, the same KV-CPU, and — critically — often the same system prompt prefix. Without hardware isolation, KV blocks from different requests can collide, corrupt each other's cache, or compete unfairly for eviction priority.

The RTBD is a fully-associative Content-Addressable Memory (CAM) in SRAM on the KV-CPU logic die, supporting up to 65,536 tracked KV blocks simultaneously. Each entry is a 240-bit descriptor covering: request_id (16b), layer_idx (8b), head_idx (8b), token range (64b), current tier location (2b: GPU/T1/T2/T3), physical address (64b), priority score (16b), prefix flag (1b), reference count (8b), access step (32b), and dirty bit (1b).
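
For readers who think in structs, a rough C picture of one directory entry follows. This is illustrative only: field names and ordering are mine, hardware bit-widths appear in the comments, and padding up to the 240-bit entry (and the 32-byte BAR2 slot described later) is omitted.

RTBD entry — illustrative struct view
#include <stdint.h>

struct rtbd_entry {
    uint16_t request_id;    /* 16b · 0x0000 = shared (prefix) block                  */
    uint8_t  layer_idx;     /*  8b                                                   */
    uint8_t  head_idx;      /*  8b                                                   */
    uint64_t token_range;   /* 64b · token span covered by this KV block             */
    uint8_t  tier;          /*  2b · current location: GPU / T1 / T2 / T3            */
    uint64_t phys_addr;     /* 64b                                                   */
    uint16_t priority;      /* 16b · latest HEPC score                               */
    uint8_t  is_prefix;     /*  1b                                                   */
    uint8_t  ref_count;     /*  8b · hardware reference count for prefix sharing     */
    uint32_t access_step;   /* 32b · decode step of last access                      */
    uint8_t  dirty;         /*  1b                                                   */
};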

The prefix sharing mechanism is the most commercially interesting feature. In a system with 100 concurrent requests all using the same 4,096-token system prompt, a naïve system stores 100 copies of ~10.7 GB of prefix KV vectors — 1.07 TB of T1 for prefix alone. The RTBD's hardware reference counting stores exactly one copy, tagged with request_id=0x0000 (shared) and is_prefix=1. The reference counter prevents the HEPC from evicting the block while any request holds a reference. 100 requests, 10.7 GB — a 100× T1 capacity saving on the single most memory-intensive data structure in production LLM serving.

RTBD interaction via UAPI
/* When a new request joins sharing a common prefix */
struct kv_cpu_block_info share = {
    .va  = (uint64_t)prefix_kv_block_pa,
    .len = prefix_block_size,
};
ioctl(fd, KV_CPU_SHARE_PREFIX, &share);
/* → Hardware increments ref_count; HEPC cannot evict this block */

/* When request terminates */
ioctl(fd, KV_CPU_EVICT, &share);
/* → Hardware decrements ref_count; eviction eligible only when ref_count == 0 */

Pillar IV: The Kernel Driver — Hardware Control Plane

The Linux kernel driver (kv_cpu.ko) is Pillar IV — and it is the piece that transforms the KV-CPU from an interesting chip idea into a deployable system. It is also the piece that people most frequently misunderstand.

The driver is not a software invention. It is a hardware control plane. Here is the critical distinction: every API the driver exposes — every ioctl, every madvise call, every io_uring opcode — terminates in a write to a hardware register in the KV-CPU silicon. The HEPC priority scoring cannot function correctly without receiving the decode step via the step-advance register write. The RTBD eviction guard for prefix blocks cannot activate without the RTBD_SHARE register command. Remove the driver and the hardware loop cannot close. Remove the hardware and the driver is meaningless. They are co-dependent.
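
To make "terminates in a hardware register write" tangible, the step-advance ioctl path boils down to something like the sketch below. It is not the repository's code: the register offset, struct layout, and error handling here are placeholder assumptions.

Kernel-side step-advance path — hedged sketch
#include <linux/io.h>
#include <linux/uaccess.h>

#define KV_CPU_REG_STEP_ADVANCE  0x0010     /* placeholder BAR0 offset */

struct kv_cpu_dev {
    void __iomem *bar0;                     /* mapped HEPC control/status registers */
};

/* KV_CPU_STEP_ADVANCE: copy the decode step from userspace, then issue the single
 * 64-bit MMIO write that starts the HEPC scan. Nothing else happens on the host. */
static long kv_cpu_step_advance(struct kv_cpu_dev *dev, const void __user *arg)
{
    u64 step;

    if (copy_from_user(&step, arg, sizeof(step)))
        return -EFAULT;

    writeq(step, dev->bar0 + KV_CPU_REG_STEP_ADVANCE);
    return 0;
}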

Figure 5 — The system software stack. Every API call at the userspace/kernel boundary terminates in a hardware register write to HEPC or RTBD silicon.

The driver in the repository under drivers/misc/kv_cpu/ implements the full control plane: the ioctl interface (step advance, hot/evict/prefetch hints, and prefix sharing), the madvise extension hook, the io_uring opcodes, and the BAR0 mmap window for the low-latency step-advance path.

kvctl — the reference userspace tool
# Load the driver (mock mode — no hardware required)
sudo insmod kv_cpu.ko mock=1

# Signal decode step 128 → triggers HEPC hardware scan
./tools/kvctl step 128

# Mark a KV block range as hot (protect from eviction)
./tools/kvctl hot 0x7f001000 4096

# Schedule immediate eviction
./tools/kvctl evict 0x7f005000 4096

# Hint that a block will be needed at step 256
./tools/kvctl prefetch 0x7f009000 4096 256

# Mark a prefix range as shared across requests
./tools/kvctl share 0x7f010000 10737418240

The driver is structured for upstreaming. The include/uapi/linux/kv_cpu.h header defines the stable ABI contract between userspace LLM runtimes and the kernel, following Linux UAPI conventions. The source lives under drivers/misc/kv_cpu/ — the correct location for a new device class pending its own drivers/kvcpu/ subsystem.

Low-latency step-advance path: For the absolute hot path — GPU-side processes that need sub-100 ns step-advance without any syscall overhead — the driver exposes a BAR0 mmap window. A userspace process can memory-map the HEPC control registers and write the step-advance register directly. On real hardware this is a single PCIe TLP write with end-to-end latency under 100 ns.
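
A userspace sketch of that hot path, assuming the driver exposes the BAR0 register window at offset 0 of the char device's mmap space and that the step-advance register sits at the placeholder offset used above (both are assumptions, not the repository's documented ABI):

Sub-100 ns step advance via BAR0 mmap — hedged sketch
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define KV_CPU_REG_STEP_ADVANCE 0x0010      /* placeholder offset within BAR0 */

int main(void)
{
    int fd = open("/dev/kvcpu0", O_RDWR);
    if (fd < 0)
        return 1;

    /* Map the BAR0 control-register window the driver exposes. */
    volatile uint64_t *bar0 = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED)
        return 1;

    /* Hot path: no syscall, just one posted 64-bit MMIO write per decode step. */
    bar0[KV_CPU_REG_STEP_ADVANCE / sizeof(uint64_t)] = 128;   /* step t = 128 */

    munmap((void *)bar0, 4096);
    close(fd);
    return 0;
}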

Hardware Design: From Concept to Silicon Specification

The repository includes not just the kernel driver but a full hardware specification covering RTL, MMIO address maps, packaging, and thermal design. This is unusual for an open-source patent-pending project — and intentional. The goal is to provide enough detail that a chip company, foundry partner, or university research group could begin an implementation with this repo as the starting point.

Figure 6 — The hepc_priority_engine SystemVerilog module. The step-advance trigger feeds the priority scorer; DMA triggers fire on threshold comparisons, with guard logic blocking eviction of referenced prefix blocks.

The SystemVerilog RTL files in hardware/ implement the two most complex components:

kv_cpu_hepc_scorer.sv — The HEPC priority engine. A purely combinational design: step_diff = current_step - last_access_step, then S_component = (step_diff < 255) ? (255 - step_diff) : 0, then the composite score p_score = (w_r*R) + (w_f*F) + (w_s*S) + (w_d*D_flag). The DMA trigger logic includes the guard: if (is_prefix && ref_count > 0) trigger_evict_dma = 0. This guard is critical — it is the hardware implementation of the RTBD's eviction protection for shared prefix blocks.

kv_cpu_nmce_dpu.sv — The NMCE dot-product unit array. A 128-wide systolic MAC array parameterized over head dimension D and block size B. Computes score[i] = approx_exp(dot_product(Q, K[i]) * scale_factor) for all B key vectors in a single pipeline pass. The approx_exp() function is a piecewise-linear approximation of sufficient accuracy (±0.1% relative error) for attention score ranking.

The silicon specification targets:

Process node: TSMC N5 or N7 FinFET
NMCE DPU array: 128× FP16/BF16 MAC units · 256 GFLOPS on-device
RTBD CAM: 65,536 entries · 240 bits per entry · SRAM
T1 memory: 128–512 GB LPDDR5X · 4× 32-bit channels · 256 GB/s
Host interface: CXL 3.0 / PCIe 5.0 (x16 · 64 GB/s)
TDP: 20–28 W active · 8–12 W idle · 3 W standby
Core voltage: 0.75 V Vdd
Form factor: PCIe HHHL or CXL EDSFF E3.S
Step-advance latency: ~50 ns (MMIO write → HEPC scan start)

The MMIO address space uses a clean dual-BAR layout: BAR0 for Control/Status registers (HEPC lifecycle, DMA management, NMCE descriptor submission at 0x0000–0x2FFF), and BAR2 for the RTBD TAG STORE — a 2 MB direct-access SRAM window containing all 65,536 hardware CAM entries at 32-byte alignment.
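
The 2 MB figure on BAR2 follows directly from that layout: 65,536 entries at a 32-byte stride is exactly 2 MiB, and any entry's offset is a single 5-bit shift of its index. A minimal sketch of the addressing arithmetic (the stride and entry count are from the spec above; everything else is illustrative):

BAR2 RTBD tag-store addressing — illustrative
#include <stdint.h>

#define RTBD_ENTRIES       65536u
#define RTBD_ENTRY_STRIDE  32u        /* bytes: 240-bit entry, 32-byte aligned */

/* Byte offset of RTBD entry i inside the BAR2 window.
 * 65,536 × 32 B = 2,097,152 B = 2 MiB, the full tag-store window. */
static inline uint32_t rtbd_entry_offset(uint32_t i)
{
    return i * RTBD_ENTRY_STRIDE;     /* equivalently, i << 5 */
}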

Performance Impact: The Numbers

Figure 7 — Projected token throughput vs context length (70B model, 1× H100). Illustrative based on memory tier latency analysis — not empirical measurements from physical hardware.
(Each line: metric · without KV-CPU → with KV-CPU · improvement.)

Max context length (70B, 1× H100): ~20k tokens (HBM OOM) → ~256k tokens (T1-bound) · 12.8× expansion
PCIe traffic per KV block scored: 8,192 bytes → 320 bytes · 25.6× reduction
Multi-head amortised traffic reduction: up to 128× via the QVB
KV block score latency (T1): 200–600 µs (SW + DMA) → ~0.25 µs (NMCE on-device) · 800–2,400×
Eviction decision latency: 200–600 µs (SW syscall) → ~50 ns (HEPC MMIO) · 4,000–12,000×
Prefix KV memory @ C=100 requests: 100 × 10.7 GB = 1.07 TB → 1 × 10.7 GB = 10.7 GB · 100× (C-fold)
GPU pipeline stalls from KV miss: ~1 per T1 miss (200–600 µs) → ~0 (async prefetch) · eliminated

The 12.8× context expansion alone changes the economics of LLM deployment. Today, serving a 128k context request at production scale means multiple H100s just for KV headroom. With a KV-CPU, a single H100 + a single KV-CPU card handles 256k context at meaningful batch concurrency — a cost structure that changes what applications are viable to build.

Patent Status and IP Strategy

The invention is protected by a provisional patent application filed with the Government of India Patent Office under the Patents Act, 1970.

Application Number: 202641056309
Docket Number: 65779
Reference Number: TEMP/E1/61503/2026-CHE
CBR Number: 37184
Filing Office: Chennai Patent Office (CHO)
Filing Type: Provisional (Form 2A, Section 9(1))
IPC Classifications: G06F 12/0811 · G06N 3/063 · G06F 13/16 · H01L 25/065 · G06F 9/4401
PCT Window: 12 months from priority date (international filing option)

The claim architecture uses a four-pillar strategy specifically designed to survive independent prosecution: Claims 1–4 (Pillar I, NMCE), Claims 5–8 (Pillar II, HEPC), Claims 9–12 (Pillar III, RTBD), Claims 13–15 (Pillar IV, Kernel Driver), Claims 16–17 (System). Each pillar has an independent claim that reads on a minimal embodiment — an NMCE-only device, an HEPC+RTBD-only device, or the full system. Even if any individual pillar faces prior art, the others stand.

The key novelty axes that distinguish the invention from prior art map directly onto the pillars: attention-specific near-memory scoring rather than generic PIM (NMCE), decode-step-aware eviction that no software or generic caching policy can see (HEPC), hardware reference-counted prefix sharing with per-request isolation (RTBD), a driver whose every API call terminates in a silicon register write (Pillar IV), and the closed-loop coupling of all four.

Open Source: The Complete Repository

Everything is public. The kernel driver, the RTL, the hardware specification, the MMIO blueprint, the packaging and thermal specs, and the UAPI header are all on GitHub.

github.com/manishklach/kv-cpu-driver
Quick start — load in mock mode (no hardware required)
git clone https://github.com/manishklach/kv-cpu-driver.git
cd kv-cpu-driver
make
sudo insmod kv_cpu.ko mock=1

# Test the control plane
gcc -o tools/kvctl tools/kvctl.c -I include/uapi
./tools/kvctl step 1
./tools/kvctl hot 0x7f001000 65536
./tools/kvctl prefetch 0x7f100000 65536 256

# Check device
ls /dev/kvcpu*
dmesg | grep kv_cpu

The repository structure follows the paths referenced throughout this post: drivers/misc/kv_cpu/ (the kernel driver), include/uapi/linux/kv_cpu.h (the stable UAPI header), tools/kvctl.c (the reference userspace tool), and hardware/ (the SystemVerilog RTL and the silicon specification).

What Comes Next

The repo establishes the control-plane architecture and hardware specification. The roadmap from here to a real deployment has four parallel tracks:

Track 1: Kernel RFC

The madvise extension hook — a small patch to mm/madvise.c that allows device drivers to register handlers for custom behavior codes — needs to go to the linux-mm mailing list. This is the most important upstream contribution because it is the mechanism by which the KV-CPU becomes a first-class Linux memory citizen rather than an ioctl-only device. The target is 6.13 or 6.14.

Track 2: vLLM Integration

A custom BlockAllocator subclass for vLLM that calls KV_CPU_SHARE_PREFIX on prefix registration and KV_CPU_STEP_ADVANCE on each decode iteration. This is a pure Python change on the runtime side — the KV-CPU device driver handles everything else. The benchmark target is reproducing the throughput-vs-context-length chart above with empirical measurements on real hardware.

Track 3: Hardware Emulation

A QEMU CXL device that emulates the KV-CPU register interface, enabling the full driver to be exercised in CI without physical silicon. This unblocks continuous integration for the kernel driver and makes the mock mode testable at the hardware interface level rather than just the software API level.

Track 4: Silicon Partner

The architecture is TSMC-node-agnostic and the RTL is written to be synthesizable. A memory company (Micron, Samsung, SK Hynix) looking to differentiate their next-generation CXL product beyond passive capacity, or a startup looking to build focused AI infrastructure silicon, has a well-specified starting point here. The provisional patent window is 12 months from filing — international PCT filing is the next step for global protection.

The inference market is moving fast enough that a chip designed today and taped out in 18 months lands in a world where 256k context is table stakes and 1M context is the frontier. The KV-CPU is designed for that world.

If you're working on inference infrastructure, memory systems research, or CXL device development — the issues tracker is open, PRs are welcome, and the architecture is yours to build on. The patent protects the specific implementation; the problem it solves is big enough that the whole industry benefits from the ideas being public.