KV ASIC Part 3: The Chip

Dedicated silicon is not an inevitable step — it's a specific bet. The bet is that the KV workload is stable enough to justify fixed-function hardware, and distinct enough from general compute that a GPU or CPU can't serve it efficiently. Before that bet is made, you need to know exactly what the chip is doing.

This piece walks through the KV ASIC block by block: the die floorplan, the internal data paths, the schematic for each functional unit, and a grounded estimate of the performance uplift against a baseline software-only stack.

The Die Floorplan

The floorplan is designed around one constraint: DMA engines must be adjacent to both the fabric interface and the memory controllers, because every KV transfer must pass through them. The diagram below shows what each block does and why it sits where it does.

KV Infrastructure ASIC: Interactive Floorplan

Block-Level Schematics

Each functional block has a distinct internal architecture. The four most architecturally interesting units are the DMA engine, prefix dedup, compression, and eviction.

Performance Projections

These numbers are modelled estimates based on published hardware specs, comparable workloads, and conservative architectural assumptions. They are not silicon measurements — treat them as the upper envelope of what well-engineered dedicated hardware could achieve against a software-only baseline running on a CPU-managed DRAM hierarchy.

Baseline Definition

Baseline = vLLM with PagedAttention on 8× H100 SXM, KV managed in host DRAM via CPU, PCIe Gen4 interconnect. No prefix dedup. No hardware compression. Eviction via LRU.

Metric	Projected change	Basis
Decode throughput gain	2.4–3.1×	vs. baseline at 128k context, 100 sessions. Primary driver: prefix dedup eliminates ~70% of redundant HBM allocations.
TTFT (time-to-first-token)	−58%	Prefetch scheduler hides CXL read latency behind prompt processing. Cold-start penalty drops from ~120 ms to ~50 ms median.
HBM footprint reduction	4–8×	INT4 compression on low-volatility heads. Allows 4–8× more concurrent sessions per GPU node before HBM spill.
KV movement energy	−70%	Compression reduces bytes transferred. Dedup avoids transfers entirely for shared prefixes. DMA coalescing reduces PCIe overhead.
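A back-of-envelope calculation helps put the footprint figure in context. The sketch below computes the per-session KV size for a hypothetical model shape (80 layers, 8 KV heads, head dimension 128; these values are illustrative and not taken from this article) at FP16 and at INT4, and then for an assumed split in which 75% of heads are compressed. Quantization scale metadata is ignored.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bits: int) -> float:
    """Bytes of KV state per token: one K and one V vector per head per layer."""
    return 2 * layers * kv_heads * head_dim * bits / 8


def session_gib(tokens: int, layers: int, kv_heads: int, head_dim: int, bits: int) -> float:
    """KV footprint of one session in GiB at a uniform precision."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bits) / 2**30


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT_TOKENS = 128 * 1024

fp16_gib = session_gib(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bits=16)
int4_gib = session_gib(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bits=4)

# Assumed fraction of "low-volatility" heads stored as INT4; the rest stay FP16.
INT4_FRACTION = 0.75
mixed_gib = INT4_FRACTION * int4_gib + (1 - INT4_FRACTION) * fp16_gib

print(f"FP16: {fp16_gib:.1f} GiB, INT4: {int4_gib:.1f} GiB, mixed: {mixed_gib:.1f} GiB")
print(f"mixed reduction vs FP16: {fp16_gib / mixed_gib:.2f}x")
```

Under these assumptions a 128k-token session occupies 40 GiB at FP16 and 10 GiB at INT4, so INT4 alone tops out at 4× relative to an FP16 baseline. One reading is that the upper end of a 4–8× range depends on combining compression with prefix dedup; that is an inference from the arithmetic, not something the projections state.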

Throughput Comparison by Workload

Estimated tokens/second per GPU node, relative to software baseline (1.0×). KV ASIC offload assumes the chip sits on a PCIe Gen5 slot adjacent to each GPU.

Workload	Baseline	With KV ASIC offload (estimated)
Short ctx, low concurrency	1.0×	1.3×
32k ctx, 100 sessions	1.0×	1.9×
128k ctx, 100 sessions	1.0×	3.0×
Agentic, shared prefixes	1.0×	>3.5×

The gain amplifies with context length and prefix reuse rate.

Approximate Power Budget

Estimated TDP for a production KV ASIC at 7nm. Comparison point: H100 SXM TDP ≈ 700W.

Block	Estimated power
DMA Engines (×32)	35W
Compression array (×16)	40W
Memory controllers	25W
Metadata SRAM + dedup	20W
Fabric interface + misc	18W
Estimated total TDP	~138W

That is ~20% of H100 power, which enables 1 KV ASIC per 4 GPUs in standard rack power budgets.
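The budget above is easy to sanity-check. The short sketch below sums the listed blocks, compares the total to the H100 SXM figure quoted earlier, and amortizes it over the 1:4 ASIC-to-GPU ratio. The variable names are illustrative, and the "node overhead" line counts GPU TDP only, which is a simplification.

```python
# Per-block estimates as listed in the power budget (watts).
POWER_BUDGET_W = {
    "DMA engines (x32)": 35,
    "Compression array (x16)": 40,
    "Memory controllers": 25,
    "Metadata SRAM + dedup": 20,
    "Fabric interface + misc": 18,
}
H100_SXM_TDP_W = 700   # comparison point quoted in the text
GPUS_PER_ASIC = 4      # the 1 ASIC per 4 GPUs ratio quoted in the text

total_w = sum(POWER_BUDGET_W.values())
share_of_h100 = total_w / H100_SXM_TDP_W
added_w_per_gpu = total_w / GPUS_PER_ASIC
# Extra power relative to the GPUs the ASIC serves (GPU TDP only).
node_overhead = total_w / (GPUS_PER_ASIC * H100_SXM_TDP_W)

print(f"total: {total_w} W ({share_of_h100:.1%} of one H100)")
print(f"amortized: {added_w_per_gpu:.1f} W per GPU ({node_overhead:.1%} overhead)")
```

Amortized this way, the ASIC adds roughly 5% on top of the GPU power it serves, which is the more relevant number for rack planning than the 20% single-GPU comparison.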

Comparison: ASIC vs. Software-Only vs. CXL-Only

Capability	Software Only	CXL Expansion	KV ASIC
Prefix deduplication	SW, high CPU cost	None	HW, <100ns
KV compression	SW, latency hit	None	HW, inline, 1.2 TB/s
Eviction policy	LRU / custom SW	Generic LRU	Decode-urgency-aware
Prefetch scheduling	None / manual	None	Urgency model, HW
Topology awareness	None	Partial (CXL only)	Full hierarchy
Session isolation / QoS	SW, coarse	None	Per-session SLA
Observability	SW tracing	None	HW perf counters
Complexity / risk	Low	Low	High (new silicon)

The gains are real — but they compound only when the workload demands them. At short context and low concurrency, the ASIC is an expensive NIC. At 128k context and high concurrency, it may be the difference between viable and not.

— architectural tradeoff summary

Memory Timing Deep Dive

Every claim about latency hides a ladder of sub-operations. Here is what actually happens, in order, when a GPU decode step requests a KV block that has been demoted to CXL:

KV Block Retrieval — Full Timing Breakdown (CXL warm tier)

Two microseconds is a long time at GPU speeds: at its peak dense FP16 rate of roughly 1 PFLOP/s, an H100 can execute on the order of 2 billion FLOPs in that window. The prefetch scheduler's job is to ensure that the CXL read (steps 5–6, the longest segment) is in flight before the GPU finishes the preceding decode step. When it succeeds, the effective latency seen by the GPU is just the final DMA write: ~400 ns. That is the difference between a stall and transparency.

The NVMe cold tier is worse by roughly 2 orders of magnitude — a full NVMe read adds ~100 µs. This is why archived sessions must be explicitly warmed into CXL before a session resumes, not fetched on-demand. The prefetch scheduler handles this: when a session is scheduled to resume, a warm-up prefetch is issued 500 µs in advance.
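The timing figures above combine into a simple expected-latency model. The sketch below assumes only two outcomes per fetch: the prefetch lands in time (the GPU sees ~400 ns) or it does not (the GPU sees the full ~2 µs). Real behaviour will have a wider distribution, and the function and constant names are illustrative.

```python
FULL_CXL_FETCH_NS = 2_000   # unhidden CXL retrieval, from the timing breakdown
HIDDEN_FETCH_NS = 400       # final DMA write only, when the prefetch succeeds
NVME_READ_US = 100          # approximate cold-tier read
WARMUP_LEAD_US = 500        # how far ahead the warm-up prefetch is issued


def expected_fetch_ns(prefetch_hit_rate: float) -> float:
    """Mean latency seen by the GPU under a two-outcome hit/miss model."""
    if not 0.0 <= prefetch_hit_rate <= 1.0:
        raise ValueError("hit rate must be in [0, 1]")
    miss_rate = 1.0 - prefetch_hit_rate
    return prefetch_hit_rate * HIDDEN_FETCH_NS + miss_rate * FULL_CXL_FETCH_NS


for rate in (0.0, 0.75, 0.92, 1.0):
    print(f"hit rate {rate:.0%}: {expected_fetch_ns(rate):.0f} ns expected")

# Slack between issuing the warm-up and needing the data, if one NVMe read suffices.
warmup_slack_us = WARMUP_LEAD_US - NVME_READ_US
print(f"warm-up slack: {warmup_slack_us} us")
```

The 500 µs lead leaves about 400 µs of slack over a single ~100 µs NVMe read, which suggests the lead time is sized for queueing and multi-block warm-ups and not for one isolated read.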

The Programming Model

Hardware is only as useful as the software contract above it. The KV ASIC exposes four surfaces to the inference stack:

KV ASIC Programming Model — Interface Layers

The key design decision in the API is that the inference runtime speaks in semantics, not addresses. It calls kv_block_fetch(session, layer, head_range, token_span) — and the ASIC resolves that semantic request to a physical address, selects the tier, initiates the DMA, applies decompression if needed, and returns a fence. The runtime never knows or cares whether the block came from HBM, CXL, or NVMe.

This separation is load-bearing. It means the eviction policy, compression strategy, and tier placement can all be changed — or learned — without modifying the inference runtime. The ASIC is the policy engine; the runtime is the policy consumer.
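To make the separation concrete, here is a minimal mock of the contract. Only the name kv_block_fetch and its four parameters come from the text; the Tier enum, the Fence class, MockKVDevice, and the store and demote methods are hypothetical stand-ins, and a real fence would block on DMA completion instead of returning immediately.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple

Span = Tuple[int, int]
BlockKey = Tuple[int, int, Span, Span]  # (session, layer, head_range, token_span)


class Tier(Enum):
    HBM = "hbm"
    CXL = "cxl"
    NVME = "nvme"


@dataclass(frozen=True)
class Fence:
    """Completion handle. A real fence would block until the DMA lands."""
    payload: bytes

    def wait(self) -> bytes:
        return self.payload


class MockKVDevice:
    """Toy stand-in for the ASIC: it owns placement and hides it from callers."""

    def __init__(self) -> None:
        self._blocks: Dict[BlockKey, Tuple[Tier, bytes]] = {}

    def store(self, key: BlockKey, data: bytes, tier: Tier = Tier.HBM) -> None:
        self._blocks[key] = (tier, data)

    def demote(self, key: BlockKey, tier: Tier) -> None:
        """Policy-side move between tiers; invisible to the runtime."""
        _, data = self._blocks[key]
        self._blocks[key] = (tier, data)

    def kv_block_fetch(self, session: int, layer: int,
                       head_range: Span, token_span: Span) -> Fence:
        """Semantic fetch: the caller names what it wants, never where it lives."""
        _tier, data = self._blocks[(session, layer, head_range, token_span)]
        # Tier selection, DMA, and decompression would happen here in hardware.
        return Fence(data)


dev = MockKVDevice()
key: BlockKey = (7, 0, (0, 8), (0, 16))
dev.store(key, b"kv-bytes")
before = dev.kv_block_fetch(*key).wait()
dev.demote(key, Tier.CXL)          # policy change on the device side
after = dev.kv_block_fetch(*key).wait()
print(before == after)             # the runtime's view is unchanged
```

The point of the mock is that demote changes placement without changing anything the caller can observe, which is what lets eviction and tiering policy evolve independently of the runtime.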

Session Lifecycle

Every KV session passes through five states. The ASIC enforces valid transitions in hardware — an invalid transition (e.g. fetching a block from an ARCHIVED session without first issuing a warm-up) returns an error code and a DMA fence that resolves to a zero block, preventing silent corruption.
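The enforcement rule can be sketched as a small transition table. Only DECODE and ARCHIVED are named in the text; PREFILL, IDLE, and WARMING are placeholder names invented here to reach five states, and the allowed-transition table and error codes are likewise assumptions, not the ASIC's actual state machine.

```python
from enum import Enum, auto
from typing import Dict, Set, Tuple


class State(Enum):
    PREFILL = auto()    # placeholder name (assumption)
    DECODE = auto()     # named in the text
    IDLE = auto()       # placeholder name (assumption)
    ARCHIVED = auto()   # named in the text
    WARMING = auto()    # placeholder for "warm-up issued" (assumption)


# Hypothetical transition table; the real one lives in the state diagram.
ALLOWED: Dict[State, Set[State]] = {
    State.PREFILL: {State.DECODE},
    State.DECODE: {State.IDLE},
    State.IDLE: {State.DECODE, State.ARCHIVED},
    State.ARCHIVED: {State.WARMING},
    State.WARMING: {State.DECODE},
}

ERR_OK, ERR_INVALID_STATE = 0, 1  # illustrative error codes


class Session:
    def __init__(self) -> None:
        self.state = State.PREFILL

    def transition(self, target: State) -> int:
        """Apply a transition if the table allows it; otherwise leave state as-is."""
        if target not in ALLOWED[self.state]:
            return ERR_INVALID_STATE
        self.state = target
        return ERR_OK

    def fetch(self, block: bytes) -> Tuple[int, bytes]:
        """Fetching from ARCHIVED yields an error code plus a zero block."""
        if self.state is State.ARCHIVED:
            return ERR_INVALID_STATE, bytes(len(block))
        return ERR_OK, block


s = Session()
for target in (State.DECODE, State.IDLE, State.ARCHIVED):
    s.transition(target)
print(s.fetch(b"\x01\x02"))            # error code and zeroed block
print(s.transition(State.DECODE))      # rejected: no warm-up was issued
print(s.transition(State.WARMING), s.transition(State.DECODE))
```

Note that the zero block only prevents corruption if the caller checks the error code; a runtime that ignores it would decode against zeros, so the code and the fence need to be consumed together.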

Session State Machine

Signal Integrity & Physical Constraints

At 7nm with PCIe Gen5 signalling, the chip's physical design has non-trivial constraints. These aren't theoretical — they determine what the floorplan can and cannot look like.

Interface	Signal rate	Constraint	Mitigation
PCIe Gen5	32 GT/s per lane	Trace length matching ±5 mil, EQ training on link-up	On-die retimer, adaptive equaliser
CXL 3.0	32 GT/s (shared PHY with PCIe)	Protocol negotiation adds ~10 µs at link-up; cache coherency protocol overhead	Dedicated CXL arbiter; FLIT-mode to reduce protocol overhead by ~40%
HBM3e	~1.2 TB/s per stack	Must be within 2mm of die for signal integrity; bump pitch 55 µm	2.5D interposer (TSMC CoWoS or equivalent); up to 4 HBM stacks
DDR5	~51 GB/s per channel at 6400 MT/s	Maximum trace length 5 inches at 6400 MT/s; power delivery noise	On-package decoupling; up to 8 channels (~410 GB/s aggregate)
NVMe P2P DMA	PCIe Gen4×4 per SSD	P2P DMA bypasses host CPU/DRAM; requires IOMMU passthrough	BAR-to-BAR DMA; DMA completion interrupt via MSI-X

The HBM constraint is the most significant. Placing HBM stacks on a 2.5D interposer (CoWoS-style) adds ~$800–1,200 per unit to packaging cost, but it is the only practical way to achieve the ~3 TB/s bandwidth the active decode tier needs. Without it, the chip's hot tier degrades to DDR5 speeds (a few hundred GB/s across 8 channels), which is insufficient for high-concurrency long-context serving.

An alternative design — using HBM only on the GPU and serving the ASIC exclusively from CXL — is viable at lower concurrency but collapses under 128k-context loads where the GPU's HBM fills entirely with active decode state and the ASIC has nowhere to stage warm blocks fast enough.

Multi-Chip Topology

A single KV ASIC can serve 2–4 GPUs depending on context length and concurrency. In a large cluster, multiple ASICs form a fabric — and the KV metadata must be coordinated across them. Here is the reference cluster topology for a 32-GPU node:

32-GPU Node — KV ASIC Cluster Topology

The metadata coordinator is the most architecturally sensitive component in the cluster. It maintains a distributed registry of all live prefix blocks so that a session on ASIC 3 can reuse a prefix block that was first created by ASIC 7. Without it, prefix deduplication is local-only and the gain degrades from cluster-wide to per-chip.
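What "reusing a prefix block" means can be shown with a small content-addressed store. The sketch assumes chained block hashing, where each block's hash covers its own tokens plus the hash of the block before it, which is similar in spirit to software prefix caches. The article does not specify the ASIC's hashing scheme, and the 4-token block size and all names here are illustrative.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_TOKENS = 4  # illustrative block size


def block_hashes(tokens: Sequence[int]) -> List[str]:
    """Chained hashes over full blocks; a trailing partial block is skipped."""
    hashes: List[str] = []
    parent = ""
    full_len = len(tokens) - len(tokens) % BLOCK_TOKENS
    for i in range(0, full_len, BLOCK_TOKENS):
        chunk = ",".join(str(t) for t in tokens[i:i + BLOCK_TOKENS])
        parent = hashlib.sha256(f"{parent}|{chunk}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class DedupStore:
    """Reference-counted registry of prefix blocks keyed by chained hash."""

    def __init__(self) -> None:
        self.refcount: Dict[str, int] = {}

    def admit(self, tokens: Sequence[int]) -> int:
        """Register a session's blocks; return how many had to be newly allocated."""
        newly_allocated = 0
        for h in block_hashes(tokens):
            if h not in self.refcount:
                self.refcount[h] = 0
                newly_allocated += 1
            self.refcount[h] += 1
        return newly_allocated


system_prompt = list(range(8))                       # shared 2-block prefix
store = DedupStore()
new_a = store.admit(system_prompt + [100, 101, 102, 103])
new_b = store.admit(system_prompt + [200, 201, 202, 203])
print(new_a, new_b, len(store.refcount))
```

The second session allocates only its one unique block. Whether that sharing happens across chips depends entirely on whether both ASICs can see the same registry, which is the coordinator's role.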

The coordinator uses Raft consensus across 3 replicas, so a write commits once a majority (2 of 3) has acknowledged it. Typical write latency is ~200 µs at 10 Gbps interconnect, which is fast enough for session creation (which happens once) but too slow for per-block lookup. Block lookups therefore use a local SRAM cache of the prefix registry, refreshed asynchronously from the coordinator. Cache staleness is bounded to 1 ms, which means a prefix created in the last millisecond may not yet be shared cluster-wide. That is an acceptable tradeoff for near-zero lookup latency.
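The bounded-staleness tradeoff can be illustrated with a small cache model. This is a simplified sketch: the refresh here is a synchronous poll driven by an injected clock, whereas the text describes an asynchronous refresh, and LocalPrefixCache, fetch_snapshot, and maybe_refresh are hypothetical names.

```python
from typing import Callable, Dict, Optional

STALENESS_BOUND_S = 1e-3  # the 1 ms bound quoted in the text


class LocalPrefixCache:
    """Local copy of the prefix registry, refreshed at most once per bound."""

    def __init__(self, fetch_snapshot: Callable[[], Dict[str, int]],
                 clock: Callable[[], float]) -> None:
        self._fetch_snapshot = fetch_snapshot
        self._clock = clock
        self._entries: Dict[str, int] = {}
        self._refreshed_at = float("-inf")

    def maybe_refresh(self) -> bool:
        """Pull a new snapshot if the current one has reached the staleness bound."""
        now = self._clock()
        if now - self._refreshed_at >= STALENESS_BOUND_S:
            self._entries = dict(self._fetch_snapshot())
            self._refreshed_at = now
            return True
        return False

    def lookup(self, prefix_hash: str) -> Optional[int]:
        """Local-only lookup; never waits on the coordinator."""
        return self._entries.get(prefix_hash)


coordinator_registry: Dict[str, int] = {"sys-prompt": 1}
now = [0.0]  # mutable fake clock, in seconds
cache = LocalPrefixCache(lambda: coordinator_registry, lambda: now[0])

cache.maybe_refresh()
coordinator_registry["new-prefix"] = 2     # created elsewhere in the cluster
now[0] = 0.0004
cache.maybe_refresh()
print(cache.lookup("new-prefix"))          # not visible yet: within the 1 ms window
now[0] = 0.0011
cache.maybe_refresh()
print(cache.lookup("new-prefix"))          # visible after the next refresh
```

A miss inside the staleness window costs a duplicate allocation, not correctness: the session simply builds its own copy of the prefix.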

Failure Modes & Correctness Risks

This section exists because every serious infrastructure proposal must enumerate its ways of breaking. These are not edge cases — they are the expected failure modes that a production deployment will encounter.

Failure Mode	Trigger	Impact	Mitigation
Stale prefix block reuse	Hash collision or cache staleness causes wrong prefix to be served	Silent inference corruption — model generates from wrong context	Full prefix hash verification on first use per session; ECC-protected SRAM; collision count telemetry
DMA fence timeout	NVMe read takes >5 ms; fabric congestion; ASIC internal deadlock	Decode stall; timeout escalates to CPU; session may be killed	Watchdog timer per DMA channel; fallback to CPU-managed copy; circuit breaker on ASIC timeout rate
Metadata coordinator partition	Network split between coordinator replicas	Prefix sharing degrades to per-chip; no cross-ASIC reuse	Raft with split-brain prevention; ASICs continue with local-only dedup; graceful degradation, not crash
Compression quality regression	Model update changes head volatility profile; INT4 applied to now-high-volatility heads	Perplexity degradation; potentially silent (no hard error)	Per-model codec profile registered at session create; profile version checked on block decompress
Eviction of decode-active block	Bug in urgency score computation; race between score update and eviction decision	Decode stall or fetch of zero block; visible output corruption	Hardware lock on DECODE-state session; eviction engine cannot mark protected blocks; verified in RTL
HBM ECC failure	Cosmic ray or wear-out on on-package HBM	Single-bit: corrected silently. Multi-bit: session killed, block invalidated	SECDED ECC on all HBM; multi-bit error triggers session fence and re-prefill from cold storage
CXL link reset	PCIe/CXL link retraining during active decode	All in-flight DMA aborted; warm tier unavailable for ~50 ms retrain window	In-flight DMA list persisted to SRAM; re-issued on link restore; hot tier expands to absorb during retrain

Engineering Priority

The two silent failure modes — stale prefix reuse and compression quality regression — are the hardest to detect in production because they produce wrong output without raising errors. Both require explicit detection logic: prefix hash verification at the session boundary, and per-model codec profile pinning. Neither is expensive to implement; both are easy to omit in a first implementation.
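The prefix-verification half of that detection logic might look like the sketch below. It assumes the fast path indexes blocks by a short tag and that the full hash is compared on a session's first use of a block. The tag function is deliberately weak (sum of tokens modulo 256) so that a collision is easy to demonstrate; it is not a proposal for the real hash, and all class and method names are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Sequence, Set, Tuple


def full_hash(tokens: Sequence[int]) -> bytes:
    """Strong hash of the whole prefix, used only for verification."""
    return hashlib.sha256(" ".join(str(t) for t in tokens).encode()).digest()


def weak_tag(tokens: Sequence[int]) -> bytes:
    """Deliberately weak stand-in for a truncated hardware lookup hash."""
    return bytes([sum(tokens) % 256])


class PrefixRegistry:
    def __init__(self) -> None:
        self._blocks: Dict[bytes, Tuple[bytes, int]] = {}  # tag -> (full hash, block id)
        self._verified: Set[Tuple[int, bytes]] = set()     # (session, tag) pairs
        self.collisions = 0                                # telemetry counter

    def publish(self, tokens: Sequence[int], block_id: int) -> None:
        self._blocks[weak_tag(tokens)] = (full_hash(tokens), block_id)

    def lookup(self, session: int, tokens: Sequence[int]) -> Optional[int]:
        """Return a shared block id, or None if absent or if verification fails."""
        tag = weak_tag(tokens)
        entry = self._blocks.get(tag)
        if entry is None:
            return None
        stored_full, block_id = entry
        if (session, tag) not in self._verified:
            # First use in this session: compare the full hash, not just the tag.
            if stored_full != full_hash(tokens):
                self.collisions += 1
                return None
            self._verified.add((session, tag))
        return block_id


registry = PrefixRegistry()
registry.publish([1, 2, 3], block_id=7)
print(registry.lookup(session=1, tokens=[1, 2, 3]))   # genuine match
print(registry.lookup(session=2, tokens=[3, 2, 1]))   # same tag, different prefix
print(registry.collisions)
```

After the first verified use, this sketch trusts the tag for the rest of the session, which keeps the check cheap but leaves a later in-session collision undetected. Whether that residual risk is acceptable is a design decision the collision counter is meant to inform.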

Silicon Roadmap: KV-1 → KV-3

A single chip generation is not a product. The roadmap matters for the investment case. Here is a plausible three-generation arc, assuming a 2-year cadence and stable process node progression:

Generation	Process	New Capabilities	Target Workload
KV-1 (this chip)	7nm	HBM3e hot tier · INT4/FP8 compression · prefix dedup · PCIe Gen5/CXL 3.0	128k context, 100+ concurrent sessions, single node
KV-2	4nm / 3nm	On-die ML-based eviction predictor · learned compression codecs · cluster-wide prefix fabric · NVLink 5 support · 2× DMA channels	1M+ token context windows · multi-node prefix sharing · agentic memory persistence
KV-3	2nm / 1.8nm	In-memory attention scoring for prefix reuse decisions · KV-native compression at FP4 · cross-datacenter session handoff · hardware-enforced privacy domains	Persistent agent memory at datacenter scale · multi-tenant isolation · cross-region session continuity

The most significant jump is KV-2's on-die ML predictor for eviction and prefetch. Instead of a fixed weighted-logit scorer, KV-2 would run a small (~1M parameter) learned model that predicts decode urgency from session features — improving prefetch accuracy from ~75% to an estimated ~92% hit rate on realistic workload mixes. This predictor is updated offline from telemetry traces and loaded into on-die SRAM at boot.
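For contrast with the learned predictor, a fixed weighted-logit scorer of the kind KV-1 is described as using could be as simple as the sketch below. The feature names, weights, and bias are all hypothetical; the article does not list them. The protected set stands in for the hardware lock on decode-active blocks.

```python
import math
from typing import Dict, Set

# Hypothetical features and weights, for illustration only.
WEIGHTS: Dict[str, float] = {
    "is_scheduled_next": 3.0,   # session is next in the decode queue
    "recency": 1.5,             # 0..1, higher means touched more recently
    "shared_prefix_refs": 0.5,  # how many sessions reference the block
}
BIAS = -2.0


def urgency(features: Dict[str, float]) -> float:
    """Sigmoid of a fixed weighted sum; missing features count as zero."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def pick_victim(blocks: Dict[str, Dict[str, float]], protected: Set[str]) -> str:
    """Evict the least urgent block that is not protected by an active decode."""
    candidates = [b for b in blocks if b not in protected]
    if not candidates:
        raise RuntimeError("no evictable blocks")
    return min(candidates, key=lambda b: urgency(blocks[b]))


blocks = {
    "a": {"is_scheduled_next": 1.0, "recency": 0.9},
    "b": {"recency": 0.1},
    "c": {},  # scores lowest, but belongs to a decode-active session
}
print(pick_victim(blocks, protected={"c"}))
```

A learned model would replace the fixed WEIGHTS table with parameters fitted to telemetry, while the protection check would stay outside the model so that a bad prediction cannot evict an active block.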

KV-3's hardware privacy domains are an enterprise requirement: in multi-tenant serving, one tenant's KV state must be cryptographically isolated from another's. At the software layer this is expensive; in hardware it reduces to address translation with per-domain keys, which is a solved problem in security processors.

What the ASIC Does Not Replace

It is worth being explicit about scope, because infrastructure bets have a way of expanding past their original boundary.

Not replaced

The inference runtime

vLLM, SGLang, TGI — these own scheduling, batching, and the forward pass. The ASIC is subordinate to the runtime, not a replacement for it.

Not replaced

The GPU accelerator

All attention computation, FFN, and sampling remain on GPU. The ASIC only handles KV memory — it never touches activations or model weights.

Not replaced

Model quality decisions

How much compression is acceptable is a model-specific empirical question. The ASIC provides the mechanism; the operator sets the policy.

Not replaced

Request routing

Which GPU processes which request is still decided by the orchestrator (Kubernetes, Dynamo, etc.). The ASIC improves the memory layer beneath that decision.

What Comes Next

Part 4 turns to the software stack: the full KV semantic API specification, the session metadata protocol between runtimes and the ASIC driver, and the changes needed in vLLM and SGLang to consume a hardware KV layer without degrading their existing scheduling correctness.

The silicon described here is the physical contract. The protocol above it is the programming contract. Getting the protocol right is harder — because it requires consensus across the inference runtime ecosystem, not just a correct RTL implementation.