Dedicated silicon is not an inevitable step — it's a specific bet. The bet is that the KV workload is stable enough to justify fixed-function hardware, and distinct enough from general compute that a GPU or CPU can't serve it efficiently. Before that bet is made, you need to know exactly what the chip is doing.
This piece walks through the KV ASIC block by block: the die floorplan, the internal data paths, the schematic for each functional unit, and a grounded estimate of the performance uplift against a baseline software-only stack.
The Die Floorplan
The floorplan is designed around one constraint: DMA engines must be adjacent to both the fabric interface and the memory controllers, because every KV transfer must pass through them.
Block-Level Schematics
Each functional block has a distinct internal architecture. The tabs below walk through the four most architecturally interesting units.
Performance Projections
These numbers are modelled estimates based on published hardware specs, comparable workloads, and conservative architectural assumptions. They are not silicon measurements — treat them as the upper envelope of what well-engineered dedicated hardware could achieve against a software-only baseline running on a CPU-managed DRAM hierarchy.
Throughput Comparison by Workload
Estimated tokens/second per GPU node, relative to software baseline (1.0×). KV ASIC offload assumes the chip sits on a PCIe Gen5 slot adjacent to each GPU.
Approximate Power Budget
Estimated TDP for a production KV ASIC at 7nm. Comparison point: H100 SXM TDP ≈ 700W.
Comparison: ASIC vs. Software-Only vs. CXL-Only
| Capability | Software Only | CXL Expansion | KV ASIC |
|---|---|---|---|
| Prefix deduplication | SW, high CPU cost | None | HW, <100ns |
| KV compression | SW, latency hit | None | HW, inline, 1.2 TB/s |
| Eviction policy | LRU / custom SW | Generic LRU | Decode-urgency-aware |
| Prefetch scheduling | None / manual | None | Urgency model, HW |
| Topology awareness | None | Partial (CXL only) | Full hierarchy |
| Session isolation / QoS | SW, coarse | None | Per-session SLA |
| Observability | SW tracing | None | HW perf counters |
| Complexity / risk | Low | Low | High (new silicon) |
The gains are real — but they compound only when the workload demands them. At short context and low concurrency, the ASIC is an expensive NIC. At 128k context and high concurrency, it may be the difference between viable and not.
(architectural tradeoff summary)
Memory Timing Deep Dive
Every claim about latency hides a ladder of sub-operations. Here is what actually happens, in order, when a GPU decode step requests a KV block that has been demoted to CXL:
Two microseconds is a long time at GPU speeds: at roughly 1 PFLOP/s of peak dense BF16 throughput, an H100 could execute on the order of 2 billion FLOPs in that window. The prefetch scheduler's job is to ensure that the CXL read (steps 5–6, the longest segment) is in flight before the GPU finishes the preceding decode step. When it succeeds, the effective latency seen by the GPU is just the final DMA write: ~400 ns. That is the difference between a stall and transparency.
The NVMe cold tier is worse by roughly 2 orders of magnitude — a full NVMe read adds ~100 µs. This is why archived sessions must be explicitly warmed into CXL before a session resumes, not fetched on-demand. The prefetch scheduler handles this: when a session is scheduled to resume, a warm-up prefetch is issued 500 µs in advance.
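The overlap argument in this section can be made concrete with a small timing model. This is only a sketch. The ~2 µs demand fetch, the ~400 ns final DMA write, the ~100 µs NVMe read and the 500 µs warm-up lead are taken from the text. The function name `exposed_latency_ns` and the assumption that everything except the final DMA write can be hidden behind earlier work are illustrative, not part of the design.

```python
def exposed_latency_ns(fetch_total_ns: float, final_dma_ns: float, lead_ns: float) -> float:
    """Latency the GPU actually waits for when a fetch was issued lead_ns early.

    Assumption: every step before the final DMA write can overlap with
    earlier decode work, while the final DMA write is always exposed.
    """
    hideable_ns = fetch_total_ns - final_dma_ns
    still_pending_ns = max(0.0, hideable_ns - lead_ns)
    return still_pending_ns + final_dma_ns


CXL_FETCH_NS = 2_000      # ~2 us demand fetch from the CXL tier (from the text)
FINAL_DMA_NS = 400        # ~400 ns final DMA write (from the text)
NVME_READ_NS = 100_000    # ~100 us additional NVMe read (from the text)
WARMUP_LEAD_NS = 500_000  # warm-up issued 500 us before resume (from the text)

# No prefetch: the GPU stalls for the whole ladder.
stall_ns = exposed_latency_ns(CXL_FETCH_NS, FINAL_DMA_NS, lead_ns=0)

# Prefetch issued early enough: only the final DMA write remains visible.
hidden_ns = exposed_latency_ns(CXL_FETCH_NS, FINAL_DMA_NS, lead_ns=1_600)

# Archived session: NVMe read plus CXL ladder, covered by the warm-up lead.
cold_ns = exposed_latency_ns(NVME_READ_NS + CXL_FETCH_NS, FINAL_DMA_NS, WARMUP_LEAD_NS)

print(stall_ns, hidden_ns, cold_ns)
```

In this model the 500 µs lead leaves roughly a 5× margin over a single ~100 µs NVMe read, which is one way to read why the warm-up is issued that far ahead.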
The Programming Model
Hardware is only as useful as the software contract above it. The KV ASIC exposes four surfaces to the inference stack:
The key design decision in the API is that the inference runtime speaks in semantics, not addresses. It calls kv_block_fetch(session, layer, head_range, token_span) — and the ASIC resolves that semantic request to a physical address, selects the tier, initiates the DMA, applies decompression if needed, and returns a fence. The runtime never knows or cares whether the block came from HBM, CXL, or NVMe.
This separation is load-bearing. It means the eviction policy, compression strategy, and tier placement can all be changed — or learned — without modifying the inference runtime. The ASIC is the policy engine; the runtime is the policy consumer.
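A toy software model may help show what "semantics, not addresses" buys. The call name `kv_block_fetch` and its four arguments come from the text. Everything else here (`KvAsicModel`, `BlockKey`, `Fence`, `place`, `tier_of`) is hypothetical, and `zlib` merely stands in for the ASIC's codec, which the article describes as INT4/FP8 and which would be lossy rather than lossless.

```python
import zlib
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Tier(Enum):
    HBM = "hbm"
    CXL = "cxl"
    NVME = "nvme"


@dataclass(frozen=True)
class BlockKey:
    """Semantic identity of a KV block: no physical address anywhere."""
    session: int
    layer: int
    head_range: Tuple[int, int]
    token_span: Tuple[int, int]


@dataclass
class Fence:
    """Hypothetical completion handle; this toy version is already resolved."""
    payload: bytes

    def wait(self) -> bytes:
        return self.payload


class KvAsicModel:
    """Software stand-in for the ASIC's policy engine (illustrative only)."""

    def __init__(self) -> None:
        # Semantic key -> (tier, stored bytes, whether the bytes are compressed).
        self._placement: Dict[BlockKey, Tuple[Tier, bytes, bool]] = {}

    def place(self, key: BlockKey, tier: Tier, data: bytes, compress: bool = False) -> None:
        stored = zlib.compress(data) if compress else data
        self._placement[key] = (tier, stored, compress)

    def tier_of(self, key: BlockKey) -> Tier:
        return self._placement[key][0]

    def kv_block_fetch(self, session, layer, head_range, token_span) -> Fence:
        key = BlockKey(session, layer, head_range, token_span)
        _tier, stored, compressed = self._placement[key]
        data = zlib.decompress(stored) if compressed else stored
        return Fence(data)


asic = KvAsicModel()
key = BlockKey(session=7, layer=0, head_range=(0, 8), token_span=(0, 128))
block = bytes(range(256)) * 4

asic.place(key, Tier.HBM, block)
before = asic.kv_block_fetch(7, 0, (0, 8), (0, 128)).wait()

# Policy change: demote to CXL and compress. The runtime call is unchanged.
asic.place(key, Tier.CXL, block, compress=True)
after = asic.kv_block_fetch(7, 0, (0, 8), (0, 128)).wait()
```

The runtime-side call is identical before and after the demotion, which is the property the separation is meant to protect.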
Session Lifecycle
Every KV session passes through five states. The ASIC enforces valid transitions in hardware — an invalid transition (e.g. fetching a block from an ARCHIVED session without first issuing a warm-up) returns an error code and a DMA fence that resolves to a zero block, preventing silent corruption.
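The transition rule can be sketched as a small state machine. Only two state names, ARCHIVED and DECODE, appear in this article. The other three (PREFILL, IDLE, WARMING), the transition table, the error codes and the block size are placeholders invented for illustration, so treat the shape of the check as the point rather than the specific states.

```python
from enum import Enum, auto
from typing import Dict, Set, Tuple


class State(Enum):
    PREFILL = auto()   # hypothetical name
    DECODE = auto()    # named in the article
    IDLE = auto()      # hypothetical name
    ARCHIVED = auto()  # named in the article
    WARMING = auto()   # hypothetical name


# Hypothetical transition table: which states may follow which.
ALLOWED: Dict[State, Set[State]] = {
    State.PREFILL: {State.DECODE},
    State.DECODE: {State.IDLE},
    State.IDLE: {State.DECODE, State.ARCHIVED},
    State.ARCHIVED: {State.WARMING},
    State.WARMING: {State.IDLE},
}
FETCHABLE = {State.PREFILL, State.DECODE, State.IDLE}

OK, ERR_INVALID_STATE = 0, 1  # hypothetical error codes
BLOCK_BYTES = 16              # toy block size


class Session:
    def __init__(self) -> None:
        self.state = State.PREFILL
        self.blocks: Dict[int, bytes] = {}

    def transition(self, new_state: State) -> int:
        if new_state not in ALLOWED[self.state]:
            return ERR_INVALID_STATE
        self.state = new_state
        return OK

    def fetch(self, block_id: int) -> Tuple[int, bytes]:
        # An invalid fetch returns an error code plus a zero block,
        # rather than whatever bytes happen to be at the old location.
        if self.state not in FETCHABLE:
            return ERR_INVALID_STATE, bytes(BLOCK_BYTES)
        return OK, self.blocks[block_id]


s = Session()
s.blocks[0] = b"\x01" * BLOCK_BYTES
for nxt in (State.DECODE, State.IDLE, State.ARCHIVED):
    s.transition(nxt)

archived_result = s.fetch(0)                   # no warm-up issued yet
skip_warmup = s.transition(State.DECODE)       # ARCHIVED -> DECODE is rejected
s.transition(State.WARMING)
s.transition(State.IDLE)
warmed_result = s.fetch(0)
```

Returning a well-defined zero block alongside the error means a runtime that ignores the error code still reads deterministic data, not stale contents.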
Signal Integrity & Physical Constraints
At 7nm with PCIe Gen5 signalling, the chip's physical design has non-trivial constraints. These aren't theoretical — they determine what the floorplan can and cannot look like.
| Interface | Signal rate | Constraint | Mitigation |
|---|---|---|---|
| PCIe Gen5 | 32 GT/s per lane | Trace length matching ±5 mil, EQ training on link-up | On-die retimer, adaptive equaliser |
| CXL 3.0 | 32 GT/s (shared Gen5 PHY with PCIe; CXL 3.0's native rate is 64 GT/s on a PCIe 6.0 PHY) | Protocol negotiation adds ~10 µs at link-up; cache coherency protocol overhead | Dedicated CXL arbiter; FLIT-mode to reduce protocol overhead by ~40% |
| HBM3e | ~1.2 TB/s per stack | Must be within 2mm of die for signal integrity; bump pitch 55 µm | 2.5D interposer (TSMC CoWoS or equivalent); up to 4 HBM stacks |
| DDR5 | ~51 GB/s per channel at 6400 MT/s | Maximum trace length 5 inches at 6400 MT/s; power delivery noise | On-package decoupling; up to 8 channels (~410 GB/s aggregate) |
| NVMe P2P DMA | PCIe Gen4×4 per SSD | P2P DMA bypasses host CPU/DRAM; requires IOMMU passthrough | BAR-to-BAR DMA; DMA completion interrupt via MSI-X |
The HBM constraint is the most significant. Placing HBM stacks on a 2.5D interposer (CoWoS-style) adds ~$800–1,200 per unit to packaging cost, but it is the only practical way to achieve the ~3 TB/s bandwidth the active decode tier needs. Without it, the chip's hot tier degrades to DDR5 speeds (~400 GB/s), which is insufficient for high-concurrency long-context serving.
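The peak-bandwidth figures for the slower interfaces can be re-derived from signalling rates. The sketch below assumes a 64-bit (8-byte) data bus per DDR5 channel and 128b/130b line encoding on PCIe Gen4/Gen5. It ignores protocol overhead, so the results are upper bounds rather than achievable throughput. The helper names are illustrative.

```python
def ddr5_channel_gb_per_s(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak GB/s for one DDR5 channel, assuming a 64-bit (8-byte) data bus."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def pcie_gb_per_s(giga_transfers_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Peak GB/s per direction for a PCIe link with 128b/130b encoding."""
    return giga_transfers_per_s * lanes * encoding / 8


ddr5_channel = ddr5_channel_gb_per_s(6400)   # one channel at 6400 MT/s
ddr5_aggregate = 8 * ddr5_channel            # "up to 8 channels" from the table
gen5_x16 = pcie_gb_per_s(32, lanes=16)       # host-facing PCIe Gen5 slot
gen4_x4 = pcie_gb_per_s(16, lanes=4)         # one NVMe SSD

HOT_TIER_TARGET_GB_PER_S = 3000              # ~3 TB/s target from the text

print(f"DDR5 channel:   {ddr5_channel:.1f} GB/s")
print(f"DDR5 aggregate: {ddr5_aggregate:.1f} GB/s")
print(f"PCIe Gen5 x16:  {gen5_x16:.1f} GB/s per direction")
print(f"PCIe Gen4 x4:   {gen4_x4:.1f} GB/s per direction")
print(f"Hot-tier target / DDR5 aggregate: {HOT_TIER_TARGET_GB_PER_S / ddr5_aggregate:.1f}x")
```

Even with all eight DDR5 channels populated, the aggregate stays several times below the hot-tier target, which is the gap the interposer cost is paying to close.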
An alternative design — using HBM only on the GPU and serving the ASIC exclusively from CXL — is viable at lower concurrency but collapses under 128k-context loads where the GPU's HBM fills entirely with active decode state and the ASIC has nowhere to stage warm blocks fast enough.
Multi-Chip Topology
A single KV ASIC can serve 2–4 GPUs depending on context length and concurrency. In a large cluster, multiple ASICs form a fabric — and the KV metadata must be coordinated across them. Here is the reference cluster topology for a 32-GPU node:
The metadata coordinator is the most architecturally sensitive component in the cluster. It maintains a distributed registry of all live prefix blocks so that a session on ASIC 3 can reuse a prefix block that was first created by ASIC 7. Without it, prefix deduplication is local-only and the gain degrades from cluster-wide to per-chip.
The coordinator uses Raft consensus with a 3-replica write quorum. Typical write latency is ~200 µs at 10 Gbps interconnect — fast enough for session creation (which happens once) but too slow for per-block lookup. Block lookups therefore use a local SRAM cache of the prefix registry, refreshed asynchronously from the coordinator. Cache staleness is bounded to 1 ms, which means a prefix created in the last millisecond may not be shared cluster-wide — an acceptable tradeoff for near-zero lookup latency.
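The staleness tradeoff can be illustrated with a minimal local cache. This is a sketch under assumptions: the class `PrefixCache`, its methods, and the choice to enforce the bound by refusing lookups once the last snapshot is older than 1 ms are all hypothetical, and a plain dictionary stands in for the coordinator. Time is passed in explicitly, in microseconds, to keep the example deterministic.

```python
from typing import Dict, Optional


class PrefixCache:
    """Local copy of the prefix registry with a bounded-staleness check."""

    def __init__(self, max_staleness_us: int = 1_000) -> None:
        self.max_staleness_us = max_staleness_us
        self._entries: Dict[str, int] = {}       # prefix hash -> owning ASIC id
        self._synced_at_us: Optional[int] = None

    def refresh(self, coordinator_snapshot: Dict[str, int], now_us: int) -> None:
        # In hardware this would be an asynchronous background copy.
        self._entries = dict(coordinator_snapshot)
        self._synced_at_us = now_us

    def lookup(self, prefix_hash: str, now_us: int) -> Optional[int]:
        if self._synced_at_us is None:
            return None
        if now_us - self._synced_at_us > self.max_staleness_us:
            # Snapshot too old: behave as if the prefix is not shared.
            return None
        return self._entries.get(prefix_hash)


coordinator = {"prefix-a": 7}            # stand-in for the replicated registry
cache = PrefixCache()
cache.refresh(coordinator, now_us=0)

coordinator["prefix-b"] = 3              # created at t=400 us on another ASIC
old_hit = cache.lookup("prefix-a", now_us=500)
new_miss = cache.lookup("prefix-b", now_us=500)    # not visible yet

cache.refresh(coordinator, now_us=1_000)
new_hit = cache.lookup("prefix-b", now_us=1_200)   # visible after refresh
expired = cache.lookup("prefix-a", now_us=2_500)   # snapshot older than bound
```

A miss here is safe by construction: the worst case is that a prefix is recomputed or stored locally instead of being shared.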
Failure Modes & Correctness Risks
This section exists because every serious infrastructure proposal must enumerate its ways of breaking. These are not edge cases — they are the expected failure modes that a production deployment will encounter.
| Failure Mode | Trigger | Impact | Mitigation |
|---|---|---|---|
| Stale prefix block reuse | Hash collision or cache staleness causes wrong prefix to be served | Silent inference corruption — model generates from wrong context | Full prefix hash verification on first use per session; ECC-protected SRAM; collision count telemetry |
| DMA fence timeout | NVMe read takes >5 ms; fabric congestion; ASIC internal deadlock | Decode stall; timeout escalates to CPU; session may be killed | Watchdog timer per DMA channel; fallback to CPU-managed copy; circuit breaker on ASIC timeout rate |
| Metadata coordinator partition | Network split between coordinator replicas | Prefix sharing degrades to per-chip; no cross-ASIC reuse | Raft with split-brain prevention; ASICs continue with local-only dedup; graceful degradation, not crash |
| Compression quality regression | Model update changes head volatility profile; INT4 applied to now-high-volatility heads | Perplexity degradation; potentially silent (no hard error) | Per-model codec profile registered at session create; profile version checked on block decompress |
| Eviction of decode-active block | Bug in urgency score computation; race between score update and eviction decision | Decode stall or fetch of zero block; visible output corruption | Hardware lock on DECODE-state session; eviction engine cannot mark protected blocks; verified in RTL |
| HBM ECC failure | Cosmic ray or wear-out on on-package HBM | Single-bit: corrected silently. Multi-bit: session killed, block invalidated | SECDED ECC on all HBM; multi-bit error triggers session fence and re-prefill from cold storage |
| CXL link reset | PCIe/CXL link retraining during active decode | All in-flight DMA aborted; warm tier unavailable for ~50 ms retrain window | In-flight DMA list persisted to SRAM; re-issued on link restore; hot tier expands to absorb during retrain |
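The first mitigation in the table, full prefix hash verification, can be sketched as a two-stage lookup: a short key selects a candidate, and the full digest must match before the block is reused. The article does not specify the hash function or key width, so SHA-256, the 16-bit short key and every name below (`PrefixRegistry`, `lookup_verified`, and so on) are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, Optional, Sequence, Tuple


def full_digest(tokens: Sequence[int]) -> bytes:
    """SHA-256 over the token ids (hash choice is an assumption)."""
    raw = b"".join(t.to_bytes(4, "little") for t in tokens)
    return hashlib.sha256(raw).digest()


def short_key(tokens: Sequence[int], bits: int = 16) -> int:
    """Truncated key used for the fast table lookup; collisions are possible."""
    return int.from_bytes(full_digest(tokens)[:8], "little") & ((1 << bits) - 1)


class PrefixRegistry:
    def __init__(self, key_fn: Callable[[Sequence[int]], int] = short_key) -> None:
        self._key_fn = key_fn
        self._table: Dict[int, Tuple[bytes, int]] = {}  # short key -> (digest, block id)
        self.collisions = 0                              # telemetry counter

    def register(self, tokens: Sequence[int], block_id: int) -> None:
        self._table[self._key_fn(tokens)] = (full_digest(tokens), block_id)

    def lookup_verified(self, tokens: Sequence[int]) -> Optional[int]:
        entry = self._table.get(self._key_fn(tokens))
        if entry is None:
            return None
        digest, block_id = entry
        if digest != full_digest(tokens):
            # Short keys matched but the prefixes differ: refuse to share.
            self.collisions += 1
            return None
        return block_id


# Force every prefix onto the same short key to exercise the collision path.
registry = PrefixRegistry(key_fn=lambda tokens: 0)
registry.register([1, 2, 3, 4], block_id=42)

same_prefix = registry.lookup_verified([1, 2, 3, 4])
other_prefix = registry.lookup_verified([9, 9, 9, 9])
```

The collision counter is what would feed the "collision count telemetry" named in the table; a rising count suggests the short key is too narrow for the live prefix population.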
Silicon Roadmap: KV-1 → KV-3
A single chip generation is not a product. The roadmap matters for the investment case. Here is a plausible three-generation arc, assuming a 2-year cadence and stable process node progression:
| Generation | Process | New Capabilities | Target Workload |
|---|---|---|---|
| KV-1 (this chip) | 7nm | HBM3e hot tier · INT4/FP8 compression · prefix dedup · PCIe Gen5/CXL 3.0 | 128k context, 100+ concurrent sessions, single node |
| KV-2 | 4nm / 3nm | On-die ML-based eviction predictor · learned compression codecs · cluster-wide prefix fabric · NVLink 5 support · 2× DMA channels | 1M+ token context windows · multi-node prefix sharing · agentic memory persistence |
| KV-3 | 2nm / 1.8nm | In-memory attention scoring for prefix reuse decisions · KV-native compression at FP4 · cross-datacenter session handoff · hardware-enforced privacy domains | Persistent agent memory at datacenter scale · multi-tenant isolation · cross-region session continuity |
The most significant jump is KV-2's on-die ML predictor for eviction and prefetch. Instead of a fixed weighted-logit scorer, KV-2 would run a small (~1M parameter) learned model that predicts decode urgency from session features — improving prefetch accuracy from ~75% to an estimated ~92% hit rate on realistic workload mixes. This predictor is updated offline from telemetry traces and loaded into on-die SRAM at boot.
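For contrast with the learned predictor, the fixed weighted-logit scorer it would replace might look like the sketch below. The article names the technique but not its inputs, so the feature names, weights and bias are invented for illustration. The one rule taken from the text is that blocks belonging to a DECODE-state session are locked and can never be chosen for eviction.

```python
import math
from typing import Dict, List, Optional

# Hypothetical hand-tuned weights; a learned model would replace these.
WEIGHTS: Dict[str, float] = {
    "steps_until_decode": -0.8,  # further from use -> less urgent
    "is_interactive": 1.5,       # latency-sensitive session
    "recent_hit_rate": 2.0,      # block has been reused recently
}
BIAS = 0.0


def urgency(features: Dict[str, float]) -> float:
    """Weighted sum of features passed through a logistic function."""
    z = BIAS + sum(w * features[name] for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def pick_victim(blocks: List[dict]) -> Optional[str]:
    """Evict the least urgent block that is not locked by an active decode."""
    candidates = [b for b in blocks if not b["decode_locked"]]
    if not candidates:
        return None
    return min(candidates, key=lambda b: urgency(b["features"]))["id"]


blocks = [
    {"id": "A", "decode_locked": True,
     "features": {"steps_until_decode": 50, "is_interactive": 0, "recent_hit_rate": 0.0}},
    {"id": "B", "decode_locked": False,
     "features": {"steps_until_decode": 1, "is_interactive": 1, "recent_hit_rate": 0.9}},
    {"id": "C", "decode_locked": False,
     "features": {"steps_until_decode": 10, "is_interactive": 0, "recent_hit_rate": 0.2}},
]

victim = pick_victim(blocks)
```

Block A has the lowest score in this example but is skipped because it is locked, so the eviction falls on C. Swapping `urgency` for a learned model would leave `pick_victim` and the lock check untouched.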
KV-3's hardware privacy domains are an enterprise requirement: in multi-tenant serving, one tenant's KV state must be cryptographically isolated from another's. At the software layer this is expensive; in hardware it reduces to address translation with per-domain keys, which is a solved problem in security processors.
What the ASIC Does Not Replace
It is worth being explicit about scope, because infrastructure bets have a way of expanding past their original boundary.
What Comes Next
Part 4 turns to the software stack: the full KV semantic API specification, the session metadata protocol between runtimes and the ASIC driver, and the changes needed in vLLM and SGLang to consume a hardware KV layer without degrading their existing scheduling correctness.
The silicon described here is the physical contract. The protocol above it is the programming contract. Getting the protocol right is harder — because it requires consensus across the inference runtime ecosystem, not just a correct RTL implementation.