MAN\SH AI / Writings

· AI Memory Infrastructure · 18 min read

Part 3 · KV ASIC Architecture · Deep Dive

Inside the KV ASIC: Silicon, Schematics & Performance

Parts 1 and 2 made the case for dedicated KV infrastructure. Part 3 opens the hood — what the chip actually looks like, how its internal blocks connect, and what performance gains are plausible.

Deep Technical · ~18 min read · Architecture Series

Dedicated silicon is not an inevitable step — it's a specific bet. The bet is that the KV workload is stable enough to justify fixed-function hardware, and distinct enough from general compute that a GPU or CPU can't serve it efficiently. Before that bet is made, you need to know exactly what the chip is doing.

This piece walks through the KV ASIC block by block: the die floorplan, the internal data paths, the schematic for each functional unit, and a grounded estimate of the performance uplift against a baseline software-only stack.

The Die Floorplan

The floorplan is designed around one constraint: the DMA engines must sit adjacent to both the fabric interface and the memory controllers, because every KV transfer passes through them.

Figure: KV Infrastructure ASIC die floorplan (KV-1 · 7nm · ~180mm²)
Front End: PCIe Gen5 / CXL 3.0 / NVLink 4 · 256 lanes · ~1.6 TB/s aggregate bandwidth · hardware credit flow control
DMA Engines: ×32 channels
Session Metadata SRAM: 256 MB on-die
Prefix Dedup Engine: SimHash + ref-count
Prefetch Scheduler: urgency model
KV Compression / Decompression Array: INT4 / FP8 quantisation · head-aware codec · ×16 parallel units · ~1.2 TB/s throughput
QoS + Isolation: per-session SLAs · priority queuing
Eviction Policy: decode-urgency-aware
Telemetry: HW perf counters · tracing
Memory Controllers: HBM3e · DDR5 · CXL 3.0 attached pool · NVMe P2P DMA
Internal Crossbar: 512-bit

Block-Level Schematics

Each functional block has a distinct internal architecture. The schematics below cover the four most architecturally interesting units: the DMA engine, prefix dedup, compression, and eviction.

Schematic: DMA Engine (×32 channels)
Blocks: Channel Arbiter (priority queue) · Scatter-Gather Engine (KV page map) · Credit Flow Control (back-pressure) · Completion Queue
Buses: Memory Read/Write Bus (512-bit @ 4 GHz) · Fabric Interface Bus (PCIe/CXL TLP)
Target: 32 simultaneous KV block transfers · sub-500 ns initiation latency per channel

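The credit flow control block in the DMA schematic can be illustrated with a small software model. This is a minimal sketch, not the chip's logic: the class name `CreditedChannel` and its methods are hypothetical, and it assumes one credit per outstanding transfer on a single channel.

```python
from collections import deque


class CreditedChannel:
    """Toy model of one DMA channel with credit-based back-pressure."""

    def __init__(self, credits):
        self.credits = credits          # transfers the receiver can still accept
        self.in_flight = deque()        # submitted, not yet completed
        self.completions = deque()      # stands in for the completion queue

    def submit(self, transfer_id):
        """Consume one credit per transfer; refuse when none are left."""
        if self.credits == 0:
            return False                # back-pressure: caller must retry later
        self.credits -= 1
        self.in_flight.append(transfer_id)
        return True

    def complete_one(self):
        """Retire the oldest transfer and hand its credit back."""
        transfer_id = self.in_flight.popleft()
        self.completions.append(transfer_id)
        self.credits += 1
        return transfer_id


channel = CreditedChannel(credits=2)
accepted = [channel.submit(t) for t in ("a", "b", "c")]   # third one is refused
finished = channel.complete_one()                          # frees one credit
retry_ok = channel.submit("c")
```

The point of the design is that the sender never overruns the receiver: a transfer is only issued when a credit proves there is room for it.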
Schematic: Prefix Deduplication Engine
Blocks: Token Ingress (boundary detect) · SimHash Pipeline (64-bit fingerprint) · Ref-Count Store (SRAM hash table) · Pointer Indirection (copy-on-write)
Outcomes: HIT → return pointer · MISS → allocate block
Typical system prompt hit rate in production: 70–95% · Each hit avoids full HBM allocation

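The hit/miss path of the dedup engine can be sketched in software. This is an illustrative model under stated assumptions: `PrefixDedupTable` and its methods are hypothetical names, a 64-bit BLAKE2b digest stands in for the SimHash pipeline, and a fingerprint match is confirmed against the stored tokens before a block is shared.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PrefixEntry:
    block_id: int      # pointer handed back on a hit
    tokens: tuple      # kept so a fingerprint match can be verified in full
    refcount: int = 0


class PrefixDedupTable:
    """Toy software model of a ref-counted prefix store."""

    def __init__(self):
        self._entries = {}       # 64-bit fingerprint -> PrefixEntry
        self._next_block = 0

    @staticmethod
    def fingerprint(tokens):
        # Stand-in for SimHash: 64-bit BLAKE2b digest of the token ids.
        raw = b"".join(t.to_bytes(4, "little") for t in tokens)
        return int.from_bytes(hashlib.blake2b(raw, digest_size=8).digest(), "little")

    def acquire(self, tokens):
        """Return (block_id, hit). A hit reuses a block; a miss allocates one."""
        tokens = tuple(tokens)
        fp = self.fingerprint(tokens)
        entry = self._entries.get(fp)
        if entry is not None and entry.tokens == tokens:
            entry.refcount += 1
            return entry.block_id, True
        block_id = self._next_block
        self._next_block += 1
        if entry is None:        # on a true collision the new block stays unshared
            self._entries[fp] = PrefixEntry(block_id, tokens, refcount=1)
        return block_id, False

    def release(self, tokens):
        """Drop one reference; return True only when the block is freed."""
        tokens = tuple(tokens)
        fp = self.fingerprint(tokens)
        entry = self._entries.get(fp)
        if entry is None or entry.tokens != tokens:
            return False
        entry.refcount -= 1
        if entry.refcount == 0:
            del self._entries[fp]
            return True
        return False


table = PrefixDedupTable()
system_prompt = [101, 2023, 2003, 1037, 2291, 25732]
first = table.acquire(system_prompt)     # miss: allocates block 0
second = table.acquire(system_prompt)    # hit: same block, refcount 2
freed_early = table.release(system_prompt)
freed_last = table.release(system_prompt)
```

The block is freed only when the last holder releases it, which mirrors the refcount rule the article states for prefix blocks.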
Schematic: KV Compression Array (×16 units)
Blocks: Head Volatility Classifier (per-layer profile) · Codec Router · INT4 Quantiser (low-volatility heads) · FP8 Quantiser (high-volatility heads) · Output Merge
INT4: compression ratio ~4–8× · perplexity delta <0.5% on most tasks (task-specific; characterise empirically)
FP8: compression ratio ~2× · preserves sensitive attention heads · used where quality matters most

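The routing idea in the compression array can be shown with a few lines of plain Python. This is a sketch under assumptions, not the head-aware codec itself: the function names, the volatility threshold of 0.5, and the symmetric per-vector INT4 scheme are all illustrative choices, and the ratio is computed against a 16-bit source while ignoring scale metadata.

```python
def quantise_int4(values):
    """Symmetric INT4: map floats to integers in [-8, 7] plus one shared scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantise_int4(codes, scale):
    """Recover approximate floats from INT4 codes."""
    return [c * scale for c in codes]


def route_codec(volatility, threshold=0.5):
    """Low-volatility heads take INT4; high-volatility heads keep FP8."""
    return "int4" if volatility < threshold else "fp8"


def compression_ratio(codec, source_bits=16):
    """Ratio of source width to stored width, ignoring scale metadata."""
    stored_bits = {"int4": 4, "fp8": 8}[codec]
    return source_bits / stored_bits


head_values = [1.4, -0.6, 0.2, 0.0]
codes, scale = quantise_int4(head_values)
restored = dequantise_int4(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(head_values, restored))
```

Under this scheme the round-trip error per value is bounded by half a quantisation step, which is why the choice of which heads tolerate INT4 has to be made empirically per model.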
Schematic: Eviction Policy Engine
Score inputs: decode_urgency · reuse_potential · prefix_refcount · last_access_age · session_priority · qos_tier
Blocks: Composite Scorer (weighted logit) · Priority Heap (min-heap, 256k entries) · Tier Router
Tier Router outcomes: KEEP in HBM · DEMOTE → CXL · ARCHIVE → NVMe · EVICT (free) · PRESERVE prefix
Key invariant: a session in active decode is NEVER evicted · Prefix blocks are NEVER deleted while refcount > 0
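A rough software analogue of the scorer, heap, and tier router follows. The input names come from the schematic; everything else is an assumption made for illustration: the weights, the routing thresholds, the `Block` type, and the choice to normalise every input to the range 0 to 1.

```python
import heapq
import math
from dataclasses import dataclass

# Hypothetical weights: the article names the inputs but not their values.
WEIGHTS = {
    "decode_urgency": 3.0,
    "reuse_potential": 2.0,
    "session_priority": 1.0,
    "qos_tier": 1.0,
    "last_access_age": -2.0,   # older blocks are less worth keeping
}


@dataclass
class Block:
    block_id: int
    state: str             # e.g. "DECODE" or "SUSPENDED"
    prefix_refcount: int
    features: dict         # each input normalised to [0, 1]


def keep_score(block):
    """Weighted logit: higher means more worth keeping in HBM."""
    z = sum(w * block.features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def route(block, score):
    """Map a score to a tier decision while honouring both invariants."""
    if block.state == "DECODE":
        return "KEEP"                      # active decode is never evicted
    if score >= 0.7:
        return "KEEP"
    if score >= 0.4:
        return "DEMOTE"                    # to CXL
    if score >= 0.2:
        return "ARCHIVE"                   # to NVMe
    return "PRESERVE" if block.prefix_refcount > 0 else "EVICT"


def plan_eviction(blocks, n_victims):
    """Pop the n lowest-scoring unprotected blocks from a min-heap."""
    heap = [(keep_score(b), b.block_id, b) for b in blocks if b.state != "DECODE"]
    heapq.heapify(heap)
    plan = {}
    for _ in range(min(n_victims, len(heap))):
        score, _, block = heapq.heappop(heap)
        plan[block.block_id] = route(block, score)
    return plan


blocks = [
    Block(1, "DECODE", 0, {"decode_urgency": 1.0}),
    Block(2, "SUSPENDED", 0, {"last_access_age": 1.0}),
    Block(3, "SUSPENDED", 2, {"last_access_age": 1.0}),
    Block(4, "SUSPENDED", 0, {"reuse_potential": 0.5, "last_access_age": 0.5}),
]
plan = plan_eviction(blocks, n_victims=3)
```

Filtering DECODE-state blocks before they ever reach the heap is the software equivalent of the hardware lock: a scoring bug can then misrank candidates but cannot select a protected block.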

Performance Projections

These numbers are modelled estimates based on published hardware specs, comparable workloads, and conservative architectural assumptions. They are not silicon measurements — treat them as the upper envelope of what well-engineered dedicated hardware could achieve against a software-only baseline running on a CPU-managed DRAM hierarchy.

Baseline Definition
Baseline = vLLM with PagedAttention on 8× H100 SXM, KV managed in host DRAM via CPU, PCIe Gen4 interconnect. No prefix dedup. No hardware compression. Eviction via LRU.
Decode throughput gain: 2.4–3.1× vs. baseline at 128k context, 100 sessions. Primary driver: prefix dedup eliminates ~70% of redundant HBM allocations.
TTFT (time-to-first-token): −58%. Prefetch scheduler hides CXL read latency behind prompt processing. Cold-start penalty drops from ~120 ms to ~50 ms median.
HBM footprint reduction: 4–8×. INT4 compression on low-volatility heads. Allows 4–8× more concurrent sessions per GPU node before HBM spill.
KV movement energy: −70%. Compression reduces bytes transferred. Dedup avoids transfers entirely for shared prefixes. DMA coalescing reduces PCIe overhead.
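The footprint figure can be sanity-checked with standard KV-cache sizing arithmetic: two tensors (keys and values) per layer, per KV head, per token. The model shape and the HBM budget used here are hypothetical values chosen for illustration, not figures from the baseline.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits_per_value):
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bits_per_value // 8


def sessions_that_fit(hbm_bytes, context_tokens, bytes_per_token):
    """Whole sessions of a given context length that fit in the budget."""
    return hbm_bytes // (context_tokens * bytes_per_token)


# Hypothetical model shape: 80 layers, 8 KV heads, head dimension 128.
fp16_bytes = kv_bytes_per_token(80, 8, 128, bits_per_value=16)
int4_bytes = kv_bytes_per_token(80, 8, 128, bits_per_value=4)

# Hypothetical budget: 40 GiB of HBM left for KV state, 128k-token sessions.
budget = 40 * 2**30
context = 128 * 1024
fp16_sessions = sessions_that_fit(budget, context, fp16_bytes)
int4_sessions = sessions_that_fit(budget, context, int4_bytes)
```

With these assumed numbers a single FP16 session at 128k context consumes the whole budget, while INT4 fits four, which is the lower end of the 4–8× range quoted for the ASIC.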

Throughput Comparison by Workload

Estimated tokens/second per GPU node, relative to software baseline (1.0×). KV ASIC offload assumes the chip sits on a PCIe Gen5 slot adjacent to each GPU.

Short ctx, low concurrency: baseline 1.0× · KV ASIC 1.3×
32k ctx, 100 sessions: baseline 1.0× · KV ASIC 1.9×
128k ctx, 100 sessions: baseline 1.0× · KV ASIC 3.0×
Agentic, shared prefixes: baseline 1.0× · KV ASIC >3.5×
KV ASIC figures are estimates with KV ASIC offload · Gain amplifies with context length and prefix reuse rate

Approximate Power Budget

Estimated TDP for a production KV ASIC at 7nm. Comparison point: H100 SXM TDP ≈ 700W.

DMA Engines (×32): 35W
Compression array (×16): 40W
Memory controllers: 25W
Metadata SRAM + dedup: 20W
Fabric interface + misc: 18W
Estimated total TDP: ~138W
~20% of H100 power · enables 1 KV ASIC per 4 GPUs in standard rack power budgets
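The budget arithmetic is easy to reproduce. The per-block figures and the 700W H100 comparison point are taken from the estimates above; the dictionary keys and variable names are only labels for this sketch.

```python
POWER_W = {
    "DMA engines (x32)": 35,
    "Compression array (x16)": 40,
    "Memory controllers": 25,
    "Metadata SRAM + dedup": 20,
    "Fabric interface + misc": 18,
}
H100_TDP_W = 700

total_w = sum(POWER_W.values())
share_of_one_gpu = total_w / H100_TDP_W              # one ASIC vs one GPU
overhead_per_four_gpus = total_w / (4 * H100_TDP_W)  # one ASIC shared by four GPUs

print(f"total: {total_w} W")
print(f"vs one H100: {share_of_one_gpu:.1%}")
print(f"added to a 4-GPU group: {overhead_per_four_gpus:.1%}")
```

At one ASIC per four GPUs, the added power is roughly 5% of the GPU power in that group.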

Comparison: ASIC vs. Software-Only vs. CXL-Only

Capability | Software Only | CXL Expansion | KV ASIC
Prefix deduplication | SW, high CPU cost | None | HW, <100ns
KV compression | SW, latency hit | None | HW, inline, 1.2 TB/s
Eviction policy | LRU / custom SW | Generic LRU | Decode-urgency-aware
Prefetch scheduling | None / manual | None | Urgency model, HW
Topology awareness | None | Partial (CXL only) | Full hierarchy
Session isolation / QoS | SW, coarse | None | Per-session SLA
Observability | SW tracing | None | HW perf counters
Complexity / risk | Low | Low | High (new silicon)

The gains are real — but they compound only when the workload demands them. At short context and low concurrency, the ASIC is an expensive NIC. At 128k context and high concurrency, it may be the difference between viable and not.

— architectural tradeoff summary

Memory Timing Deep Dive

Every claim about latency hides a ladder of sub-operations. Here is what actually happens, in order, when a GPU decode step requests a KV block that has been demoted to CXL:

KV Block Retrieval: Full Timing Breakdown (CXL warm tier)
1. GPU issues TLP miss: ~100 ns
2. ASIC metadata SRAM lookup: ~50 ns
3. DMA channel arbitration: ~30 ns
4. CXL 3.0 read request issued: ~50 ns
5. CXL fabric traversal: ~800 ns (round trip)
6. CXL-attached DDR5 read: ~500 ns
7. Inline decompression (INT4→FP16): ~200 ns
8. DMA write to GPU HBM: ~400 ns
9. Completion interrupt to GPU: ~80 ns
Total (cold): ~2.2 µs end-to-end · prefetch hides ~1.6 µs of this behind compute

Two microseconds is a long time at GPU speeds: at roughly 1 PFLOP/s of dense FP16 throughput, an H100 can execute on the order of two billion floating-point operations in that window. The prefetch scheduler's job is to ensure that the CXL read (steps 5–6, the longest segment) is in flight before the GPU finishes the preceding decode step. When it succeeds, the effective latency seen by the GPU is just the final DMA write: ~400 ns. That is the difference between a stall and transparency.
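The ladder can be added up directly. The step latencies are the ones in the timing breakdown; the assumption that a successful prefetch overlaps exactly steps 2 through 7 is this sketch's own reading, chosen because those steps sum to the ~1.6 µs the breakdown says is hidden.

```python
STEPS_NS = [
    ("GPU issues TLP miss", 100),
    ("ASIC metadata SRAM lookup", 50),
    ("DMA channel arbitration", 30),
    ("CXL 3.0 read request issued", 50),
    ("CXL fabric traversal (round trip)", 800),
    ("CXL-attached DDR5 read", 500),
    ("Inline decompression (INT4 to FP16)", 200),
    ("DMA write to GPU HBM", 400),
    ("Completion interrupt to GPU", 80),
]

total_ns = sum(ns for _, ns in STEPS_NS)

# Assumption: a prefetch issued early enough overlaps steps 2..7 with compute.
hidden_ns = sum(ns for _, ns in STEPS_NS[1:7])
exposed_ns = total_ns - hidden_ns

print(f"cold path: {total_ns} ns, hidden: {hidden_ns} ns, exposed: {exposed_ns} ns")
```

Under that reading the exposed remainder is dominated by the 400 ns DMA write, with the TLP miss and completion interrupt making up the rest.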

The NVMe cold tier is worse by well over an order of magnitude: a full NVMe read adds ~100 µs, roughly 45× the ~2.2 µs CXL path. This is why archived sessions must be explicitly warmed into CXL before a session resumes, not fetched on-demand. The prefetch scheduler handles this: when a session is scheduled to resume, a warm-up prefetch is issued 500 µs in advance.

The Programming Model

Hardware is only as useful as the software contract above it. The KV ASIC exposes four surfaces to the inference stack:

KV ASIC Programming Model: Interface Layers (top to bottom)
Inference Runtime (vLLM · SGLang · TGI · custom): issues kv_session_begin / kv_block_fetch / kv_session_end · speaks the KV semantic API and knows nothing about tiers
KV Session API (userspace lib: libkvcache.so · CUDA extension): session_create / prefix_register / block_hint · mmap'd control registers · zero-copy hints
KV ASIC Kernel Driver (kv_asic.ko, Linux): PCIe BAR mapping · IOMMU setup · interrupt routing · MMIO · exposes /dev/kvasic0 · perf counter fd
KV-1 ASIC Hardware: PCIe BAR0 (control) · BAR2 (telemetry) · P2P DMA · MMIO registers · doorbell pages · MSI-X · 256 interrupt vectors · 1 per DMA channel
Key ops:
kv_session_begin(model_id, qos_tier) → session_handle
kv_block_fetch(session, layer, head_range, token_span) → dma_fence
kv_prefix_register(hash, ptr, len) → prefix_id
kv_session_end(handle, archive_policy)

The key design decision in the API is that the inference runtime speaks in semantics, not addresses. It calls kv_block_fetch(session, layer, head_range, token_span) — and the ASIC resolves that semantic request to a physical address, selects the tier, initiates the DMA, applies decompression if needed, and returns a fence. The runtime never knows or cares whether the block came from HBM, CXL, or NVMe.

This separation is load-bearing. It means the eviction policy, compression strategy, and tier placement can all be changed — or learned — without modifying the inference runtime. The ASIC is the policy engine; the runtime is the policy consumer.
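The separation can be made concrete with an in-memory mock. The three operation names and their parameters come from the key ops listed for the API; `KvAsicMock`, `DmaFence`, the `_placement` table, and `decode_step` are hypothetical and exist only for this sketch. It is not a binding to any real driver.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class DmaFence:
    ready: bool
    tier: str      # kept for telemetry; the runtime is not meant to branch on it


@dataclass
class KvAsicMock:
    """In-memory stand-in for the semantic API."""
    _ids: count = field(default_factory=count)
    _sessions: dict = field(default_factory=dict)
    _placement: dict = field(default_factory=dict)   # (session, layer) -> tier

    def kv_session_begin(self, model_id, qos_tier):
        handle = next(self._ids)
        self._sessions[handle] = (model_id, qos_tier)
        return handle

    def kv_block_fetch(self, session, layer, head_range, token_span):
        if session not in self._sessions:
            raise KeyError("unknown session handle")
        # Tier resolution happens here, invisibly to the caller.
        tier = self._placement.get((session, layer), "HBM")
        return DmaFence(ready=True, tier=tier)

    def kv_session_end(self, handle, archive_policy):
        self._sessions.pop(handle)
        return archive_policy


def decode_step(asic, session, layer):
    """Runtime-side code: asks for a block by meaning, never by address."""
    fence = asic.kv_block_fetch(session, layer, head_range=(0, 8), token_span=(0, 1024))
    return fence.ready


asic = KvAsicMock()
session = asic.kv_session_begin("model-a", qos_tier=1)
before = decode_step(asic, session, layer=0)
asic._placement[(session, 0)] = "CXL"        # policy moves the block to another tier
after = decode_step(asic, session, layer=0)  # runtime code is unchanged
```

The runtime-side function is identical before and after the placement change, which is the property that lets eviction and tiering policy evolve independently.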

Session Lifecycle

Every KV session passes through five states. The ASIC enforces valid transitions in hardware — an invalid transition (e.g. fetching a block from an ARCHIVED session without first issuing a warm-up) returns an error code and a DMA fence that resolves to a zero block, preventing silent corruption.

Session State Machine
States: INIT (allocating) · PREFILL (HBM hot) · DECODE (protected) · SUSPENDED (CXL/DDR) · ARCHIVED (NVMe/obj)
Resume path: warm-up prefetch
DECODE state → eviction protected in hardware · invalid transitions → error fence, zero block returned
Prefix blocks shared across sessions → ref-counted · freed only when refcount reaches zero
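A software model of the state machine shows how an invalid request can fail safely. The five state names are from the article; the exact edge set in `ALLOWED`, the rule for which states permit a fetch, and the `Session` class are assumptions made for this sketch.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    PREFILL = "PREFILL"
    DECODE = "DECODE"
    SUSPENDED = "SUSPENDED"
    ARCHIVED = "ARCHIVED"


# Assumed edges: a linear forward path plus resume paths back from colder tiers.
ALLOWED = {
    State.INIT: {State.PREFILL},
    State.PREFILL: {State.DECODE},
    State.DECODE: {State.SUSPENDED},
    State.SUSPENDED: {State.DECODE, State.ARCHIVED},
    State.ARCHIVED: {State.SUSPENDED},     # warm-up prefetch back into CXL
}

# Assumed: blocks are fetchable only once they exist in HBM or CXL.
FETCHABLE = {State.PREFILL, State.DECODE, State.SUSPENDED}


class Session:
    def __init__(self):
        self.state = State.INIT

    def transition(self, target):
        """Apply a transition if it is legal; otherwise leave state untouched."""
        if target not in ALLOWED[self.state]:
            return False
        self.state = target
        return True

    def fetch_block(self, block_bytes=16):
        """Return (ok, data); an illegal fetch yields an error and a zero block."""
        if self.state not in FETCHABLE:
            return False, bytes(block_bytes)
        return True, b"\x01" * block_bytes   # placeholder payload


s = Session()
path_ok = [s.transition(t) for t in (State.PREFILL, State.DECODE)]
skip_rejected = s.transition(State.ARCHIVED)          # DECODE cannot jump to ARCHIVED
s.transition(State.SUSPENDED)
s.transition(State.ARCHIVED)
cold_fetch = s.fetch_block()                           # no warm-up issued yet
s.transition(State.SUSPENDED)                          # warm-up
warm_fetch = s.fetch_block()
```

Returning a zero block alongside the error means a caller that ignores the error still reads well-defined data rather than stale memory.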

Signal Integrity & Physical Constraints

At 7nm with PCIe Gen5 signalling, the chip's physical design has non-trivial constraints. These aren't theoretical — they determine what the floorplan can and cannot look like.

Interface | Signal rate | Constraint | Mitigation
PCIe Gen5 | 32 GT/s per lane | Trace length matching ±5 mil, EQ training on link-up | On-die retimer, adaptive equaliser
CXL 3.0 | 32 GT/s (shared PHY with PCIe) | Protocol negotiation adds ~10 µs at link-up; cache coherency protocol overhead | Dedicated CXL arbiter; FLIT-mode to reduce protocol overhead by ~40%
HBM3e | ~1.2 TB/s per stack | Must be within 2mm of die for signal integrity; bump pitch 55 µm | 2.5D interposer (TSMC CoWoS or equivalent); up to 4 HBM stacks
DDR5 | ~51 GB/s per 64-bit channel at 6400 MT/s | Maximum trace length 5 inches at 6400 MT/s; power delivery noise | On-package decoupling; up to 8 channels (~410 GB/s aggregate)
NVMe P2P DMA | PCIe Gen4×4 per SSD | P2P DMA bypasses host CPU/DRAM; requires IOMMU passthrough | BAR-to-BAR DMA; DMA completion interrupt via MSI-X

The HBM constraint is the most significant. Placing HBM stacks on a 2.5D interposer (CoWoS-style) adds ~$800–1,200 per unit to packaging cost, but it is the only practical way to achieve the ~3 TB/s bandwidth the active decode tier needs. Without it, the chip's hot tier degrades to DDR5 speeds (roughly 400–500 GB/s), which is insufficient for high-concurrency long-context serving.

An alternative design — using HBM only on the GPU and serving the ASIC exclusively from CXL — is viable at lower concurrency but collapses under 128k-context loads where the GPU's HBM fills entirely with active decode state and the ASIC has nowhere to stage warm blocks fast enough.

Multi-Chip Topology

A single KV ASIC can serve 2–4 GPUs depending on context length and concurrency. In a large cluster, multiple ASICs form a fabric — and the KV metadata must be coordinated across them. Here is the reference cluster topology for a 32-GPU node:

32-GPU Node: KV ASIC Cluster Topology
KV-ASIC 1: GPUs 1–4
KV-ASIC 2: GPUs 5–8
KV-ASIC 3: GPUs 9–12
KV-ASIC 4: GPUs 13–16
KV-ASIC 5: GPUs 17–20
KV-ASIC 6: GPUs 21–24
KV-ASIC 7: GPUs 25–28
KV-ASIC 8: GPUs 29–32
Shared CXL Pool: 8 TB DDR5 · ~800 GB/s aggregate · prefix pool shared across all ASICs
NVMe / Object Tier: persistent sessions · 128+ TB · P2P DMA from any ASIC
Metadata Coordinator: cross-ASIC prefix registry · Raft consensus

The metadata coordinator is the most architecturally sensitive component in the cluster. It maintains a distributed registry of all live prefix blocks so that a session on ASIC 3 can reuse a prefix block that was first created by ASIC 7. Without it, prefix deduplication is local-only and the gain degrades from cluster-wide to per-chip.

The coordinator runs Raft consensus across three replicas, so a write commits once a majority (two of three) has acknowledged it. Typical write latency is ~200 µs at 10 Gbps interconnect, which is fast enough for session creation (which happens once) but too slow for per-block lookup. Block lookups therefore use a local SRAM cache of the prefix registry, refreshed asynchronously from the coordinator. Cache staleness is bounded to 1 ms, which means a prefix created in the last millisecond may not yet be shared cluster-wide. That is an acceptable tradeoff for near-zero lookup latency.
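The bounded-staleness cache can be modelled in a few lines. This sketch makes several assumptions: a plain dictionary stands in for the Raft-backed registry, time is passed in explicitly as microseconds, and `PrefixRegistryCache` with its `tick` and `lookup` methods is a hypothetical interface.

```python
class PrefixRegistryCache:
    """Local snapshot of the cluster prefix registry, refreshed every 1 ms."""

    def __init__(self, coordinator, max_staleness_us=1000):
        self.coordinator = coordinator            # stand-in for the replicated registry
        self.max_staleness_us = max_staleness_us
        self._snapshot = dict(coordinator)
        self._refreshed_at_us = 0

    def tick(self, now_us):
        """Background refresh: re-copy once the snapshot is a full period old."""
        if now_us - self._refreshed_at_us >= self.max_staleness_us:
            self._snapshot = dict(self.coordinator)
            self._refreshed_at_us = now_us

    def lookup(self, prefix_hash):
        """Hot-path lookup reads only the local snapshot."""
        return self._snapshot.get(prefix_hash)


registry = {}
cache = PrefixRegistryCache(registry)

registry[0xABC] = ("asic7", 42)     # another ASIC registers a prefix at t = 200 us
cache.tick(now_us=200)
too_fresh = cache.lookup(0xABC)     # not visible yet: snapshot is still current

cache.tick(now_us=1000)
visible = cache.lookup(0xABC)       # visible after the next refresh
```

A miss during the staleness window costs only a duplicate allocation, not correctness, which is why the lookup path can afford to skip the coordinator entirely.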

Failure Modes & Correctness Risks

This section exists because every serious infrastructure proposal must enumerate its ways of breaking. These are not edge cases — they are the expected failure modes that a production deployment will encounter.

Failure Mode | Trigger | Impact | Mitigation
Stale prefix block reuse | Hash collision or cache staleness causes wrong prefix to be served | Silent inference corruption: model generates from wrong context | Full prefix hash verification on first use per session; ECC-protected SRAM; collision count telemetry
DMA fence timeout | NVMe read takes >5 ms; fabric congestion; ASIC internal deadlock | Decode stall; timeout escalates to CPU; session may be killed | Watchdog timer per DMA channel; fallback to CPU-managed copy; circuit breaker on ASIC timeout rate
Metadata coordinator partition | Network split between coordinator replicas | Prefix sharing degrades to per-chip; no cross-ASIC reuse | Raft with split-brain prevention; ASICs continue with local-only dedup; graceful degradation, not crash
Compression quality regression | Model update changes head volatility profile; INT4 applied to now-high-volatility heads | Perplexity degradation; potentially silent (no hard error) | Per-model codec profile registered at session create; profile version checked on block decompress
Eviction of decode-active block | Bug in urgency score computation; race between score update and eviction decision | Decode stall or fetch of zero block; visible output corruption | Hardware lock on DECODE-state session; eviction engine cannot mark protected blocks; verified in RTL
HBM ECC failure | Cosmic ray or wear-out on on-package HBM | Single-bit: corrected silently. Multi-bit: session killed, block invalidated | SECDED ECC on all HBM; multi-bit error triggers session fence and re-prefill from cold storage
CXL link reset | PCIe/CXL link retraining during active decode | All in-flight DMA aborted; warm tier unavailable for ~50 ms retrain window | In-flight DMA list persisted to SRAM; re-issued on link restore; hot tier expands to absorb during retrain
Engineering Priority
The two silent failure modes — stale prefix reuse and compression quality regression — are the hardest to detect in production because they produce wrong output without raising errors. Both require explicit detection logic: prefix hash verification at the session boundary, and per-model codec profile pinning. Neither is expensive to implement; both are easy to omit in a first implementation.
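Codec profile pinning, one of the two detection mechanisms named here, is small enough to sketch. The type and method names are hypothetical; the only idea taken from the article is that a profile version is recorded at session create and checked when a block is decompressed.

```python
from dataclasses import dataclass


class CodecProfileMismatch(Exception):
    """Raised instead of silently decoding with the wrong profile."""


@dataclass(frozen=True)
class CompressedBlock:
    payload: bytes
    profile_version: int     # version of the head-volatility profile used to encode


class SessionCodec:
    def __init__(self, model_id, profile_version):
        self.model_id = model_id
        self.profile_version = profile_version    # pinned at session create

    def compress(self, payload):
        # Real compression is out of scope; only the version tag matters here.
        return CompressedBlock(payload, self.profile_version)

    def decompress(self, block):
        if block.profile_version != self.profile_version:
            raise CodecProfileMismatch(
                f"block encoded with profile v{block.profile_version}, "
                f"session pinned to v{self.profile_version}"
            )
        return block.payload


old_session = SessionCodec("model-a", profile_version=1)
new_session = SessionCodec("model-a", profile_version=2)   # after a model update
block = old_session.compress(b"kv-bytes")
same_profile = old_session.decompress(block)

try:
    new_session.decompress(block)
    mismatch_caught = False
except CodecProfileMismatch:
    mismatch_caught = True
```

The check turns a silent quality regression into a hard, attributable error at the moment a stale block is touched.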

Silicon Roadmap: KV-1 → KV-3

A single chip generation is not a product. The roadmap matters for the investment case. Here is a plausible three-generation arc, assuming a 2-year cadence and stable process node progression:

Generation | Process | New Capabilities | Target Workload
KV-1 (this chip) | 7nm | HBM3e hot tier · INT4/FP8 compression · prefix dedup · PCIe Gen5/CXL 3.0 | 128k context, 100+ concurrent sessions, single node
KV-2 | 4nm / 3nm | On-die ML-based eviction predictor · learned compression codecs · cluster-wide prefix fabric · NVLink 5 support · 2× DMA channels | 1M+ token context windows · multi-node prefix sharing · agentic memory persistence
KV-3 | 2nm / 1.8nm | In-memory attention scoring for prefix reuse decisions · KV-native compression at FP4 · cross-datacenter session handoff · hardware-enforced privacy domains | Persistent agent memory at datacenter scale · multi-tenant isolation · cross-region session continuity

The most significant jump is KV-2's on-die ML predictor for eviction and prefetch. Instead of a fixed weighted-logit scorer, KV-2 would run a small (~1M parameter) learned model that predicts decode urgency from session features — improving prefetch accuracy from ~75% to an estimated ~92% hit rate on realistic workload mixes. This predictor is updated offline from telemetry traces and loaded into on-die SRAM at boot.

KV-3's hardware privacy domains are an enterprise requirement: in multi-tenant serving, one tenant's KV state must be cryptographically isolated from another's. At the software layer this is expensive; in hardware it reduces to address translation with per-domain keys, which is a solved problem in security processors.

What the ASIC Does Not Replace

It is worth being explicit about scope, because infrastructure bets have a way of expanding past their original boundary.

Not replaced: the inference runtime. vLLM, SGLang, and TGI own scheduling, batching, and the forward pass. The ASIC is subordinate to the runtime, not a replacement for it.
Not replaced: the GPU accelerator. All attention computation, FFN, and sampling remain on GPU. The ASIC only handles KV memory; it never touches activations or model weights.
Not replaced: model quality decisions. How much compression is acceptable is a model-specific empirical question. The ASIC provides the mechanism; the operator sets the policy.
Not replaced: request routing. Which GPU processes which request is still decided by the orchestrator (Kubernetes, Dynamo, etc.). The ASIC improves the memory layer beneath that decision.

What Comes Next

Part 4 turns to the software stack: the full KV semantic API specification, the session metadata protocol between runtimes and the ASIC driver, and the changes needed in vLLM and SGLang to consume a hardware KV layer without degrading their existing scheduling correctness.

The silicon described here is the physical contract. The protocol above it is the programming contract. Getting the protocol right is harder — because it requires consensus across the inference runtime ecosystem, not just a correct RTL implementation.