Synthesis Essay · Systems Architecture

The Whole Stack:
Everything Between a Transistor and a Token

Published · 17 min read

Eighty-plus essays on AI infrastructure eventually converge on one insight: every performance number, every cost figure, every architectural decision is a consequence of a smaller, more physical decision made lower in the stack. This essay connects all of them — from process nodes to token pricing — and shows where the binding constraint actually lives in 2026.

Manish KL
AI Infrastructure · Memory Systems · Agent Architecture

Why a stack perspective?

Most AI infrastructure writing treats layers of the stack as independent subjects. A post about HBM bandwidth. A post about vLLM scheduling. A post about MoE routing. These are all useful — but they create a fragmented picture where the connections between layers are invisible.

The connections are where the insight lives. The reason your inference is slow at long context is not a software scheduling problem — it is a consequence of HBM bandwidth physics, which is itself a consequence of TSV density in HBM packaging, which is itself constrained by the 40 µm micro-bump pitch of the silicon interposer. The reason quantization works is not "fewer bits, less memory" — it is that moving smaller bytes across the HBM interface consumes fewer joules per bit, which matters because 55% of GPU power during inference is data movement, not arithmetic.

This essay traces a single token from transistor physics to the dollar figure on your invoice, stopping at each layer to show the constraint that layer imposes on the one above it. By the end, the binding constraint in 2026 should be unmistakable — and it is not where most people think it is.

Layer 1

Silicon: Where Everything Physical Is Decided

The conversation about AI hardware almost always starts at the GPU. It should start three levels lower: at the transistor and the process node. Every number in every GPU spec sheet — TFLOP/s, bandwidth, power — is downstream of decisions made at the fab.

What "3nm" actually means

TSMC's N3 process node is not 3 nanometres in any physical dimension — the actual gate length is approximately 12 nm. The node name is a marketing label for a generation of process improvements: tighter metal pitch, reduced cell height, and (at N2) a new transistor architecture called Gate-All-Around (GAA) that wraps the gate around all four sides of the channel instead of three.

What each node shrink actually delivers for AI chips: roughly 1.9× transistor density (N4 → N2), 15–20% power reduction per operation, and 5–10% frequency improvement at the same power. The density improvement is the one that matters most — it translates directly into more tensor cores, more on-chip SRAM, or both, within the same die area. The H100 runs on TSMC 4N, an NVIDIA-customised node derived from TSMC's N5/N4 family, not a half-step to N3. Its successor generations will run on N3P and N2, gaining approximately 1.9× more transistors per mm² with each jump.

The reticle limit: why chiplets exist

The maximum die size is constrained by the lithography reticle — approximately 858 mm² for TSMC's process. The H100 die is 814 mm², already at the practical limit. Getting more transistors means either waiting for the next process node (1.9× density) or going multi-chip. This is why chiplet packaging — connecting multiple smaller dies through a silicon interposer or advanced packaging — has become architecturally necessary, not just economically attractive. The H100 itself already uses a 2.5D CoWoS interposer to connect the GPU die to six HBM3 stack sites (five of them enabled on the 80 GB part).

Moore's Law is slowing — what it costs AI

The historical 2× density doubling every 2 years has stretched to approximately 2× every 3–4 years. Each node transition also costs more: a leading-edge fab (N3 or better) now costs $20B+ to build, versus $1B for a 130nm fab in 2000. This cost increase flows through to chip pricing, which flows through to server cost, which flows through to inference cost per token. The free performance lunch from silicon scaling is over. Future gains come from packaging (more chiplets, better interposers), architecture (better use of existing transistors), and algorithms (more efficient models).

Layer 2

Memory: The Real Constraint

The GPU's memory system — not its arithmetic — is the binding constraint for AI inference. This is the single most important structural fact in the field, and it is routinely underappreciated because GPU marketing leads with TFLOP/s numbers.

HBM: what it is physically

High Bandwidth Memory is not a faster version of DDR. It is a fundamentally different packaging architecture: multiple DRAM dies stacked vertically and connected through approximately 10,000 Through-Silicon Vias (TSVs) per die, placed millimetres from the GPU die on a silicon interposer. This eliminates the slow PCB trace path of DDR and replaces it with a 1,024-bit-wide parallel bus through micron-diameter copper conductors.

HBM3e in the H200 achieves 3.35 TB/s — 33× the bandwidth of DDR5. The energy cost is 3.9 pJ/bit vs 15 pJ/bit for DDR5. This bandwidth and efficiency advantage is why HBM is the only viable memory for AI inference despite costing $10–15/GB vs $3–5/GB for DDR5.

Why inference is 99% memory-wait

During autoregressive decode — which dominates real inference latency — the GPU performs a matrix-vector multiply (not matrix-matrix). The arithmetic intensity is approximately 1.6 FLOP/byte. The hardware ridge point (peak compute divided by peak bandwidth: 989 TFLOP/s ÷ 3.35 TB/s) is ~295 FLOP/byte. Decode operates at roughly 1/180th the intensity needed to keep compute busy.

70B model, FP8, decode step:
FLOPs required: ~118 GFLOPs
Bytes read from HBM: ~72 GB (weights + KV cache)
Compute time: 0.12 ms @ 989 TFLOP/s
Memory time: 21.5 ms @ 3.35 TB/s
→ GPU is waiting for memory 99.4% of each decode step
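The arithmetic above is simple enough to reproduce. A minimal sketch using the essay's assumed figures (118 GFLOPs, 72 GB, 989 TFLOP/s, 3.35 TB/s), not measured values:

```python
# Decode-step roofline numbers for a 70B FP8 model (assumed figures from the text).
flops = 118e9          # FLOPs per decode step
bytes_moved = 72e9     # weights + KV cache read from HBM, bytes
peak_flops = 989e12    # assumed tensor-core peak, FLOP/s
hbm_bw = 3.35e12       # HBM bandwidth, bytes/s

compute_ms = flops / peak_flops * 1e3
memory_ms = bytes_moved / hbm_bw * 1e3
intensity = flops / bytes_moved          # achieved FLOP/byte
wait_frac = 1 - compute_ms / memory_ms   # fraction of the step spent memory-waiting

print(f"compute {compute_ms:.2f} ms, memory {memory_ms:.1f} ms")
print(f"intensity {intensity:.2f} FLOP/byte, memory-wait {wait_frac:.1%}")
```

Any reduction in `bytes_moved` — quantization, KV compression — shrinks the 21.5 ms memory time directly; nothing that raises `peak_flops` touches it.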

The idle GPU is not a bug or a scheduling failure. It is a structural consequence of memory physics. The tensor cores finish their work in 0.12 ms and then have nothing to do for 21 ms while HBM delivers the next set of weights. This is why every technique that reduces bytes read per token — quantization, weight sharing, KV compression — improves decode throughput regardless of arithmetic speed.

The memory hierarchy: five tiers, five physics regimes

Every layer of the hierarchy has fundamentally different physics:

| Tier | Technology | Bandwidth | Latency | Capacity | $/GB |
|---|---|---|---|---|---|
| On-chip SRAM | Register file + L1/L2 | 10–50 TB/s | 0.5–2 ns | 50–500 MB | $5,000+ |
| GPU HBM3e | Stacked DRAM + TSV | 3.35 TB/s | ~30 ns | 80–141 GB | $10–15 |
| DPU DDR5 | DRAM (host / DPU) | ~48 GB/s (P2P) | ~80 ns | 32–512 GB | $3–5 |
| CXL Memory | CXL 3.0-attached DRAM | ~32–64 GB/s | ~300–850 ns | 1–11 TB | $2–4 |
| NVMe SSD | NAND flash | ~12–28 GB/s | ~100 µs | TBs | $0.08–0.15 |

Moving data one tier outward costs roughly 10× in latency and 10× in energy per bit. Every tier boundary crossing is a tax. The architecture of efficient inference is the architecture of minimising tier boundary crossings for the hottest data.
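As a rough model of those crossings, the time to fetch data from a tier is its latency plus bytes divided by bandwidth. A sketch using approximate midpoints from the table above (assumed values, not benchmarks):

```python
TIERS = {  # bandwidth (bytes/s) and latency (s): rough midpoints of the table above
    "sram": {"bw": 30e12,   "lat": 1e-9},
    "hbm":  {"bw": 3.35e12, "lat": 30e-9},
    "ddr5": {"bw": 48e9,    "lat": 80e-9},
    "cxl":  {"bw": 48e9,    "lat": 500e-9},
    "nvme": {"bw": 20e9,    "lat": 100e-6},
}

def fetch_time(tier, nbytes):
    """Seconds to pull nbytes from a tier: access latency + transfer time."""
    t = TIERS[tier]
    return t["lat"] + nbytes / t["bw"]

for name in TIERS:  # cost of fetching one 2 MB KV page from each tier
    print(f"{name:5s} {fetch_time(name, 2e6) * 1e6:9.2f} µs")
```

The roughly order-of-magnitude jumps between adjacent rows are the "tax" the paragraph above describes.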

Layer 3

The GPU: 132 SMs and What They Actually Do

The GPU is described endlessly in marketing but rarely explained mechanically. Understanding what is actually happening inside an SM — and why it is happening that way — makes every inference optimisation immediately comprehensible.

The SM: a parallel processor optimised for throughput, not latency

A CPU core devotes most of its transistor budget to managing complexity: out-of-order buffers, branch predictors, deep caches. A GPU SM makes the opposite trade: almost no complexity management, but an enormous number of simple arithmetic units and a warp-based threading model designed to hide memory latency through parallelism.

When an H100 SM's warp issues a memory load that takes 700 cycles to return from HBM, the warp scheduler immediately selects another of the 64 resident warps that has all its operands ready. Warp switching costs zero cycles — the register file holds all 64 warps' state simultaneously in its 256 KB, so there is no save/restore. By the time the scheduler has cycled through all ready warps, the original load has returned. The SM never actually stalls — this is latency hiding through thread-level parallelism, and it is why GPUs tolerate the high HBM latency that would paralyse a CPU.
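The latency-hiding claim can be checked on the back of an envelope. Assuming the 700-cycle HBM load above and, hypothetically, about 12 cycles of issueable work per warp between dependent loads (an illustrative number, not a profiled one):

```python
import math

hbm_latency_cycles = 700   # assumed HBM round-trip, in SM clock cycles
work_cycles_per_warp = 12  # hypothetical independent work per warp between loads
resident_warps = 64        # maximum warps resident on an H100 SM

# Warps needed so that cycling through ready warps covers the load latency.
warps_needed = math.ceil(hbm_latency_cycles / work_cycles_per_warp)
print(warps_needed, "warps needed;",
      "latency hidden" if resident_warps >= warps_needed else "SM stalls")
```

With these assumptions the 64 resident warps are just enough, which is why occupancy (resident warps per SM) matters so much for memory-heavy kernels.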

Tensor cores: 8,192 ops per cycle

A CUDA core performs one FP32 multiply-add per cycle. A 4th-generation Hopper tensor core performs D = A×B + C where A, B, C, D are 16×16 matrix tiles — 4,096 fused multiply-adds, or 8,192 FP8 floating-point operations, in a single clock cycle. The H100 has 528 tensor cores across its 132 SMs, giving 989 TFLOP/s FP8 peak vs ~67 TFLOP/s from CUDA cores alone.

This matters for prefill (compute-bound, uses tensor cores efficiently) but not for decode (memory-bound, tensor cores idle 99% of each step). The headline TFLOP/s number is almost entirely irrelevant for characterising inference performance at typical serving batch sizes.

The GPU's tensor cores are a remarkable engineering achievement — and they are nearly irrelevant during autoregressive decode. Prefill uses them well. Decode doesn't. And decode is where users wait.

Layer 4

Inference: The Architecture Is Being Rebuilt

Inference architecture has undergone more fundamental change in two years than the previous decade. Three shifts define 2024–2026: prefill/decode disaggregation, KV cache tiering, and the adoption of architectures that make long context tractable.

Prefill and decode have opposite hardware needs

Prefill — processing the input prompt — is compute-bound. Arithmetic intensity is 100–1,000 FLOP/byte depending on sequence length. Tensor cores are fully utilised. Decode — generating output tokens — is memory-bandwidth-bound at ~1.6 FLOP/byte, with tensor cores idle 99% of the time.

Running both on the same GPU is a hardware compromise that satisfies neither. Disaggregated inference separates them into two pools: prefill nodes (compute-optimised, can use more expensive GPUs or future ASICs with more TFLOP/s) and decode nodes (bandwidth-optimised, want maximum HBM capacity and bandwidth). The KV cache is transferred from prefill to decode over NVLink or RDMA after prefill completes — once per request, adding ~88ms for a 70B model at 4K context over 400G Ethernet.

KV cache: the hidden scaling problem

For a 70B FP8 model, every token of context requires approximately 1.25 MB of KV cache storage. A single 128K-context session occupies ~160 GB — exceeding the H200's 141 GB HBM. At 500 concurrent sessions, total KV state reaches 80 TB.
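The sizing arithmetic behind those figures, using the essay's ~1.25 MB/token assumption:

```python
kv_per_token = 1.25e6   # bytes of KV cache per token (essay's 70B FP8 assumption)
context = 128_000       # tokens (~128K)
sessions = 500

per_session_gb = kv_per_token * context / 1e9
total_tb = per_session_gb * sessions / 1e3
print(f"one 128K session: {per_session_gb:.0f} GB of KV")   # exceeds 141 GB of HBM
print(f"{sessions} sessions: {total_tb:.0f} TB of KV state")
```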

The solution is tiered KV management: hot KV in HBM, warm KV in host DRAM or DPU buffer, cold KV in CXL-attached memory. The software that manages these tiers — deciding which pages are hot, when to prefetch, when to evict — is increasingly the critical path. A regret-aware eviction policy that tracks page access history and predicts future reuse probability can reduce effective KV miss rate by 30–40% vs LRU, directly improving throughput without any hardware changes.
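A toy sketch of the eviction idea (not the measured policy): score each page by an exponentially decayed access count and evict the lowest-scoring page instead of the least recently used one. The class and its decay constant are illustrative assumptions.

```python
from collections import defaultdict

class ReuseAwareCache:
    """Evict the page with the lowest decayed access frequency, not the LRU one."""

    def __init__(self, capacity, decay=0.8):
        self.capacity = capacity
        self.decay = decay
        self.pages = {}                  # page_id -> payload
        self.score = defaultdict(float)  # page_id -> decayed access count

    def touch(self, page_id, payload=None):
        # Age every score so stale history fades, then credit this access.
        for p in self.score:
            self.score[p] *= self.decay
        self.score[page_id] += 1.0
        if page_id not in self.pages and len(self.pages) >= self.capacity:
            victim = min(self.pages, key=lambda p: self.score[p])
            del self.pages[victim]
        self.pages[page_id] = payload

cache = ReuseAwareCache(capacity=2)
for page in ["A", "A", "B", "C"]:
    cache.touch(page)
print(sorted(cache.pages))  # ['A', 'C']: reused "A" survives; pure LRU would keep "B"
```

The O(n) decay loop is for clarity; a real tier manager would decay scores lazily, on access.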

Speculative decoding: using compute to buy memory time

Speculative decoding uses a small draft model to generate 4–8 candidate tokens, then verifies them all in a single forward pass of the large model. If the draft matches, all tokens are accepted at the cost of one forward pass. The gain: converting memory-bound sequential decode into occasionally compute-bound batch verification. The catch: draft and verification models must share KV state efficiently, and rollback on speculation failure creates HBM fragmentation that must be managed carefully to avoid throughput degradation.
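The expected gain can be estimated with a standard simplification the essay does not spell out: assume each draft token is accepted independently with probability alpha, so a verification pass yields the accepted draft prefix plus the one token the large model always produces itself.

```python
def tokens_per_pass(alpha, k):
    """Expected tokens per verify pass: accepted draft prefix + the model's own token.

    alpha: assumed per-token draft acceptance probability; k: draft length.
    """
    return sum(alpha**i for i in range(1, k + 1)) + 1.0

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {tokens_per_pass(alpha, k=6):.2f} tokens/pass")
```

At alpha = 0.8 a pass yields roughly four tokens for one large-model forward pass, which is where the memory-time savings come from.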

Layer 5

Agents: The Infrastructure Problem Nobody Saw Coming

Every piece of inference infrastructure described above was designed for one model, one request. The production AI workload in 2026 is a directed graph of model calls — orchestrators calling planners calling retrievers calling verifiers. This topology change invalidates most of the single-model optimisation assumptions.

The CPU becomes the bottleneck

In a single-model inference system, the GPU is the bottleneck and the CPU is barely used. In an agent pipeline, the CPU must handle: tool response ingestion, JSON validation, context assembly across hops, routing logic, state management for hundreds of concurrent sessions, and NCCL/RDMA orchestration between model calls. CPU utilisation in production agent systems commonly reaches 70–80% while GPU utilisation falls to 20–40%.

The DPU (Data Processing Unit) as Agent Memory Controller is the architectural answer: offload JSON validation, KV prefetch decisions, and GPU DMA programming to the DPU's ARM cores and hardware regex engines, removing the host CPU from the critical path. A BlueField-3 DPU can validate structured tool responses in hardware at 30M+ operations/second using pre-compiled regex automata — at 0.8W vs 15W per CPU core for software validation.
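A software analogue of that hardware path, sketched in Python: pre-compile a pattern that checks a tool response's shape before committing to a full JSON parse. The schema, field names, and function here are hypothetical illustrations, not a BlueField API.

```python
import re

# Pre-compiled shape check for the hot path; full JSON parsing only if it passes.
# The schema is a made-up example of a structured tool response.
TOOL_RESPONSE = re.compile(
    r'^\{"tool":"[a-z_]{1,32}","status":"(ok|error)","payload":.*\}$'
)

def fast_precheck(raw):
    """Cheap structural validation before any expensive parsing or model call."""
    return TOOL_RESPONSE.match(raw) is not None

print(fast_precheck('{"tool":"search","status":"ok","payload":{"hits":3}}'))  # True
print(fast_precheck('not json at all'))                                       # False
```

The hardware version runs the equivalent automaton in dedicated silicon, which is how it reaches tens of millions of validations per second at sub-watt power.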

The KV locality problem at pipeline scope

KV caches do not persist across model boundaries. When an orchestrator's output becomes a planner's input, the planner starts a completely new KV cache — the orchestrator's internal representations cannot be transferred. Every edge in the agent call graph is a KV locality reset and a full re-prefill of the downstream model. For a 7-hop agent pipeline where the RAG retrieval hop injects 256K tokens of context, the cumulative prefill cost is approximately 158× the cost of the original user request. Topology-aware routing — scheduling model calls to GPU nodes where the relevant KV prefixes are already warm — can recover 40–60% of this overhead.
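A minimal sketch of the routing idea: hash the shared context prefix so that repeat calls with the same prefix land on the node whose KV cache is already warm. Node names and the hashing scheme are illustrative assumptions; a production router would also weigh load and capacity.

```python
import hashlib

NODES = ["decode-0", "decode-1", "decode-2", "decode-3"]  # hypothetical decode pool

def route(prefix_tokens):
    """Map a shared context prefix to a fixed node so its warm KV gets reused."""
    digest = hashlib.sha256(repr(prefix_tokens).encode()).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

session_prefix = [101, 7, 42, 9000]           # e.g. system prompt + retrieved context
print(route(session_prefix) == route(session_prefix))  # True: placement is stable
```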

Layer 6

Economics: What a Token Actually Costs

The $/1M-token prices quoted by labs are not costs — they are prices. The actual infrastructure cost of an output token is derivable from first principles, and understanding the decomposition reveals which layers of the stack are most worth optimising.

The five-layer cost stack

| Cost layer | $/1M tokens (H200, 70B FP8, B=64) | % of total | Primary lever |
|---|---|---|---|
| Capital (GPU depreciation) | $60.6 | 35% | GPU utilisation rate, hardware generation |
| GPU power | $48.7 | 28% | Quantization (FP16→FP8 saves ~18%), power location |
| Cooling (facility PUE) | $17.4 | 10% | PUE (DLC 1.10 vs air-cooled 1.35) |
| NAND / storage | $13.9 | 8% | KV offload architecture, checkpoint strategy |
| Networking (NVLink, ToR) | $10.4 | 6% | Topology, prefill/decode disaggregation overhead |
| Labour + software | $22.7 | 13% | Automation, team efficiency |

Baseline: $174/1M output tokens. This is best-case (FP8, batch 64, 4K context, 65% utilisation, $0.07/kWh). Real production blended cost is 2–5× higher.
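Summing the table's components reproduces the baseline (values are the essay's best-case figures):

```python
COST_PER_1M = {  # $/1M output tokens, essay's best-case decomposition
    "capital (GPU depreciation)": 60.6,
    "GPU power": 48.7,
    "cooling (PUE)": 17.4,
    "NAND / storage": 13.9,
    "networking": 10.4,
    "labour + software": 22.7,
}
total = sum(COST_PER_1M.values())
print(f"${total:.0f} per 1M output tokens")  # → $174
```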

Context length is a superlinear cost multiplier

At 128K context, the KV cache alone exceeds H200's 141 GB HBM, requiring multi-GPU KV spanning or NVMe offload. Batch capacity drops from ~32 requests to 1–2. Cost per output token at 128K context is approximately 8.2× the cost at 4K context, for identical output length. At 1M context, the multiplier reaches ~56×.

This is why context-length-based pricing is economically necessary, not arbitrary. A lab charging flat $/1M output tokens regardless of context is cross-subsidising long-context requests with short-context revenue. As agentic workloads push average context length upward, this cross-subsidy becomes unsustainable.

The model efficiency insight

DeepSeek-V3 at 671B total / 37B active costs approximately the same to serve as a 40B dense model, with significantly greater representational capacity. A 60% reduction in active parameters — through better MoE design — reduces per-token inference cost by approximately 60%, independent of any hardware improvements. Model architecture is the dominant long-term cost lever, not hardware. Every new GPU generation delivers 1.5–2× throughput improvement. A better architecture can deliver 4–10× cost reduction in a single model release.

Where the Binding Constraint Lives in 2026

Tracing the stack from transistor to token, the binding constraint in each layer of the system is now clear:

1. Silicon layer: lithography and reticle limits

Slowing Moore's Law means the free density doubling every 2 years is gone. Die area is capped at ~800 mm². Future gains require chiplets and better packaging — CoWoS, EMIB, SoIC — not just node shrinks.

2. Memory layer: HBM bandwidth and KV capacity

3.35 TB/s is not enough for long-context decode. HBM4 at ~2 TB/s per stack (2026–2027) will help but not solve it. The deeper answer is reducing bytes-per-token through quantization, sparsity, and KV compression — not buying more bandwidth.

3. GPU layer: decode is memory-bound by physics

Tensor core throughput is irrelevant during decode. SRAM-first architectures (wafer-scale, near-memory compute) are architecturally motivated but economically premature. The current answer is batch-level parallelism and efficient memory orchestration.

4. Inference layer: orchestration overhead, not model math

For agentic workloads, the binding constraint has migrated from GPU HBM bandwidth to CPU orchestration overhead and KV tier management. The memory scheduler — not the model — is the new critical path.

5. Agent layer: topology-naive infrastructure

Agent pipelines are being served by single-model infrastructure that has no concept of pipeline topology. KV locality is lost at every model boundary. Tail latencies compound across hops: a request is slow if any hop is slow, so end-to-end latency behaves like the sum of per-hop tails rather than a single hop's worst case. This is year zero for agent-aware serving infrastructure.

6. Economics layer: model efficiency, not hardware

The biggest cost lever available today is model architecture. A 4× reduction in active parameters through better MoE design reduces serving cost by 4× regardless of what GPU generation you are on. Hardware scaling delivers 1.5–2× per generation. Architecture delivers 4–10×.

The 2026 binding constraint, stated plainly:

For most AI infrastructure teams, the binding constraint is not arithmetic throughput — it has not been arithmetic throughput since tensor cores arrived. It is not HBM bandwidth — though that is second. It is the software stack that manages data movement between memory tiers: which KV pages are hot, where they live, how they move, and how long it takes to get the right bytes to the right compute unit at the right time. The teams winning at inference economics in 2026 are the ones who have built memory schedulers that understand their workload, not the ones who bought the most H200s.

The Takeaway

You cannot optimise what you cannot trace. And you cannot trace a token's cost without understanding every layer it traverses from TSV to TCP.

The H100 runs on a process node delivering roughly 98M transistors/mm² (80 billion transistors on an 814 mm² die), built on a silicon interposer connecting six HBM3 stacks through tens of thousands of TSVs. The GPU's 132 SMs hide HBM latency through warp scheduling and perform matrix tile arithmetic in tensor cores. The inference stack converts matrix multiplies into tokens through a KV-cache-managed attention mechanism that is memory-bandwidth-limited 99% of the time. The agent stack adds orchestration overhead that the GPU was never designed to absorb. And at the top, economics converts joules and amortised silicon into a dollar figure per million tokens.

Every performance problem in this stack has a physical root cause two or three layers below where it manifests. Every optimisation opportunity is a constraint imposed by a lower layer that a higher layer has failed to exploit. The engineers who understand all six layers simultaneously are the ones who find the real leverage points — and in 2026, the real leverage is in the memory scheduler, not the model.

This essay draws from the author's series of infrastructure essays covering: process nodes and transistor physics, HBM architecture and TSV packaging, GPU microarchitecture (SM, warps, tensor cores), KV cache management and eviction policy, prefill/decode disaggregation, speculative decoding, agent topology and the DPU-as-AMC architecture, token cost modelling, and MoE inference kernel design. The full series is available at manishklach.github.io/writings.