A bottom-up cost model connecting HBM bandwidth, cooling kilowatts, NAND geometry, and interconnect physics to the number that actually matters: dollars per million output tokens.
The AI pricing war of 2025–2026 produced extraordinary numbers. GPT-4o output pricing fell from $15 per million tokens to $10, and then to $2.50, over eighteen months. Gemini Flash came in at $0.075. Deepseek-V3 at $1.10. Claude Sonnet at $3. These prices are real — they are what developers pay — but they are not costs. Nobody outside the infrastructure teams at the major labs knows what a token actually costs to produce.
This essay attempts a rigorous approximation. Not a guess based on quarterly earnings, but a bottom-up model built from the physical constraints of the hardware: what computation a single output token requires, what memory movement that computation demands, what power that movement consumes, and what that power costs when you include the full facility stack from the transformer on the street to the die on the board.
The reason this calculation is rarely done is that it requires simultaneously understanding five domains: GPU microarchitecture, power systems engineering, memory hierarchy physics, networking infrastructure, and financial modelling. Each domain has specialists. Almost nobody combines them. This essay tries to.
The key insight this essay argues for: the cost of a token is dominated not by arithmetic but by data movement. Understanding where the bytes move, how many times, and what each crossing of a memory boundary costs is the core of inference economics.
Before we can price a token, we need to understand what producing one requires. At decode time — which is where most serving cost lives — generating one output token from a transformer model requires a full forward pass: attention and feed-forward computation at every layer, against the weights and the cached keys and values.

For a dense transformer with L layers, hidden dimension d, and projection dimension d_ff, one decode step requires roughly L × (8d² + 4d·d_ff) FLOPs for the attention projections and FFN (approximately 2 FLOPs per parameter, or on the order of 140 GFLOP for a 70B model), plus the attention dot-products against every cached token.
At H200 peak throughput of 989 TFLOPS (FP8 tensor core), a single decode step takes roughly 0.14 milliseconds of pure compute (140 GFLOP ÷ 989 TFLOPS). But decode steps don't run at peak TFLOPS. They are memory-bandwidth-limited. The actual bottleneck is reading model weights and the KV cache from HBM for every token generated.
For a decode step, the memory that must be read per token includes: all model weight matrices (read once per step), the KV cache for every token in the context (read for attention), and activation tensors between layers.
| Memory Object | Size (70B, FP8) | Reads per decode step | Notes |
|---|---|---|---|
| Model weights | 70 GB | 1× | All weight matrices, QKV projections, FFN |
| KV cache (2K context) | ~2.2 GB | 1× | Keys and values for all prior tokens, all layers |
| KV cache (32K context) | ~35 GB | 1× | Scales linearly with context length |
| KV cache (128K context) | ~139 GB | 1× | Together with 70 GB of weights, far exceeds H200 HBM capacity (141 GB); requires offload |
| Layer activations | ~0.1 GB | L× | Intermediate tensors between layers |
The arithmetic intensity of decode is therefore catastrophically low. For a 70B model serving a 2K-context request, a decode step performs ~140 GFLOP while reading ~72 GB of weights and KV cache: roughly 2 FLOPs per byte, against the ~206 FLOPs per byte needed to saturate an H200 (989 TFLOPS ÷ 4.8 TB/s).
This is not a software failure. It is physics. Autoregressive generation produces one token per step. Each step requires reading the full weight set and the full KV cache. No amount of software optimization changes the fundamental ratio of work to data.
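This roofline can be sketched in a few lines. The effective HBM bandwidth of 3.34 TB/s is an assumption (roughly 70% of the H200's 4.8 TB/s peak), chosen to match the ~21.6 ms memory time quoted below; the other figures come from the text.

```python
# Batch-1 decode roofline for a 70B FP8 model on one H200.
# EFF_BW is an assumed effective bandwidth (~70% of the 4.8 TB/s peak).
PARAMS = 70e9          # parameters
WEIGHT_BYTES = 70e9    # FP8: one byte per parameter
KV_BYTES = 2.2e9       # KV cache at 2K context (table above)
PEAK_FLOPS = 989e12    # H200 FP8 tensor-core peak
EFF_BW = 3.34e12       # assumed effective HBM bandwidth, bytes/s

def decode_step_times():
    compute_ms = 2 * PARAMS / PEAK_FLOPS * 1e3           # ~2 FLOPs per parameter
    memory_ms = (WEIGHT_BYTES + KV_BYTES) / EFF_BW * 1e3
    return compute_ms, memory_ms

compute_ms, memory_ms = decode_step_times()
print(f"compute {compute_ms:.2f} ms, memory {memory_ms:.1f} ms, "
      f"memory/compute {memory_ms / compute_ms:.0f}x")
```

The step is two orders of magnitude longer on memory than on arithmetic, which is the entire argument of this section in one ratio.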
If decode is memory-bandwidth-limited, then the cost of a decode token is primarily the cost of reading bytes from HBM. This is not a metaphor. It is the dominant physical cost: every joule spent on a decode step is mostly a joule spent moving bytes across the HBM interface.
HBM energy per bit is approximately 3.9 pJ/bit (HBM3e, measured across the TSV and die interface at operating conditions). That is 31.2 pJ/byte, or about 2.2 joules to stream a 70 GB weight set across the HBM interface once.
The arithmetic is stark: at single-request serving, a decode step takes 21.6 milliseconds of memory time and roughly 0.14 milliseconds of compute time. The GPU is idle on arithmetic for more than 99% of the step. This is why batching matters so much — it amortises the weight read across multiple requests simultaneously — and why it is also the source of the most important cost tradeoffs in inference.
With batch size B, the weight bytes are read once but amortised across B tokens, while the KV bytes scale with B × context_length. Bytes read per generated token are therefore W/B + c·k, where W is the weight size, c the context length, and k the KV bytes per token of context (~1.1 MB for the 70B model at FP8). The two terms break even at B* = W/(c·k), roughly batch 16 at 4K context; beyond that point the KV cache, not the weights, dominates HBM traffic.
This explains one of the deepest truths in inference economics: batching is not an optimization — it is the economic foundation. Without batching, inference is prohibitively expensive. With aggressive batching, the weight-read cost is nearly eliminated, and the KV cache becomes the dominant cost. This is exactly why long-context requests are so much more expensive at the same output length — the KV term dominates the asymptote.
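The amortisation can be sketched numerically. The ~1.1 MB of KV per token of context is an approximation derived from the 2.2 GB-at-2K-context figure earlier; nothing else here is measured.

```python
# HBM bytes read per generated token at batch size B.
W = 70e9             # weight bytes (70B params, FP8)
KV_PER_TOK = 1.1e6   # approx. KV cache bytes per token of context

def bytes_per_token(batch, context):
    weight_term = W / batch          # weight read amortised over the batch
    kv_term = context * KV_PER_TOK   # each request reads its own KV cache
    return weight_term + kv_term

def breakeven_batch(context):
    # Batch size at which the amortised weight read equals the KV read
    return W / (context * KV_PER_TOK)

print(f"{bytes_per_token(1, 4096) / 1e9:.1f} GB/token at batch 1")
print(f"{bytes_per_token(64, 4096) / 1e9:.2f} GB/token at batch 64")
print(f"breakeven batch at 4K context: {breakeven_batch(4096):.0f}")
```

At batch 64 the weight term has shrunk to ~1 GB per token, and the KV term (~4.5 GB at 4K context) is the dominant cost, exactly as the paragraph above argues.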
Power consumption in inference has two components: the IT load (what the GPUs, CPUs, NICs, and storage draw) and the facility overhead (cooling, power distribution, lighting, and management). The ratio is captured by PUE (Power Usage Effectiveness), which for a modern liquid-cooled AI facility runs between 1.07 and 1.15.
Measurements from production H200 SXM clusters in Q1 2026 show the following breakdown for a typical Llama-3-70B decode step at batch 64, context 4K:
| Component | % of GPU power | Watts (at 700W TDP) | Notes |
|---|---|---|---|
| SM compute (tensor cores) | 38% | 266 W | Matrix multiply, attention compute |
| HBM reads/writes | 31% | 217 W | Weights + KV cache bandwidth; largest single component |
| SRAM fill and spill | 9% | 63 W | L2 cache, scratchpad, register file |
| NVLink / PCIe links | 7% | 49 W | KV transfer in disaggregated setups; all-reduce in TP |
| Optical retimers / CPO | 8% | 56 W | 800G transceivers; CPO at NVL72 scale |
| Leakage + housekeeping | 7% | 49 W | Static leakage, clock distribution, I/O |
The headline: data movement (HBM + SRAM + links + optics) accounts for 55% of GPU power during decode. Arithmetic is 38%. This means that if you could somehow make arithmetic free but leave data movement unchanged, you would save 38% of GPU power. If you could make data movement free but leave arithmetic unchanged, you would save 55%.
Memory is the budget. Quantization — FP16 → FP8 → FP4 — reduces cost primarily by halving or quartering the bytes that cross HBM per weight read, not by making arithmetic cheaper per operation. This is why FP8 produces a ~33% power reduction rather than the 50% naive expectation: it halves weight bytes (saving ~15.5% of GPU power), permits lower SM voltage and clocks (~9%), and trims link and optical traffic (~9% combined across NVLink and transceivers).
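The byte-energy framing makes the quantization arithmetic easy to check. A small sketch using the 31.2 pJ/byte figure from earlier; this counts only the weight read and ignores KV and activation traffic, so it is an illustration of the scaling, not a full power model.

```python
# Energy to stream the full weight set across the HBM interface once,
# per numeric format. Approximation: weight traffic only.
PJ_PER_BYTE = 31.2e-12   # HBM3e: ~3.9 pJ/bit x 8 bits
PARAMS = 70e9            # 70B-parameter model

def weight_read_joules(bytes_per_param):
    return PARAMS * bytes_per_param * PJ_PER_BYTE

for fmt, bpp in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{fmt}: {weight_read_joules(bpp):.2f} J per full weight read")
```

Each halving of precision halves the joules spent per weight read, which is the mechanism behind the HBM share of the FP8 savings above.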
GPU power is the largest but not the only power cost in serving. A complete facility accounting must include:
| Facility Layer | Overhead (% of IT load) | Technology |
|---|---|---|
| Direct liquid cooling (DLC) | 4–6% | Pumps, heat exchangers, cooling towers; 32°C supply water |
| Power distribution (UPS, PDU, busbar) | 3–5% | UPS losses, busbar resistance, rectifier efficiency |
| Facility HVAC and lighting | 1–2% | Hot aisle/cold aisle for residual air, lighting |
| Network switching (ToR, spine) | 2–4% | 400G/800G Ethernet or InfiniBand switches |
| Storage (NVMe, networked flash) | 1–3% | E1.S/E3.S drives for KV offload, checkpoint, weights |
| CPU hosts, BMCs, management | 2–4% | ARM/x86 hosts, baseboard management, telemetry |
For a liquid-cooled AI facility, the combined PUE overhead is approximately 1.10 to 1.15 (10–15% overhead on IT load). For an air-cooled facility running high-density racks, PUE climbs to 1.25–1.45, which increases effective power cost by 25–45%.
Power is an operating expense. The GPU itself is a capital expense. To convert GPU cost into a per-token cost, we need an amortisation model.
An H200 SXM5 module costs approximately $30,000–$35,000 ex-factory in 2026. An H200 DGX system with 8 GPUs, host, networking, and storage runs $350,000–$420,000. Hyperscalers buying at volume pay somewhat less, but $30K per GPU is a reasonable working number. GB200 NVL72 racks — 72 GPUs plus NVSwitch fabric, liquid cooling, and integrated networking — list at $3.0–3.5M per rack, or approximately $42,000–$49,000 per GPU.
Data centre GPU hardware is typically depreciated over three to five years. For inference at hyperscale, a three-year depreciation schedule is standard, reflecting both hardware obsolescence and rapid model turnover. Utilisation rate — the fraction of time a GPU is generating revenue rather than idle or doing system overhead — typically runs 55–75% for production inference clusters. Below this range, too many GPUs are waiting for requests. Above it, latency SLOs become hard to maintain.
At $40,000 of system-level capital per GPU, three-year straight-line depreciation, and 65% utilisation, that works out to $40,000 ÷ (3 × 8,760 h × 0.65) ≈ $2.34 per productive GPU-hour. This $2.34/GPU-hour is the capital component only. It does not include power, headcount, or software licensing. It represents what the hardware costs per hour of productive use, amortised over its useful life.
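The amortisation is a one-line formula. The $40K capex and 65% utilisation are the model's baseline assumptions; the second call is an illustrative GB200-class figure, not a measured one.

```python
# Straight-line capital cost per *productive* GPU-hour.
def capital_per_gpu_hour(capex_usd, years=3.0, utilisation=0.65):
    productive_hours = years * 8760 * utilisation   # 8,760 hours per year
    return capex_usd / productive_hours

print(f"${capital_per_gpu_hour(40_000):.2f}/GPU-hour")              # model baseline
print(f"${capital_per_gpu_hour(45_000, years=2.0):.2f}/GPU-hour")   # assumed GB200-class, 2-year life
```

Note how sensitive the figure is: halving the depreciation window or the utilisation roughly doubles the hourly capital cost.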
NAND flash does not contribute arithmetic — but it contributes substantially to the serving cost in two ways: the capital cost of the drives installed per GPU node, and the ongoing power draw for KV offload, weight staging, and model loading.
A DGX H200 ships with four 3.84 TB E1.S NVMe drives (15.4 TB per node, 1.9 TB per GPU). For inference systems that use NVMe-based KV offload (needed for long-context serving where KV exceeds HBM capacity), operators typically provision an additional 4–8 × 7.68 TB drives, bringing per-GPU NAND to 10–20 TB.
| NAND Tier | Per-GPU capacity | $/GB (2026) | Per-GPU NAND cost | Function |
|---|---|---|---|---|
| Local NVMe (base) | 1.9 TB | $0.11 | $209 | OS, model weights staging, checkpoint absorption |
| KV offload NVMe | 8–16 TB | $0.09 | $720–$1,440 | Long-context KV cache pages evicted from HBM |
| Networked flash (shared) | ~5 TB allocated | $0.08 | $400 | Model repository, checkpoint storage, dataset hot tier |
The per-GPU NAND cost of $1,300–$2,050 is small relative to the $32,000 GPU cost (4–6%), but it is non-trivial and often omitted from simple cost models. More importantly, NVMe contributes to power: each E1.S drive draws 4–8 W active, adding 60–120 W per eight-drive node, or 7.5–15 W per GPU in shared accounting.
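The per-GPU NAND range follows directly from capacity times $/GB across the three tiers in the table:

```python
# Per-GPU NAND capital from the tier table: capacity (TB) x $/GB.
def tier_cost(capacity_tb, usd_per_gb):
    return capacity_tb * 1000 * usd_per_gb   # 1 TB = 1,000 GB

low  = tier_cost(1.9, 0.11) + tier_cost(8, 0.09)  + tier_cost(5, 0.08)
high = tier_cost(1.9, 0.11) + tier_cost(16, 0.09) + tier_cost(5, 0.08)
print(f"per-GPU NAND capital: ${low:.0f} to ${high:.0f}")
```

The spread comes almost entirely from how much KV-offload capacity an operator provisions (8 vs. 16 TB per GPU).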
Tensor parallelism across GPUs requires NVLink all-reduce operations on every decode step. For a 70B model served across 4 GPUs with TP=4, each decode step requires synchronising activations totalling approximately 32 MB per all-reduce. At NVLink 4.0 bandwidth of 900 GB/s bidirectional, this takes roughly 35 microseconds — small individually, but at aggregate rates of 1,000 tokens per second per GPU it accumulates.
In disaggregated serving (prefill/decode separated), the KV cache must be transferred from the prefill node to the decode node after each prefill. For a 70B model, 4K context, the KV cache is approximately 4.4 GB. Over NVLink or 400G RDMA (50 GB/s effective), this transfer takes ~88 milliseconds and adds directly to time-to-first-token (TTFT) latency. The interconnect cost of disaggregation is not negligible.
| Interconnect Path | Bandwidth | KV transfer (4K ctx, 70B) | Power |
|---|---|---|---|
| NVLink 4.0 (intra-node) | 900 GB/s bidir | ~5 ms (4.4 GB) | ~80W per GPU (NVSwitch shared) |
| 400G Ethernet (inter-node) | ~42 GB/s effective | ~104 ms | ~15W NIC + switch share |
| 800G Ethernet (inter-node) | ~84 GB/s effective | ~52 ms | ~22W NIC + switch share |
| NVLink-C2C (GB200 NVL72) | 1.8 TB/s bidir | ~5 ms | Included in NVSwitch TDP |
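The transfer times in the table are just size over effective bandwidth. A minimal helper, using the essay's 4.4 GB figure for a 4K-context 70B KV cache:

```python
# KV cache transfer time: size over effective bandwidth.
def kv_transfer_ms(kv_gb, eff_gb_per_s):
    return kv_gb / eff_gb_per_s * 1e3

KV_4K_GB = 4.4  # 70B model, 4K context, FP8
for path, bw in [("NVLink 4.0", 900), ("400G Ethernet", 42), ("800G Ethernet", 84)]:
    print(f"{path}: {kv_transfer_ms(KV_4K_GB, bw):.1f} ms")
```

The same helper reproduces the ~88 ms figure from the disaggregation paragraph when called with the ~50 GB/s effective RDMA bandwidth quoted there.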
We now have enough components to build a complete per-token cost model. The model has five layers: capital amortisation, power (GPU + facility), memory bandwidth cost, NAND/storage, and networking. Labour, software licensing, and egress are treated as overhead multipliers.
How many tokens can a single H200 generate per second? At batch 64, context 4K, FP8, throughput is set by the memory bandwidth bottleneck: how quickly the weight set and every request's KV cache can be streamed from HBM on each step. That throughput underlies the per-token figures below:
| Cost Layer | Derivation | $/GPU-hour | $/1M tokens (B=64) | % of total |
|---|---|---|---|---|
| Capital (GPU depreciation) | $40K over 3yr at 65% util | $2.34 | $60.6 | 35% |
| GPU power | 700W × $0.07/kWh × 1.12 PUE | $0.055 | $48.7 | 28% |
| Facility overhead (cooling, PDU) | 12% overhead on GPU power | $0.019 | $17.4 | 10% |
| NAND (amortised, power) | $1.6K drives / 3yr + 10W/GPU | $0.013 | $13.9 | 8% |
| Networking (NVLink, ToR switches) | $3K/GPU NW gear / 3yr + 20W NIC | $0.010 | $10.4 | 6% |
| Labour + software | ~18% overhead on infra cost (ML infra, SRE) | $0.026 | $22.7 | 13% |
| Total (infra + labour) | Sum of the layers above | $2.46 | $173.7 | 100% |
At $174/1M output tokens, a major lab selling at $3/1M is pricing far below this modelled infrastructure cost, and $174 already assumes best-case serving conditions: FP8, batch 64, 4K context, 65% utilisation, and $0.07/kWh power. Under worse conditions (low batch, long context, expensive power, poor utilisation), the modelled cost climbs even higher.
The $174 number is not a lower bound. It is a best-case number for a specific set of conditions that favour high utilisation, short context, and cheap power. Real serving fleets mix request types, context lengths, and batch fills. The actual blended cost per token for a production serving cluster is 2–5× higher.
The cost model above assumes a 4K context. This is a deliberately optimistic assumption. Modern applications routinely use 32K, 64K, and 128K context windows. The cost implications are severe and non-linear.
For a 70B model at FP8, the KV cache per token of context is approximately 1.1 MB across all layers: about 2.2 GB at 2K context, 4.4 GB at 4K, and 35 GB at 32K.
At 128K context, the KV cache alone exceeds the HBM capacity of a single H200. There are three responses to this: (a) span KV across multiple GPUs with tensor parallelism, (b) offload overflow KV pages to NVMe and prefetch on demand, or (c) use sparse attention to limit which tokens are attended to. Each response has a cost.
| Context Length | KV size (70B FP8) | HBM status | $/1M output tokens | Cost multiplier vs. 4K |
|---|---|---|---|---|
| 2K | ~2.2 GB | Fits easily | $142 | 0.82× |
| 4K (baseline) | ~4.4 GB | Fits comfortably | $174 | 1.0× |
| 8K | ~8.8 GB | Fits; limited batching | $230 | 1.3× |
| 32K | ~35 GB | Fits; batch 3–4 max | $620 | 3.6× |
| 64K | ~70 GB | Fits; batch 1 only | $980 | 5.6× |
| 128K | ~139 GB | Requires NVMe offload or multi-GPU | $1,420 | 8.2× |
| 1M (projected) | ~1.1 TB | Requires multi-node + NVMe tier | ~$9,800 | 56× |
The cost multiplier is not linear with context length. It is superlinear, driven by two compounding effects: the KV cache grows linearly with context but the batching capacity shrinks linearly, so the cost per token grows as the square of context length at the regime where KV dominates. This has profound implications for long-context pricing.
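A toy model makes the compounding concrete. HBM capacity, weight size, and KV-per-1K figures come from the text; treating the maximum batch as whatever KV fits beside the weights is a deliberate simplification (real schedulers keep headroom for activations and fragmentation).

```python
# Toy model: HBM bytes per token vs. context, with batch capped by HBM.
HBM_GB = 141        # H200 HBM capacity
WEIGHTS_GB = 70     # 70B model at FP8
KV_PER_1K_GB = 1.1  # ~1.1 GB of KV per 1K tokens of context

def max_batch(context_k):
    # Largest batch whose KV caches fit beside the weights (simplified)
    free_gb = HBM_GB - WEIGHTS_GB
    return max(1, int(free_gb // (context_k * KV_PER_1K_GB)))

def bytes_per_token_gb(context_k):
    # Amortised weight read plus per-request KV read, in GB
    return WEIGHTS_GB / max_batch(context_k) + context_k * KV_PER_1K_GB

base = bytes_per_token_gb(4)
for c in (4, 8, 32, 64):
    print(f"{c}K ctx: batch {max_batch(c)}, "
          f"{bytes_per_token_gb(c) / base:.1f}x HBM bytes per token")
```

Going from 4K to 32K context (8×) multiplies HBM bytes per token by ~8× in this sketch, because the KV read grows while the batch that amortises the weight read collapses: the two linear effects compound.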
The cost model makes the levers explicit. Six levers meaningfully reduce per-token cost, and they are not equal in magnitude or in what they trade away: quantization (fewer bytes per weight read); batching (weight-read amortisation); context management (KV growth and its batching penalty); power geography (up to 4× variation in the power layer); utilisation and depreciation schedule (the capital layer); and serving topology (disaggregation and its interconnect tax).
The $/1M-token figure that dominates AI pricing discussions is calculated on output tokens only. This hides several costs that are real and often large.
Input tokens require prefill, which is compute-intensive and also writes the KV cache into HBM. For a request with a 10,000-token prompt and 200-token output, the prefill cost is roughly 50× the cost of a 200-token decode run at similar batch. Labs that charge separately for input and output tokens (Anthropic, OpenAI) are properly accounting for this. Labs that charge a single output-token price are either subsidising inputs or pricing them in via a higher output price.
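The 50× figure is the FLOP ratio between prompt and output, which a one-liner confirms; note that prefill runs at far higher arithmetic intensity than decode, so its cost per FLOP is lower and the ratio bounds compute rather than dollars.

```python
# Prefill vs. decode FLOPs for one request (~2 FLOPs per parameter per token).
PARAMS = 70e9

def token_flops(n_tokens):
    return 2 * PARAMS * n_tokens

prompt_toks, output_toks = 10_000, 200
ratio = token_flops(prompt_toks) / token_flops(output_toks)
print(f"prefill/decode FLOP ratio: {ratio:.0f}x")  # 10,000 / 200 = 50x
```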
Time-to-first-token (TTFT) latency is a quality-of-experience metric that users care deeply about. Achieving a TTFT below 300ms for long-context requests requires either a dedicated prefill pool (disaggregation), very high batch priority for prefill, or prefill GPUs that are held partially idle to respond quickly. All three approaches increase cost. The TTFT SLO tightens the effective utilisation ceiling, which raises the capital component of per-token cost.
US East Coast electricity averages $0.07/kWh. Virginia (data centre corridor): $0.065/kWh. Northern Virginia: $0.058/kWh. Oregon (hydro): $0.045/kWh. Texas (wind): $0.038/kWh during off-peak. Singapore: $0.115/kWh. Germany: $0.165/kWh. Japan: $0.145/kWh. The power component of per-token cost can vary by more than 4× based purely on geography. Labs that serve globally at a single price must account for this in their blended cost model.
| Region | Power cost ($/kWh) | Power layer $/1M tokens | Total blended $/1M tokens |
|---|---|---|---|
| Oregon / Pacific NW (hydro) | $0.045 | $31.3 | $156 |
| Virginia (US East, grid) | $0.065 | $45.2 | $169 |
| Texas (wind mix) | $0.055 | $38.2 | $162 |
| Singapore | $0.115 | $79.9 | $215 |
| Germany | $0.165 | $114.6 | $249 |
| Japan | $0.145 | $100.8 | $235 |
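The power layer scales linearly with the electricity rate. A sketch that holds the non-power layers fixed at their baseline total (~$125/1M): this reproduces the US rows above to within rounding, and slightly understates the highest-rate regions, where facility overhead presumably scales as well.

```python
# Scale the power layer of the baseline cost table with local $/kWh.
BASE_RATE = 0.07          # $/kWh assumed in the baseline cost table
BASE_POWER_LAYER = 48.7   # GPU power layer, $/1M tokens at $0.07/kWh
NON_POWER_LAYERS = 125.0  # all other layers, held fixed in this sketch

def power_layer(rate):
    return BASE_POWER_LAYER * rate / BASE_RATE

def blended_total(rate):
    return NON_POWER_LAYERS + power_layer(rate)

for region, rate in [("Oregon", 0.045), ("Virginia", 0.065), ("Germany", 0.165)]:
    print(f"{region}: power ${power_layer(rate):.1f}, "
          f"total ~${blended_total(rate):.0f} per 1M output tokens")
```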
The cost model assumes three-year depreciation. But H100s deployed in 2023 were largely being displaced by H200s by mid-2025, and GB200s are arriving in volume in 2026. Real depreciation schedules for AI hardware are closer to 18–24 months at the frontier, not three years. Shortening depreciation to 2 years increases the capital layer from $60.6/1M to $90.9/1M, adding $30/1M to the total cost.
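The capital layer scales inversely with depreciation life, so the sensitivity is a one-line check against the $60.6/1M baseline from the cost table:

```python
# Capital layer ($/1M tokens) vs. depreciation life, inverse scaling.
BASE_CAPITAL = 60.6   # $/1M tokens at 3-year depreciation (cost table)

def capital_layer(dep_years):
    return BASE_CAPITAL * 3.0 / dep_years

for years in (3.0, 2.0, 1.5):
    print(f"{years:.1f}-year depreciation: ${capital_layer(years):.1f}/1M tokens")
```

An 18-month frontier schedule doubles the capital layer relative to the three-year baseline, which is why the depreciation assumption matters as much as the hardware price itself.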
The pricing wars of 2025–2026 — GPT-4o at $2.50, Claude at $3, Gemini Flash at $0.075, Deepseek at $1.10 — will eventually converge toward infrastructure cost. The labs that are currently pricing below infrastructure cost are buying developer adoption with venture capital. That is a legitimate strategy, but it is not a sustainable one.
The more interesting question is not which provider has the lowest sticker price today. It is which providers have the lowest structural cost — the cost they will converge to as compute scales, models mature, and the easy gains from quantization and batching have been fully captured.
The structural cost advantage comes from three places: cheap power and efficient cooling (geography and a low PUE); high sustained utilisation over a realistic depreciation schedule (the capital layer); and architectures that minimise data movement through quantization, aggressive batching, and disciplined KV cache management.
This brings us back to the opening argument of this essay, and to the central claim of the preceding work on memory-centric systems: the cost of a token is not primarily the cost of arithmetic. It is the cost of moving bytes — from NAND to HBM, from HBM to SRAM, from prefill node to decode node, from one GPU to another. The labs and system builders who understand and minimise that movement will have lower structural costs, lower energy consumption, and more margin to compete on price, capability, or latency — their choice.
A token is not a FLOP. It is a sequence of byte movements with a small amount of arithmetic attached. Price the byte movements, and you have priced the token.
© 2026 Manish KL. All rights reserved.