A bottom-up cost model connecting HBM bandwidth, cooling kilowatts, NAND geometry, and interconnect physics to the number that actually matters: dollars per million output tokens.
The AI pricing war of 2025–2026 produced extraordinary numbers. GPT-4o output pricing fell from $15 per million tokens to $10, and then to $2.50, over eighteen months. Gemini Flash came in at $0.075. Deepseek-V3 at $1.10. Claude Sonnet at $3. These prices are real — they are what developers pay — but they are not costs. Nobody outside the infrastructure teams at the major labs knows what a token actually costs to produce.
This essay attempts a rigorous approximation. Not a guess based on quarterly earnings, but a bottom-up model built from the physical constraints of the hardware: what computation a single output token requires, what memory movement that computation demands, what power that movement consumes, and what that power costs when you include the full facility stack from the transformer on the street to the die on the board.
The reason this calculation is rarely done is that it requires simultaneously understanding five domains: GPU microarchitecture, power systems engineering, memory hierarchy physics, networking infrastructure, and financial modelling. Each domain has specialists. Almost nobody combines them. This essay tries to.
The key insight this essay argues for: the cost of a token is dominated not by arithmetic but by data movement. Understanding where the bytes move, how many times, and what each crossing of a memory boundary costs is the core of inference economics.
Before we can price a token, we need to understand what producing one requires. At decode time — which is where most serving cost lives — generating one output token from a transformer model requires a full forward pass: attention and feed-forward computation at every layer, against the weights and the cached keys and values.

For a dense transformer with L layers, hidden dimension d, and projection dimension d_ff, one decode step requires roughly L × (8d² + 4d·d_ff) FLOPs for the attention projections and FFN (approximately 2 FLOPs per parameter, or on the order of 140 GFLOP for a 70B model), plus the attention dot-products against every cached token.
At H200 peak throughput of 989 TFLOPS (FP8 tensor core), a single decode step takes roughly 0.14 milliseconds of pure compute (140 GFLOP ÷ 989 TFLOPS). But decode steps don't run at peak TFLOPS. They are memory-bandwidth-limited. The actual bottleneck is reading model weights and the KV cache from HBM for every token generated.
For a decode step, the memory that must be read per token includes: all model weight matrices (read once per step), the KV cache for every token in the context (read for attention), and activation tensors between layers.
| Memory Object | Size (70B, FP8) | Reads per decode step | Notes |
|---|---|---|---|
| Model weights | 70 GB | 1× | All weight matrices, QKV projections, FFN |
| KV cache (2K context) | ~2.2 GB | 1× | Keys and values for all prior tokens, all layers |
| KV cache (32K context) | ~35 GB | 1× | Scales linearly with context length |
| KV cache (128K context) | ~139 GB | 1× | Together with 70 GB of weights, far exceeds H200 HBM capacity (141 GB); requires offload |
| Layer activations | ~0.1 GB | L× | Intermediate tensors between layers |
The arithmetic intensity of decode is therefore catastrophically low. For a 70B model serving a 2K-context request, a decode step performs ~140 GFLOP while reading ~72 GB of weights and KV cache: roughly 2 FLOPs per byte, against the ~206 FLOPs per byte needed to saturate an H200 (989 TFLOPS ÷ 4.8 TB/s).
This is not a software failure. It is physics. Autoregressive generation produces one token per step. Each step requires reading the full weight set and the full KV cache. No amount of software optimization changes the fundamental ratio of work to data.
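This roofline can be sketched in a few lines. The effective HBM bandwidth of 3.34 TB/s is an assumption (roughly 70% of the H200's 4.8 TB/s peak), chosen to match the ~21.6 ms memory time quoted below; the other figures come from the text.

```python
# Batch-1 decode roofline for a 70B FP8 model on one H200.
# EFF_BW is an assumed effective bandwidth (~70% of the 4.8 TB/s peak).
PARAMS = 70e9          # parameters
WEIGHT_BYTES = 70e9    # FP8: one byte per parameter
KV_BYTES = 2.2e9       # KV cache at 2K context (table above)
PEAK_FLOPS = 989e12    # H200 FP8 tensor-core peak
EFF_BW = 3.34e12       # assumed effective HBM bandwidth, bytes/s

def decode_step_times():
    compute_ms = 2 * PARAMS / PEAK_FLOPS * 1e3           # ~2 FLOPs per parameter
    memory_ms = (WEIGHT_BYTES + KV_BYTES) / EFF_BW * 1e3
    return compute_ms, memory_ms

compute_ms, memory_ms = decode_step_times()
print(f"compute {compute_ms:.2f} ms, memory {memory_ms:.1f} ms, "
      f"memory/compute {memory_ms / compute_ms:.0f}x")
```

The step is two orders of magnitude longer on memory than on arithmetic, which is the entire argument of this section in one ratio.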
If decode is memory-bandwidth-limited, then the cost of a decode token is primarily the cost of reading bytes from HBM. This is not a metaphor. It is the dominant physical cost: every joule spent on a decode step is mostly a joule spent moving bytes across the HBM interface.
HBM energy per bit is approximately 3.9 pJ/bit (HBM3e, measured across the TSV and die interface at operating conditions). That is 31.2 pJ/byte, or about 2.2 joules to stream a 70 GB weight set across the HBM interface once.
The arithmetic is stark: at single-request serving, a decode step takes 21.6 milliseconds of memory time and roughly 0.14 milliseconds of compute time. The GPU is idle on arithmetic for more than 99% of the step. This is why batching matters so much — it amortises the weight read across multiple requests simultaneously — and why it is also the source of the most important cost tradeoffs in inference.
With batch size B, the weight bytes are read once but amortised across B tokens, while the KV bytes scale with B × context_length. Bytes read per generated token are therefore W/B + c·k, where W is the weight size, c the context length, and k the KV bytes per token of context (~1.1 MB for the 70B model at FP8). The two terms break even at B* = W/(c·k), roughly batch 16 at 4K context; beyond that point the KV cache, not the weights, dominates HBM traffic.
This explains one of the deepest truths in inference economics: batching is not an optimization — it is the economic foundation. Without batching, inference is prohibitively expensive. With aggressive batching, the weight-read cost is nearly eliminated, and the KV cache becomes the dominant cost. This is exactly why long-context requests are so much more expensive at the same output length — the KV term dominates the asymptote.
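The amortisation can be sketched numerically. The ~1.1 MB of KV per token of context is an approximation derived from the 2.2 GB-at-2K-context figure earlier; nothing else here is measured.

```python
# HBM bytes read per generated token at batch size B.
W = 70e9             # weight bytes (70B params, FP8)
KV_PER_TOK = 1.1e6   # approx. KV cache bytes per token of context

def bytes_per_token(batch, context):
    weight_term = W / batch          # weight read amortised over the batch
    kv_term = context * KV_PER_TOK   # each request reads its own KV cache
    return weight_term + kv_term

def breakeven_batch(context):
    # Batch size at which the amortised weight read equals the KV read
    return W / (context * KV_PER_TOK)

print(f"{bytes_per_token(1, 4096) / 1e9:.1f} GB/token at batch 1")
print(f"{bytes_per_token(64, 4096) / 1e9:.2f} GB/token at batch 64")
print(f"breakeven batch at 4K context: {breakeven_batch(4096):.0f}")
```

At batch 64 the weight term has shrunk to ~1 GB per token, and the KV term (~4.5 GB at 4K context) is the dominant cost, exactly as the paragraph above argues.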
Power consumption in inference has two components: the IT load (what the GPUs, CPUs, NICs, and storage draw) and the facility overhead (cooling, power distribution, lighting, and management). The ratio is captured by PUE (Power Usage Effectiveness), which for a modern liquid-cooled AI facility runs between 1.07 and 1.15.
Measurements from production H200 SXM clusters in Q1 2026 show the following breakdown for a typical Llama-3-70B decode step at batch 64, context 4K:
| Component | % of GPU power | Watts (at 700W TDP) | Notes |
|---|---|---|---|
| SM compute (tensor cores) | 38% | 266 W | Matrix multiply, attention compute |
| HBM reads/writes | 31% | 217 W | Weights + KV cache bandwidth; largest single component |
| SRAM fill and spill | 9% | 63 W | L2 cache, scratchpad, register file |
| NVLink / PCIe links | 7% | 49 W | KV transfer in disaggregated setups; all-reduce in TP |
| Optical retimers / CPO | 8% | 56 W | 800G transceivers; CPO at NVL72 scale |
| Leakage + housekeeping | 7% | 49 W | Static leakage, clock distribution, I/O |
The headline: data movement (HBM + SRAM + links + optics) accounts for 55% of GPU power during decode. Arithmetic is 38%. This means that if you could somehow make arithmetic free but leave data movement unchanged, you would save 38% of GPU power. If you could make data movement free but leave arithmetic unchanged, you would save 55%.
Memory is the budget. Quantization — FP16 → FP8 → FP4 — reduces cost primarily by halving or quartering the bytes that cross HBM per weight read, not by making arithmetic cheaper per operation. This is why FP8 produces a ~33% power reduction rather than the 50% naive expectation: it halves weight bytes (saving ~15.5% of GPU power), permits lower SM voltage and clocks (~9%), and trims link and optical traffic (~9% combined across NVLink and transceivers).
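The byte-energy framing makes the quantization arithmetic easy to check. A small sketch using the 31.2 pJ/byte figure from earlier; this counts only the weight read and ignores KV and activation traffic, so it is an illustration of the scaling, not a full power model.

```python
# Energy to stream the full weight set across the HBM interface once,
# per numeric format. Approximation: weight traffic only.
PJ_PER_BYTE = 31.2e-12   # HBM3e: ~3.9 pJ/bit x 8 bits
PARAMS = 70e9            # 70B-parameter model

def weight_read_joules(bytes_per_param):
    return PARAMS * bytes_per_param * PJ_PER_BYTE

for fmt, bpp in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{fmt}: {weight_read_joules(bpp):.2f} J per full weight read")
```

Each halving of precision halves the joules spent per weight read, which is the mechanism behind the HBM share of the FP8 savings above.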
GPU power is the largest but not the only power cost in serving. A complete facility accounting must include:
| Facility Layer | Overhead (% of IT load) | Technology |
|---|---|---|
| Direct liquid cooling (DLC) | 4–6% | Pumps, heat exchangers, cooling towers; 32°C supply water |
| Power distribution (UPS, PDU, busbar) | 3–5% | UPS losses, busbar resistance, rectifier efficiency |
| Facility HVAC and lighting | 1–2% | Hot aisle/cold aisle for residual air, lighting |
| Network switching (ToR, spine) | 2–4% | 400G/800G Ethernet or InfiniBand switches |
| Storage (NVMe, networked flash) | 1–3% | E1.S/E3.S drives for KV offload, checkpoint, weights |
| CPU hosts, BMCs, management | 2–4% | ARM/x86 hosts, baseboard management, telemetry |
For a liquid-cooled AI facility, the combined PUE overhead is approximately 1.10 to 1.15 (10–15% overhead on IT load). For an air-cooled facility running high-density racks, PUE climbs to 1.25–1.45, which increases effective power cost by 25–45%.
Power is an operating expense. The GPU itself is a capital expense. To convert GPU cost into a per-token cost, we need an amortisation model.
An H200 SXM5 module costs approximately $30,000–$35,000 ex-factory in 2026. An H200 DGX system with 8 GPUs, host, networking, and storage runs $350,000–$420,000. Hyperscalers buying at volume pay somewhat less, but $30K per GPU is a reasonable working number. GB200 NVL72 racks — 72 GPUs plus NVSwitch fabric, liquid cooling, and integrated networking — list at $3.0–3.5M per rack, or approximately $42,000–$49,000 per GPU.
Data centre GPU hardware is typically depreciated over three to five years. For inference at hyperscale, a three-year depreciation schedule is standard, reflecting both hardware obsolescence and rapid model turnover. Utilisation rate — the fraction of time a GPU is generating revenue rather than idle or doing system overhead — typically runs 55–75% for production inference clusters. Below this range, too many GPUs are waiting for requests. Above it, latency SLOs become hard to maintain.
At $40,000 of system-level capital per GPU, three-year straight-line depreciation, and 65% utilisation, that works out to $40,000 ÷ (3 × 8,760 h × 0.65) ≈ $2.34 per productive GPU-hour. This $2.34/GPU-hour is the capital component only. It does not include power, headcount, or software licensing. It represents what the hardware costs per hour of productive use, amortised over its useful life.
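The amortisation is a one-line formula. The $40K capex and 65% utilisation are the model's baseline assumptions; the second call is an illustrative GB200-class figure, not a measured one.

```python
# Straight-line capital cost per *productive* GPU-hour.
def capital_per_gpu_hour(capex_usd, years=3.0, utilisation=0.65):
    productive_hours = years * 8760 * utilisation   # 8,760 hours per year
    return capex_usd / productive_hours

print(f"${capital_per_gpu_hour(40_000):.2f}/GPU-hour")              # model baseline
print(f"${capital_per_gpu_hour(45_000, years=2.0):.2f}/GPU-hour")   # assumed GB200-class, 2-year life
```

Note how sensitive the figure is: halving the depreciation window or the utilisation roughly doubles the hourly capital cost.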
NAND flash does not contribute arithmetic — but it contributes substantially to the serving cost in two ways: the capital cost of the drives installed per GPU node, and the ongoing power draw for KV offload, weight staging, and model loading.
A DGX H200 ships with four 3.84 TB E1.S NVMe drives (15.4 TB per node, 1.9 TB per GPU). For inference systems that use NVMe-based KV offload (needed for long-context serving where KV exceeds HBM capacity), operators typically provision an additional 4–8 × 7.68 TB drives, bringing per-GPU NAND to 10–20 TB.
| NAND Tier | Per-GPU capacity | $/GB (2026) | Per-GPU NAND cost | Function |
|---|---|---|---|---|
| Local NVMe (base) | 1.9 TB | $0.11 | $209 | OS, model weights staging, checkpoint absorption |
| KV offload NVMe | 8–16 TB | $0.09 | $720–$1,440 | Long-context KV cache pages evicted from HBM |
| Networked flash (shared) | ~5 TB allocated | $0.08 | $400 | Model repository, checkpoint storage, dataset hot tier |
The per-GPU NAND cost of $1,300–$2,050 is small relative to the $32,000 GPU cost (4–6%), but it is non-trivial and often omitted from simple cost models. More importantly, NVMe contributes to power: each E1.S drive draws 4–8 W active, adding 60–120 W per eight-drive node, or 7.5–15 W per GPU in shared accounting.
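The per-GPU NAND range follows directly from capacity times $/GB across the three tiers in the table:

```python
# Per-GPU NAND capital from the tier table: capacity (TB) x $/GB.
def tier_cost(capacity_tb, usd_per_gb):
    return capacity_tb * 1000 * usd_per_gb   # 1 TB = 1,000 GB

low  = tier_cost(1.9, 0.11) + tier_cost(8, 0.09)  + tier_cost(5, 0.08)
high = tier_cost(1.9, 0.11) + tier_cost(16, 0.09) + tier_cost(5, 0.08)
print(f"per-GPU NAND capital: ${low:.0f} to ${high:.0f}")
```

The spread comes almost entirely from how much KV-offload capacity an operator provisions (8 vs. 16 TB per GPU).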
Tensor parallelism across GPUs requires NVLink all-reduce operations on every decode step. For a 70B model served across 4 GPUs with TP=4, each decode step requires synchronising activations totalling approximately 32 MB per all-reduce. At NVLink 4.0 bandwidth of 900 GB/s bidirectional, this takes roughly 35 microseconds — small individually, but at aggregate rates of 1,000 tokens per second per GPU it accumulates.
In disaggregated serving (prefill/decode separated), the KV cache must be transferred from the prefill node to the decode node after each prefill. For a 70B model, 4K context, the KV cache is approximately 4.4 GB. Over NVLink or 400G RDMA (50 GB/s effective), this transfer takes ~88 milliseconds and adds directly to time-to-first-token (TTFT) latency. The interconnect cost of disaggregation is not negligible.
| Interconnect Path | Bandwidth | KV transfer (4K ctx, 70B) | Power |
|---|---|---|---|
| NVLink 4.0 (intra-node) | 900 GB/s bidir | ~5 ms (4.4 GB) | ~80W per GPU (NVSwitch shared) |
| 400G Ethernet (inter-node) | ~42 GB/s effective | ~104 ms | ~15W NIC + switch share |
| 800G Ethernet (inter-node) | ~84 GB/s effective | ~52 ms | ~22W NIC + switch share |
| NVLink-C2C (GB200 NVL72) | 1.8 TB/s bidir | ~5 ms | Included in NVSwitch TDP |
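The transfer times in the table are just size over effective bandwidth. A minimal helper, using the essay's 4.4 GB figure for a 4K-context 70B KV cache:

```python
# KV cache transfer time: size over effective bandwidth.
def kv_transfer_ms(kv_gb, eff_gb_per_s):
    return kv_gb / eff_gb_per_s * 1e3

KV_4K_GB = 4.4  # 70B model, 4K context, FP8
for path, bw in [("NVLink 4.0", 900), ("400G Ethernet", 42), ("800G Ethernet", 84)]:
    print(f"{path}: {kv_transfer_ms(KV_4K_GB, bw):.1f} ms")
```

The same helper reproduces the ~88 ms figure from the disaggregation paragraph when called with the ~50 GB/s effective RDMA bandwidth quoted there.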
We now have enough components to build a complete per-token cost model. The model has five layers: capital amortisation, power (GPU + facility), memory bandwidth cost, NAND/storage, and networking. Labour, software licensing, and egress are treated as overhead multipliers.
How many tokens can a single H200 generate per second? At batch 64, context 4K, FP8, throughput is set by the memory bandwidth bottleneck: how quickly the weight set and every request's KV cache can be streamed from HBM on each step. That throughput underlies the per-token figures below:
| Cost Layer | Derivation | $/GPU-hour | $/1M tokens (B=64) | % of total |
|---|---|---|---|---|
| Capital (GPU depreciation) | $40K over 3yr at 65% util | $2.34 | $60.6 | 35% |
| GPU power | 700W × $0.07/kWh × 1.12 PUE | $0.055 | $48.7 | 28% |
| Facility overhead (cooling, PDU) | 12% overhead on GPU power | $0.019 | $17.4 | 10% |
| NAND (amortised, power) | $1.6K drives / 3yr + 10W/GPU | $0.013 | $13.9 | 8% |
| Networking (NVLink, ToR switches) | $3K/GPU NW gear / 3yr + 20W NIC | $0.010 | $10.4 | 6% |
| Labour + software | ~18% overhead on infra cost (ML infra, SRE) | $0.026 | $22.7 | 13% |
| Total (infra + labour) | Sum of the layers above | $2.46 | $173.7 | 100% |
At $174/1M output tokens, a major lab selling at $3/1M is pricing far below this modelled infrastructure cost, and $174 already assumes best-case serving conditions: FP8, batch 64, 4K context, 65% utilisation, and $0.07/kWh power. Under worse conditions (low batch, long context, expensive power, poor utilisation), the modelled cost climbs even higher.
The $174 number is not a lower bound. It is a best-case number for a specific set of conditions that favour high utilisation, short context, and cheap power. Real serving fleets mix request types, context lengths, and batch fills. The actual blended cost per token for a production serving cluster is 2–5× higher.
The cost model above assumes a 4K context. This is a deliberately optimistic assumption. Modern applications routinely use 32K, 64K, and 128K context windows. The cost implications are severe and non-linear.
For a 70B model at FP8, the KV cache per token of context is approximately 1.1 MB across all layers: about 2.2 GB at 2K context, 4.4 GB at 4K, and 35 GB at 32K.
At 128K context, the KV cache alone exceeds the HBM capacity of a single H200. There are three responses to this: (a) span KV across multiple GPUs with tensor parallelism, (b) offload overflow KV pages to NVMe and prefetch on demand, or (c) use sparse attention to limit which tokens are attended to. Each response has a cost.
| Context Length | KV size (70B FP8) | HBM status | $/1M output tokens | Cost multiplier vs. 4K |
|---|---|---|---|---|
| 2K | ~2.2 GB | Fits easily | $142 | 0.82× |
| 4K (baseline) | ~4.4 GB | Fits comfortably | $174 | 1.0× |
| 8K | ~8.8 GB | Fits; limited batching | $230 | 1.3× |
| 32K | ~35 GB | Fits; batch 3–4 max | $620 | 3.6× |
| 64K | ~70 GB | Fits; batch 1 only | $980 | 5.6× |
| 128K | ~139 GB | Requires NVMe offload or multi-GPU | $1,420 | 8.2× |
| 1M (projected) | ~1.1 TB | Requires multi-node + NVMe tier | ~$9,800 | 56× |
The cost multiplier is not linear with context length. It is superlinear, driven by two compounding effects: the KV cache grows linearly with context but the batching capacity shrinks linearly, so the cost per token grows as the square of context length at the regime where KV dominates. This has profound implications for long-context pricing.
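A toy model makes the compounding concrete. HBM capacity, weight size, and KV-per-1K figures come from the text; treating the maximum batch as whatever KV fits beside the weights is a deliberate simplification (real schedulers keep headroom for activations and fragmentation).

```python
# Toy model: HBM bytes per token vs. context, with batch capped by HBM.
HBM_GB = 141        # H200 HBM capacity
WEIGHTS_GB = 70     # 70B model at FP8
KV_PER_1K_GB = 1.1  # ~1.1 GB of KV per 1K tokens of context

def max_batch(context_k):
    # Largest batch whose KV caches fit beside the weights (simplified)
    free_gb = HBM_GB - WEIGHTS_GB
    return max(1, int(free_gb // (context_k * KV_PER_1K_GB)))

def bytes_per_token_gb(context_k):
    # Amortised weight read plus per-request KV read, in GB
    return WEIGHTS_GB / max_batch(context_k) + context_k * KV_PER_1K_GB

base = bytes_per_token_gb(4)
for c in (4, 8, 32, 64):
    print(f"{c}K ctx: batch {max_batch(c)}, "
          f"{bytes_per_token_gb(c) / base:.1f}x HBM bytes per token")
```

Going from 4K to 32K context (8×) multiplies HBM bytes per token by ~8× in this sketch, because the KV read grows while the batch that amortises the weight read collapses: the two linear effects compound.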
The cost model makes the levers explicit. Six levers meaningfully reduce per-token cost, and they are not equal in magnitude or in what they trade away: quantization (fewer bytes per weight read); batching (weight-read amortisation); context management (KV growth and its batching penalty); power geography (up to 4× variation in the power layer); utilisation and depreciation schedule (the capital layer); and serving topology (disaggregation and its interconnect tax).
The $/1M-token figure that dominates AI pricing discussions is calculated on output tokens only. This hides several costs that are real and often large.
Input tokens require prefill, which is compute-intensive and also writes the KV cache into HBM. For a request with a 10,000-token prompt and 200-token output, the prefill cost is roughly 50× the cost of a 200-token decode run at similar batch. Labs that charge separately for input and output tokens (Anthropic, OpenAI) are properly accounting for this. Labs that charge a single output-token price are either subsidising inputs or pricing them in via a higher output price.
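The 50× figure is the FLOP ratio between prompt and output, which a one-liner confirms; note that prefill runs at far higher arithmetic intensity than decode, so its cost per FLOP is lower and the ratio bounds compute rather than dollars.

```python
# Prefill vs. decode FLOPs for one request (~2 FLOPs per parameter per token).
PARAMS = 70e9

def token_flops(n_tokens):
    return 2 * PARAMS * n_tokens

prompt_toks, output_toks = 10_000, 200
ratio = token_flops(prompt_toks) / token_flops(output_toks)
print(f"prefill/decode FLOP ratio: {ratio:.0f}x")  # 10,000 / 200 = 50x
```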
Time-to-first-token (TTFT) latency is a quality-of-experience metric that users care deeply about. Achieving a TTFT below 300ms for long-context requests requires either a dedicated prefill pool (disaggregation), very high batch priority for prefill, or prefill GPUs that are held partially idle to respond quickly. All three approaches increase cost. The TTFT SLO tightens the effective utilisation ceiling, which raises the capital component of per-token cost.
US East Coast electricity averages $0.07/kWh. Virginia (data centre corridor): $0.065/kWh. Northern Virginia: $0.058/kWh. Oregon (hydro): $0.045/kWh. Texas (wind): $0.038/kWh during off-peak. Singapore: $0.115/kWh. Germany: $0.165/kWh. Japan: $0.145/kWh. The power component of per-token cost can vary by more than 4× based purely on geography. Labs that serve globally at a single price must account for this in their blended cost model.
| Region | Power cost ($/kWh) | Power layer $/1M tokens | Total blended $/1M tokens |
|---|---|---|---|
| Oregon / Pacific NW (hydro) | $0.045 | $31.3 | $156 |
| Virginia (US East, grid) | $0.065 | $45.2 | $169 |
| Texas (wind mix) | $0.055 | $38.2 | $162 |
| Singapore | $0.115 | $79.9 | $215 |
| Germany | $0.165 | $114.6 | $249 |
| Japan | $0.145 | $100.8 | $235 |
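The power layer scales linearly with the electricity rate. A sketch that holds the non-power layers fixed at their baseline total (~$125/1M): this reproduces the US rows above to within rounding, and slightly understates the highest-rate regions, where facility overhead presumably scales as well.

```python
# Scale the power layer of the baseline cost table with local $/kWh.
BASE_RATE = 0.07          # $/kWh assumed in the baseline cost table
BASE_POWER_LAYER = 48.7   # GPU power layer, $/1M tokens at $0.07/kWh
NON_POWER_LAYERS = 125.0  # all other layers, held fixed in this sketch

def power_layer(rate):
    return BASE_POWER_LAYER * rate / BASE_RATE

def blended_total(rate):
    return NON_POWER_LAYERS + power_layer(rate)

for region, rate in [("Oregon", 0.045), ("Virginia", 0.065), ("Germany", 0.165)]:
    print(f"{region}: power ${power_layer(rate):.1f}, "
          f"total ~${blended_total(rate):.0f} per 1M output tokens")
```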
The cost model assumes three-year depreciation. But H100s deployed in 2023 were largely being displaced by H200s by mid-2025, and GB200s are arriving in volume in 2026. Real depreciation schedules for AI hardware are closer to 18–24 months at the frontier, not three years. Shortening depreciation to 2 years increases the capital layer from $60.6/1M to $90.9/1M, adding $30/1M to the total cost.
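The capital layer scales inversely with depreciation life, so the sensitivity is a one-line check against the $60.6/1M baseline from the cost table:

```python
# Capital layer ($/1M tokens) vs. depreciation life, inverse scaling.
BASE_CAPITAL = 60.6   # $/1M tokens at 3-year depreciation (cost table)

def capital_layer(dep_years):
    return BASE_CAPITAL * 3.0 / dep_years

for years in (3.0, 2.0, 1.5):
    print(f"{years:.1f}-year depreciation: ${capital_layer(years):.1f}/1M tokens")
```

An 18-month frontier schedule doubles the capital layer relative to the three-year baseline, which is why the depreciation assumption matters as much as the hardware price itself.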
The pricing wars of 2025–2026 — GPT-4o at $2.50, Claude at $3, Gemini Flash at $0.075, Deepseek at $1.10 — will eventually converge toward infrastructure cost. The labs that are currently pricing below infrastructure cost are buying developer adoption with venture capital. That is a legitimate strategy, but it is not a sustainable one.
The more interesting question is not which provider has the lowest sticker price today. It is which providers have the lowest structural cost — the cost they will converge to as compute scales, models mature, and the easy gains from quantization and batching have been fully captured.
The structural cost advantage comes from three places: cheap power and efficient cooling (geography and a low PUE); high sustained utilisation over a realistic depreciation schedule (the capital layer); and architectures that minimise data movement through quantization, aggressive batching, and disciplined KV cache management.
This brings us back to the opening argument of this essay, and to the central claim of the preceding work on memory-centric systems: the cost of a token is not primarily the cost of arithmetic. It is the cost of moving bytes — from NAND to HBM, from HBM to SRAM, from prefill node to decode node, from one GPU to another. The labs and system builders who understand and minimise that movement will have lower structural costs, lower energy consumption, and more margin to compete on price, capability, or latency — their choice.
A token is not a FLOP. It is a sequence of byte movements with a small amount of arithmetic attached. Price the byte movements, and you have priced the token.
© 2026 Manish KL. All rights reserved.