The Inference Unit Economics Time Bomb: Why $/Token Will Collapse (and What Survives)
- The pricing war and what it reveals
- The cost stack: which components fall and how fast
- Why compute costs are collapsing
- Why memory costs are the floor
- The network and cooling constants
- Deriving the physical cost floor
- Convergence timeline: when does pricing hit the floor?
- What survives: the two durable economic positions
- The infrastructure moat: what it actually requires
- Implications for infrastructure investment decisions today
1. The pricing war and what it reveals
In two years, GPT-4o output prices fell from $15/1M tokens to $2.50. Gemini Flash arrived at $0.075. DeepSeek-V3 at $0.28. Claude Haiku below $1. The price trajectory looks like a vertical drop. To an economist, this looks like a commodity market forming: many suppliers, falling production costs, prices converging toward marginal cost.
That framing is partially right. But it misses the structure of the cost. Not all components of inference cost fall at the same rate. And the components that fall slowest are the ones determined by physical constants — the energy cost of moving a byte through silicon, the thermal limits of a watt-dense rack, the bandwidth ceiling of a copper trace. These do not respond to Moore's Law the same way logic transistors do.
Understanding which costs fall fast and which are structural floors is the key to understanding where AI inference economics will land, and what kind of business can survive at those prices.
The core argument: $/token is converging to a physical floor, not to zero. The floor is set by the cost of memory bandwidth, interconnect, and cooling — all of which improve slowly. When compute costs fall below the floor, the economics of inference become entirely about who operates the cheapest memory and network infrastructure. That is a different business than it appears today.
2. The cost stack: which components fall and how fast
The prior essay in this series decomposed the cost of a token into its physical constituents. To reason about future prices, we need to categorize those constituents by their rate of cost reduction.
The critical observation is that the largest cost component — HBM bandwidth at ~38% — falls at the slowest rate, while compute, the component that falls fastest, contributes only ~28% today at normal batch sizes. This means that as hardware generations progress, the composition of cost shifts: compute approaches zero as a percentage of total cost, and memory bandwidth becomes the overwhelming driver.
| Cost Component | Today's Share | Reduction Rate | Mechanism | Share in 2028 (est.) |
|---|---|---|---|---|
| HBM bandwidth | ~38% | 1.5–2× per gen | HBM4 BW increase; pJ/bit improvements | ~52% |
| GPU compute (FLOP) | ~28% | 3–4× per gen | Transistor density, tensor core efficiency | ~8% |
| Capital depreciation | ~22% | 2–3× per gen | GPU price/performance, utilization improvements | ~22% |
| Cooling + power | ~12% | ~1.2× per gen | Cooling efficiency improvements; power density limits | ~18% |
By 2028, under reasonable hardware trajectory assumptions, HBM bandwidth and cooling together will represent roughly 70% of the cost of a token. Compute will be negligible. The economics of inference will have become the economics of efficient memory and thermal management.
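The mechanism behind the shifting shares can be sketched numerically: divide each component's cost by its per-generation reduction factor, then renormalize against the (smaller) total. The factors below are midpoints of the table's ranges; the table's 2028 column folds in additional assumptions (such as the utilization gains noted for capital), so the exact percentages differ, but the direction matches:

```python
# Illustrative mechanism: divide each component's cost by its per-generation
# reduction factor, then renormalize shares against the (smaller) total.
# Factors are midpoints of the table's ranges; the table's 2028 column
# folds in further assumptions, so exact percentages differ.
shares = {"hbm_bw": 0.38, "compute": 0.28, "capital": 0.22, "cooling": 0.12}
per_gen_reduction = {"hbm_bw": 1.75, "compute": 3.5, "capital": 2.5, "cooling": 1.2}

def project(shares, reduction, generations):
    # Absolute cost of each component after N generations...
    cost = {k: v / reduction[k] ** generations for k, v in shares.items()}
    # ...renormalized to shares of the new, smaller total.
    total = sum(cost.values())
    return {k: v / total for k, v in cost.items()}

for gen in range(3):  # today, +1 generation, +2 generations (~2028)
    p = project(shares, per_gen_reduction, gen)
    print(gen, {k: f"{v:.0%}" for k, v in p.items()})
```

Even with these rough factors, compute collapses toward single digits while the slow-falling components absorb the share it gives up.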
3. Why compute costs are collapsing
Compute cost per FLOP has been falling at approximately 3-4× per generation because of two compounding improvements: transistor density (more FLOPs per die) and architectural efficiency (more useful FLOPs per transistor through tensor core specialization). The H100 to B200 to Rubin trajectory delivers roughly this rate.
But compute cost per FLOP is not the same as compute cost per token. As we established in the prior essay, autoregressive decode is arithmetic-intensity-limited, not compute-limited. The GPU's tensor cores are idle more than 99% of the time during single-request decode because the work rate is bounded by HBM bandwidth, not arithmetic throughput. More FLOP/s on a faster chip does not help a workload that cannot feed those FLOPs with data.
Compute utilization vs. generation at decode time:

```
H100 SXM:      989 TFLOP/s (FP8),  3.35 TB/s HBM BW → ridge ≈ 295 FLOP/byte
               Decode AI ≈ 1.6 FLOP/byte → utilization ≈ 1.6/295 ≈ 0.5%
B200:         ~2,200 TFLOP/s (FP4), 8.0 TB/s HBM BW → ridge ≈ 275 FLOP/byte
               Decode AI ≈ 1.6 FLOP/byte → utilization ≈ 1.6/275 ≈ 0.6%
Rubin (est.): ~4,000 TFLOP/s,      ~12 TB/s HBM BW → ridge ≈ 333 FLOP/byte
               Decode AI ≈ 1.6 FLOP/byte → utilization ≈ 1.6/333 ≈ 0.5%

Pattern: compute utilization stays at ~0.5-0.6% regardless of generation.
What improves is HBM bandwidth, which directly reduces decode latency.
FLOP increases are not load-bearing for single-request decode.
```
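The utilization figures above are a direct roofline calculation, which a few lines make explicit (specs as quoted in the table; the Rubin numbers are estimates):

```python
# Roofline sketch: decode-time compute utilization per GPU generation.
# Ridge point = peak FLOP/s ÷ HBM bandwidth; a workload with arithmetic
# intensity (AI) below the ridge is memory-bound, and its compute
# utilization is roughly AI / ridge.
DECODE_AI = 1.6  # FLOP per byte moved, typical autoregressive decode

gpus = {
    # name: (peak TFLOP/s, HBM TB/s) -- Rubin figures are estimates
    "H100 SXM (FP8)": (989, 3.35),
    "B200 (FP4)": (2200, 8.0),
    "Rubin (est.)": (4000, 12.0),
}

for name, (tflops, tbps) in gpus.items():
    ridge = tflops / tbps      # FLOP/byte at which compute saturates
    util = DECODE_AI / ridge   # fraction of tensor-core peak actually used
    print(f"{name}: ridge ~ {ridge:.0f} FLOP/byte, utilization ~ {util:.2%}")
```

Across three generations the utilization stays pinned near half a percent, which is the point of the table: extra FLOP/s do not move decode economics.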
This is why compute cost falls so fast in the cost stack: each generation delivers more FLOP/s at roughly the prior generation's price, but during decode those FLOP/s do no useful work. The improvement that matters is the HBM bandwidth increase that comes alongside. Silicon is cheap to make more powerful; bandwidth is hard to scale.
4. Why memory costs are the floor
HBM bandwidth is limited by physics that don't respond to Moore's Law at the same rate as logic transistors. The relevant physical constraints are:
TSV density: HBM stacks multiple DRAM dies using Through-Silicon Vias (TSVs) — vertical copper connections passing through the silicon. TSV pitch is currently limited to ~50 μm, and increasing TSV density to increase bandwidth requires process advances in DRAM packaging that are separate from logic node improvements. HBM4 is expected to bring approximately 2× bandwidth increase over HBM3e, but this requires new packaging infrastructure that does not scale at logic cadence.
Energy per bit: The energy required to move a bit from DRAM to the compute die across the HBM interface is approximately 3.9 pJ/bit (HBM3e). The physics of DRAM read — charging a bitline, amplifying a sense signal, driving a TSV — has an irreducible energy floor. Engineering improvements can reduce overhead, but the fundamental electrical work required cannot be made arbitrarily small. Across multiple HBM generations, energy per bit has improved at roughly 1.3-1.5× per generation.
Capacity per package: HBM stacking is limited by the ability to stack additional dies and the thermal management of the stack. HBM3e provides 24-36 GB per package. HBM4 targets 48 GB. The increase is driven by die stacking height improvements, but thermal limits constrain how many dies can be stacked before the stack becomes unmanageable.
None of these constraints will be removed by better logic fabrication. They are packaging and DRAM physics constraints. The consequence is that HBM bandwidth improvement per dollar spent on silicon will continue to trail logic improvement by roughly 2× per generation.
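The energy-per-bit trajectory can be compounded forward to show how slowly the floor moves. A quick sketch from HBM3e's ~3.9 pJ/bit, using factors inside the 1.3-1.5× per-generation range above; the generation names beyond HBM4 and the exact factors are hypothetical:

```python
# Compounding the slow pJ/bit improvement forward from HBM3e.
# Factors fall inside the 1.3-1.5x/generation range noted above;
# the HBM4e/HBM5 names and their exact factors are hypothetical.
pj = 3.9  # HBM3e: ~3.9 pJ per bit across the interface
for gen, factor in [("HBM4", 1.5), ("HBM4e (hyp.)", 1.4), ("HBM5 (hyp.)", 1.3)]:
    pj /= factor
    print(f"{gen}: ~{pj:.2f} pJ/bit")  # 2.60, then 1.86, then 1.43
```

Three generations buy roughly a 2.7× improvement, about what logic compute gains in a single generation, which is the structural point of this section.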
5. The network and cooling constants
Two other cost components have even slower improvement trajectories: network interconnect power and cooling cost per watt.
Interconnect power: The energy cost of transmitting a bit between two chips is approximately 5-12 pJ/bit for NVLink, 10-20 pJ/bit for InfiniBand at 400G, and 15-30 pJ/bit for long-reach optical transceivers (before co-packaged optics). These numbers improve slowly — driven by laser and modulator efficiency improvements in photonics — but the improvement rate is roughly 1.2-1.4× per generation. For MoE models where inter-GPU dispatch is structurally required, interconnect power is not optional and does not fall fast.
Cooling cost per watt: The ratio of cooling power consumed per watt of compute power delivered is the Power Usage Effectiveness overhead. Modern direct liquid cooling achieves PUE of 1.1-1.15, meaning 10-15% overhead for cooling on top of compute power. This is close to the thermodynamic limit for liquid cooling at current fluid temperatures. Vera Rubin's 45°C supply temperature design pushes this toward the limit. Further improvements will come from warmer supply temperatures and direct chip immersion, but the headroom is small — perhaps 1.1× improvement before hitting fundamental thermodynamic limits.
Cooling is currently 12% of token cost. It falls at ~1.2× per generation while compute falls at 3-4×. By 2028, cooling will be ~18-20% of token cost. By 2030, it may be 25-30%. The absolute cost is falling, but cooling is the slowest-falling component, and together with memory bandwidth it will come to dominate the cost structure as compute approaches zero.
6. Deriving the physical cost floor
The physical cost floor is the minimum $/1M tokens achievable if all other costs were zero and only the irreducible physical costs remained. We can estimate it from the energy physics.
Physical cost floor derivation — 70B model, 2028 hardware estimate:

```
Assumptions (2028 hardware estimate, Rubin-class):
  Model: 70B parameters, FP4 quantization → ~35 GB weights
  HBM bandwidth: 12 TB/s per package
  Energy per bit (HBM4): ~2.5 pJ/bit (1.5× improvement over HBM3e)
  Context: 2K tokens (short-context floor case)
  Batch size: 1 (worst case; batching amortizes the weight reads)

Memory energy per decode step:
  Bytes read = 35 GB (weights) + 1.5 GB (KV, 2K context) ≈ 36.5 GB
  Energy     = 36.5 × 10⁹ bytes × 8 bits/byte × 2.5 × 10⁻¹² J/bit
             = 0.73 J per decode step

Electricity cost:
  Industrial power rate: $0.04/kWh (hyperscale with long-term contracts)
  PUE: 1.1 (best-case DLC)
  Effective rate with PUE: $0.044/kWh = $1.22 × 10⁻⁸ per joule

Memory energy cost per token:
  = 0.73 J × $1.22 × 10⁻⁸/J = $8.9 × 10⁻⁹ per token
  → $8.9 × 10⁻³ ≈ $0.009 per 1M tokens (memory energy only)

Adding interconnect, cooling overhead, and capital at efficient utilization:
  Total physical floor estimate (2028): ~$0.015 – $0.025 / 1M output tokens
```
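The same derivation in executable form; every input is a forward-looking estimate from the assumptions above, not a measured value:

```python
# Memory-energy floor for one decode step, following the derivation above.
# All inputs are forward-looking 2028 estimates, not measured values.
weights_bytes = 35e9   # 70B params at FP4 (~0.5 byte/param)
kv_bytes      = 1.5e9  # KV cache traffic at 2K context
pj_per_bit    = 2.5    # HBM4 estimate (pJ moved per bit)
power_rate    = 0.04   # $/kWh, hyperscale long-term contract
pue           = 1.1    # best-case direct liquid cooling

bytes_per_token = weights_bytes + kv_bytes    # batch 1: full weight read
joules_per_token = bytes_per_token * 8 * pj_per_bit * 1e-12
dollars_per_joule = power_rate * pue / 3.6e6  # 1 kWh = 3.6e6 J
cost_per_million = joules_per_token * dollars_per_joule * 1e6

print(f"energy/token: {joules_per_token:.2f} J")                  # ~0.73 J
print(f"memory-energy floor: ${cost_per_million:.4f}/1M tokens")  # ~$0.0089
```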
The $0.015-0.025/1M token range is not a pricing prediction — it is the energy physics floor below which it is not possible to profitably serve tokens at scale, regardless of how efficient the software or how aggressive the capital structure. It is the number where the laws of thermodynamics, not market competition, become the binding constraint.
For context: GPT-4o output is currently priced at $2.50/1M tokens. The floor is roughly 100-170× below current pricing. But the floor is not the interesting number — the convergence path is.
7. Convergence timeline: when does pricing hit the floor?
Current pricing is a mix of cost recovery, competitive positioning, and margin capture. The pricing war is compressing margins across the industry. The question is how quickly prices approach the physical floor.
Price convergence model — output token pricing trajectory:

```
Reference points (approximate, 2024-2026):
  GPT-4o output:      Apr 2024: $15/1M → Apr 2026: $2.50/1M (6× in 24 months)
  Gemini Flash:       launch at $0.075/1M → stable at ~$0.07/1M
  DeepSeek-V3:        ~$0.28/1M output
  Haiku-class (est.): ~$0.25 – $0.80/1M output

Price reduction rate:
  ~3× per 12-18 months for frontier models
  ~1.5× per 12-18 months for commodity-tier models

Projected convergence:
  Frontier-tier (GPT-4o class):   ~$0.50/1M by mid-2027, ~$0.10 by end-2027
  Commodity-tier (Flash class):   ~$0.02-0.05/1M by mid-2027
  Physical floor (2028 estimate): ~$0.015-0.025/1M

Implication: commodity inference pricing hits the physics floor
approximately late 2027 to mid-2028 at the current trajectory.

What happens then: price competition continues, but below the floor,
providers are subsidizing inference. That is sustainable only with a
quality moat (charging a premium) or an efficiency moat (genuinely
lower cost).
```
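The structural point, that prices descend geometrically until they clamp at the floor, can be captured in a toy model. The start price, cadence, and floor below are illustrative values drawn from the figures above:

```python
def price_at(p0, reduction, period_months, months, floor):
    """Price after `months`, falling `reduction`x every `period_months`,
    clamped at the physical floor: competition cannot push below it."""
    return max(p0 / reduction ** (months / period_months), floor)

# Commodity tier, illustrative parameters: ~$0.07/1M start, 1.5x per
# 12 months (fast end of the stated range), floor $0.02/1M (midpoint
# of the $0.015-0.025 estimate).
for m in (0, 12, 24, 36, 48):
    print(m, round(price_at(0.07, 1.5, 12, m, 0.02), 4))
```

The clamp is the structural feature: however fast the descent, the trajectory flattens at the floor rather than continuing toward zero.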
The convergence timeline is uncertain — hardware roadmaps slip, new architectures (speculative decoding, diffusion LMs) change the cost structure, new entrants subsidize prices. But the structural dynamic is not uncertain: the cost floor exists, and competitive pressure is pushing prices toward it from above.
8. What survives: the two durable economic positions
When prices hit the physics floor, the economics of the inference market become clear. There are exactly two positions from which a provider can operate profitably:
8.1 The quality moat: charging a premium that the floor cannot compress
If your model is measurably better than commodity alternatives on tasks that matter to buyers — coding, reasoning, multimodal understanding, domain-specific accuracy — you can charge above the physics floor. The premium is not infinite: it is bounded by how much better your model is and how much customers value that difference. But it is real and durable for as long as the quality gap exists.
The quality moat requires continuous investment in model capability — training, data, RLHF, research. It is not a one-time advantage. Competitors with the same compute budget can eventually close quality gaps. The companies that will hold quality moats in 2028 are the ones making the largest training bets today. This is a capital-intensive strategy where the moat is the accumulated investment, not any particular technical trick.
8.2 The efficiency moat: operating below the average physics floor
The physics floor is an average, not a constant. It is the energy cost of moving bytes for a specific model, on specific hardware, at a specific utilization rate. A provider who can operate with higher hardware utilization, better memory efficiency, superior batching, and lower cooling overhead can operate at a cost below the average floor — serving profitably at prices where competitors cannot.
This is the infrastructure moat. It is not about having better hardware than your competitors — in a world where H200s and B200s are available to everyone, hardware is commoditized. It is about getting more useful work out of the same hardware through better memory orchestration, more efficient batching, topology-aware placement, and the other systems optimizations this blog covers.
| Position | Moat Source | Durability | Capital Intensity | Vulnerability |
|---|---|---|---|---|
| Quality moat | Model capability advantage | Medium — requires continuous training investment | Very high (training compute) | Open-source models closing quality gap |
| Efficiency moat | Infrastructure cost advantage | High — systems expertise compounds | High (infra expertise + engineering) | New hardware architectures resetting efficiency advantages |
| Commodity position | None | Low — no pricing power at physics floor | Low to medium | Margins competed away with no offsetting advantage |
The dangerous position is the commodity middle: a provider with a model that is not measurably better than open-source alternatives, and infrastructure that is not significantly more efficient than the market average. That position has no pricing power at the physics floor and no cost advantage. It exists today because prices have not reached the floor — there is enough margin in the $0.07-0.50/1M range to fund operations without genuine competitive advantage. When prices hit $0.02-0.03, that middle disappears.
9. The infrastructure moat: what it actually requires
The efficiency moat is not achieved by buying newer hardware than your competitors. As established, hardware is commoditized — the same GPUs are available to everyone on the same timeline. The moat comes from systems-level advantages that are harder to replicate than hardware procurement.
Five infrastructure capabilities determine who can operate below the average physics floor:
Memory hierarchy efficiency. The essays on KV cache management, HBM fragmentation, multi-tier residency, and regret-aware eviction describe systems that recover 15-25% of HBM capacity from wasted fragmentation and poor eviction policy. A provider operating with better memory policy genuinely uses less hardware per served request. The advantage compounds: better utilization means fewer GPUs needed, which means lower capital cost per token.
Batching at the physics limit. The cost analysis in the prior essay showed that batch size is the most powerful lever in inference economics — batching to 64 reduces per-token memory cost by 22× versus batch 1. The difference between a provider operating at average batch size 8 and one operating at average batch size 48 is roughly 3-4× in memory cost per token. This requires sophisticated continuous batching, preemption policy, and request scheduling — not just hardware.
Quantization without quality regression. FP8 versus FP16 is roughly 2× reduction in HBM traffic for the same model. FP4, when it preserves quality on the tasks buyers care about, is another 2× reduction. The providers who can push quantization levels while maintaining quality have a direct cost advantage. This requires model-level expertise (which quantization layers matter for quality?) and systems-level expertise (how does the quantization scheme interact with kernel performance?).
Topology and locality optimization. As argued in the network topology essay: topology-aware MoE placement, prefill-decode pairing, and KV routing recover 30-50% of network bandwidth that topology-unaware systems waste. At scale, that 30-50% is directly reflected in interconnect power and latency.
Compiler and kernel quality. The gap between naive PyTorch execution and compiler-optimized kernels is 3-8× in throughput for the same hardware. Providers who invest in compiler infrastructure — custom kernels, kernel fusion, memory layout optimization — operate at fundamentally different throughput than those who rely on off-the-shelf serving libraries.
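Of these capabilities, the batching lever is the easiest to quantify. A sketch using the byte accounting from the floor derivation in section 6 (35 GB of FP4 weights, 1.5 GB of KV traffic per request); these assumptions give a ratio near 18×, somewhat below the prior essay's 22×, which presumably used different model and context assumptions:

```python
# Per-token HBM traffic vs. batch size: the weight read happens once per
# decode step and is amortized across the batch; KV reads are per-request.
WEIGHTS_GB = 35.0  # 70B params at FP4 (assumption from the floor derivation)
KV_GB      = 1.5   # KV traffic per request at 2K context (same assumption)

def bytes_per_token_gb(batch):
    return WEIGHTS_GB / batch + KV_GB

b1, b64 = bytes_per_token_gb(1), bytes_per_token_gb(64)
print(f"batch 1:  {b1:.1f} GB/token")
print(f"batch 64: {b64:.2f} GB/token  ({b1 / b64:.0f}x less)")
```

The amortization saturates: once the weight term shrinks below the per-request KV term, further batch growth stops paying, which is why long contexts blunt this lever.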
10. Implications for infrastructure investment decisions today
The convergence of pricing to the physics floor is not abstract. It should be influencing infrastructure investment decisions that are being made right now, eighteen to twenty-four months before the floor is reached.
For providers: The infrastructure investments that will matter at the floor are not hardware — they are systems engineering. Memory orchestration, compiler quality, batching systems, topology-aware scheduling. These systems take 18-24 months to build and harden in production. The window to build them before the floor is reached is narrow. Providers who are still relying on commodity serving infrastructure in mid-2026 will not have production-hardened efficiency systems by the time the floor matters.
For buyers: The $/token price will continue to fall. Optimizing application architecture for current prices — minimizing token consumption, avoiding long contexts, caching aggressively — will look different when tokens cost $0.02/1M than when they cost $2.00/1M. Application architectures that trade token cost for latency, quality, or complexity may become economically viable. The design space changes significantly at the floor.
For hardware vendors: The shift of cost structure toward memory and cooling means that the hardware improvements that matter most for inference economics are not FLOP/s increases — they are HBM bandwidth improvements, packaging innovations that increase HBM capacity per package, and thermal management advances that reduce cooling overhead. The FLOP/s race has run out of load-bearing relevance for inference. The bandwidth and thermal race is what matters.
The time bomb metaphor in this essay's title refers to the compressed timeline in which these dynamics play out. The prices and the floor are converging faster than the systems and strategies needed to operate at the floor are being built. The providers who understand the physics of the floor and are building for it today will be operating efficiently at commodity prices. The ones who do not will find themselves at the floor without the infrastructure to be profitable at it.
Physics sets the floor. Systems engineering determines who can stand on it.