Google I/O 2026 · Systems Deep Dive

The Quadrillion-Token Era Has Arrived

At Google I/O 2026, Sundar Pichai didn't lead with a new model name or a benchmark score. He led with a number: 3.2 quadrillion tokens per month. That contrast — product metrics, not research metrics — is the signal.

The strategic implication is bigger than raw AI usage: inference is becoming a planetary-scale memory-orchestration problem.

// The bottleneck has moved from FLOPs to memory bandwidth, capacity, and placement. We are no longer compute-bound. We are state-bound.
| Month | Tokens/month across Google surfaces | Change |
| --- | --- | --- |
| May 2024 | 9.7T | - |
| May 2025 | 480T | 49.5× growth year-over-year |
| May 2026 | 3.2Q+ | ~7× again; the compounding is real |

Token volume is compounding faster than infrastructure comfort.

[Figure: log-scale intuition for monthly token volume: 9.7T (2024), 480T (2025), 3.2Q+ (2026)]

Tokens are now a first-class unit of infrastructure demand — not an approximation of model activity, but the load metric itself.

[Figure: users → inference runtime at 1.23B tok/s → KV cache, backed by a memory hierarchy: SRAM (fastest), HBM (bandwidth), DRAM (capacity), CXL (fabric), NVMe (cold KV)]

Cost per token fell. Demand exploded.

Public API pricing is not Google's internal cost — but it proxies the economic direction. Jevons' paradox at planetary scale: cheaper tokens create more token demand, not less.

| Proxy (Gemini API) | Approx. public price |
| --- | --- |
| Low-cost input tokens | ~$0.10 / 1M tokens |
| Low-cost output tokens | ~$0.40 / 1M tokens |
| Higher-tier input tokens | ~$1.50+ / 1M tokens |
| Higher-tier output tokens | ~$9.00+ / 1M tokens |

At 3.2Q tokens/month, a $0.01 difference in cost per million tokens equals $32M/month. Efficiency is a first-order financial variable, not a footnote.
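The sensitivity claim above is plain arithmetic, and a few lines make it easy to rerun with other deltas. This is only a sketch: the function name is made up for illustration, and the token volume is the keynote figure, not an internal cost model.

```python
def monthly_cost_delta(tokens_per_month: float, delta_usd_per_million: float) -> float:
    """Monthly dollar impact of changing the cost per million tokens by a fixed amount."""
    millions_of_tokens = tokens_per_month / 1_000_000
    return millions_of_tokens * delta_usd_per_million


TOKENS_PER_MONTH = 3.2e15  # 3.2 quadrillion tokens/month, as quoted in the text

delta = monthly_cost_delta(TOKENS_PER_MONTH, 0.01)
print(f"${delta / 1e6:.0f}M per month")
```

Swapping in a $0.001 delta shows that even a tenth of a cent per million tokens is still a seven-figure monthly line item at this volume.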

The hidden unit is not the text token.

A token is a few bytes as text. But during autoregressive inference, each token creates attention state across all layers — the KV cache. That state is the real infrastructure unit.

KV bytes/token = 2 × layers × KV_heads × head_dim × bytes_per_value

The "2" is for keys and values. Critically, this footprint scales linearly with sequence length N. A 64K-token context window carries 64,000× more KV residency than a single token. Long-context agent sessions are memory multipliers, not just "longer requests."
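To make the formula concrete, here is a minimal sketch that evaluates it for two illustrative model shapes. The layer counts, KV head counts, and head dimensions are assumptions chosen to resemble openly documented 8B- and 70B-class GQA models at 16-bit precision; they are not Google's architectures, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVConfig:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for FP16/BF16, 1 for INT8/FP8

    def bytes_per_token(self) -> int:
        # The leading 2 covers one key vector and one value vector per layer.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_value

    def bytes_for_context(self, tokens: int) -> int:
        # KV residency grows linearly with the number of tokens held in context.
        return tokens * self.bytes_per_token()


# Assumed, illustrative shapes (not taken from the article's sources).
GQA_8B = KVConfig(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
GQA_70B = KVConfig(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

for name, cfg in (("8B-class GQA", GQA_8B), ("70B-class GQA", GQA_70B)):
    per_token_kib = cfg.bytes_per_token() / 1024
    ctx_gib = cfg.bytes_for_context(64 * 1024) / 1024**3
    print(f"{name}: {per_token_kib:.0f} KiB/token, {ctx_gib:.0f} GiB for a 64K-token context")
```

Under these assumptions a single 64K-token session on the smaller shape already holds several gigabytes of KV state, which is why long-context sessions behave as memory multipliers.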

We are no longer in the GPU shortage era. We are in the memory residency era — and most infrastructure teams aren't ready for what that implies.

How much RAM does one token actually consume?

Exact numbers depend on architecture, precision, layers, KV heads, head dimension, compression, and attention style (MHA vs GQA vs MQA):

| Model / attention style | Approx KV RAM per token | Why it compounds fast |
| --- | --- | --- |
| Small Flash-like efficient model | 32–64 KB | Still ~12–24 PB live at a 5-minute window at Google scale |
| 8B-class GQA model | ~128 KB | Planning anchor; ~9.4 PB resident at a 60-second window |
| 70B-class GQA model | ~320 KB | Stresses HBM capacity within minutes at this throughput |
| Non-optimized large MHA model | 512 KB – 1 MB | Becomes capacity-bound faster than compute-bound at any load |

Because footprint is linear in sequence length, a model serving 64K-token agent sessions carries 500–2000× more KV residency per session than the same model serving short chat completions (that ratio corresponds to completions of roughly 32–128 tokens). Long context is a memory multiplier with no ceiling in sight.

Convert 3.2Q/month into tokens per second.

3.2 × 10¹⁵ ÷ (30 × 24 × 3600) ≈ 1.23 × 10⁹ tokens/sec

Roughly 1.23 billion tokens per second, continuously, averaged across the month. The average hides the real problem.

Peak traffic is materially higher. Real systems must provision for burstiness, geography, model tiering, product-specific latency SLOs, and traffic spikes. The burst provisioning budget is where memory costs spike fastest.
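The conversion, plus a placeholder for burst provisioning, can be sketched as follows. The `peak_to_avg` ratio is a hypothetical parameter: the text only says peak is materially higher, so the 2.0 used here is an illustrative value, not a measured one.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, matching the calculation in the text


def avg_tokens_per_second(tokens_per_month: float) -> float:
    return tokens_per_month / SECONDS_PER_MONTH


def provisioned_tokens_per_second(tokens_per_month: float, peak_to_avg: float) -> float:
    # peak_to_avg is an assumed burst multiplier; real values depend on product and region.
    return avg_tokens_per_second(tokens_per_month) * peak_to_avg


avg = avg_tokens_per_second(3.2e15)
peak = provisioned_tokens_per_second(3.2e15, peak_to_avg=2.0)  # 2.0 is a placeholder
print(f"average:     {avg / 1e9:.2f}B tokens/sec")
print(f"provisioned: {peak / 1e9:.2f}B tokens/sec at an assumed 2x peak-to-average")
```

Whatever the real ratio is, it multiplies every memory estimate that follows, since live KV state scales with the token rate.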

Live memory, not lifetime tokens, is the bottleneck.

Google does not need all monthly tokens resident forever. The systems problem is how many tokens are live simultaneously — across active conversations, agent threads, and multimodal streams.

Live KV memory = token rate × live residency window × KV bytes/token

The residency window is set by latency targets, context lengths, and session lifetimes — not business logic. It is a systems parameter, and it is growing with every new product feature.

Estimated live RAM at today's Google-scale token rate

Using 1.23B tokens/sec as the average rate. Systems estimates for infrastructure thinking — not claimed Google internal numbers.

| KV per token | 60 sec live window | 5 min live window | Representative model class |
| --- | --- | --- | --- |
| 64 KB | ~4.7 PB | ~23.6 PB | Small / Flash-style |
| 128 KB | ~9.4 PB | ~47.2 PB | 8B-class GQA (planning anchor) |
| 320 KB | ~23.6 PB | ~118 PB | 70B-class GQA |
| 512 KB | ~37.8 PB | ~189 PB | Large non-optimized MHA |

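Math check: 1.23 × 10⁹ × 60 × 128,000 ≈ 9.45 × 10¹⁵ bytes ≈ 9.4 PB, using decimal units (1 KB = 1,000 bytes) as the table does. With 128 KiB = 131,072 bytes the product is ≈ 9.67 × 10¹⁵ bytes, which is 9.67 PB or about 8.6 PiB. The spread comes from the units convention, not from the model.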
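The live-memory formula is simple enough to tabulate directly. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10¹⁵ bytes), which is the convention that reproduces the table's figures; the function name is illustrative.

```python
AVG_TOKENS_PER_SEC = 1.23e9  # rounded average rate used in the text
KB = 1_000                   # decimal kilobyte (assumption; see lead-in)
PB = 1e15                    # decimal petabyte


def live_kv_bytes(token_rate: float, window_s: float, kv_bytes_per_token: float) -> float:
    """Live KV memory = token rate x live residency window x KV bytes/token."""
    return token_rate * window_s * kv_bytes_per_token


print("KV/token   60 s window   5 min window")
for kv_kb in (64, 128, 320, 512):
    at_60s = live_kv_bytes(AVG_TOKENS_PER_SEC, 60, kv_kb * KB) / PB
    at_5min = live_kv_bytes(AVG_TOKENS_PER_SEC, 300, kv_kb * KB) / PB
    print(f"{kv_kb:>4} KB   {at_60s:8.1f} PB   {at_5min:9.1f} PB")
```

Because the formula is a pure product, doubling any one of the three factors doubles the resident footprint, which is why window length and per-token footprint matter as much as raw traffic.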

The next AI bottleneck is not only FLOPS.
It is memory residency, movement, and orchestration.

The token lifecycle is splitting at scale.

At quadrillion-token throughput, monolithic inference clusters give way to disaggregated architectures — because Prefill and Decode have fundamentally different resource profiles and can no longer share the same cluster efficiently.

Prefill Stage: Process the Prompt
Parallel attention over all input tokens simultaneously. Runs once per request.
compute-bound · high FLOPS utilization · parallelizable

Decode Stage: Generate Each Token
Autoregressive, one token at a time. KV cache grows with every step generated.
memory-bandwidth-bound · KV cache hot path · latency-sensitive

You cannot optimize both stages on the same cluster. The industry is moving toward dedicated Prefill clusters (FLOPS-dense) and dedicated Decode clusters (memory-optimized, high-bandwidth HBM). KV state is handed off between them over a low-latency transfer: at 128 KB/token that is hundreds of megabytes for a few thousand tokens of context, and several gigabytes for an active 64K-token session.
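A rough sizing of the Prefill-to-Decode handoff can be sketched like this. The 128 KB/token footprint is the planning anchor from earlier; the 50 GB/s link bandwidth is a hypothetical figure chosen only to show the shape of the calculation, not a property of any particular interconnect.

```python
def kv_handoff_bytes(context_tokens: int, kv_bytes_per_token: int) -> int:
    """Size of the KV state that must move from a Prefill node to a Decode node."""
    return context_tokens * kv_bytes_per_token


def transfer_seconds(num_bytes: int, link_gbytes_per_sec: float) -> float:
    """Idealized transfer time, ignoring protocol overhead and contention."""
    return num_bytes / (link_gbytes_per_sec * 1e9)


KV_BYTES_PER_TOKEN = 128_000  # 128 KB planning anchor, decimal units
LINK_GB_PER_SEC = 50.0        # hypothetical interconnect bandwidth (assumption)

for context in (4_096, 65_536):
    size = kv_handoff_bytes(context, KV_BYTES_PER_TOKEN)
    secs = transfer_seconds(size, LINK_GB_PER_SEC)
    print(f"{context:>6} tokens: {size / 1e9:.2f} GB, ~{secs * 1e3:.0f} ms at {LINK_GB_PER_SEC:.0f} GB/s")
```

Even under this idealized model, the handoff for a long-context session takes long enough to matter against a latency SLO, which is part of why the transfer path is treated as a design problem in its own right.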

The quadrillion-token era doesn't just stress hardware. It forces a different cluster topology — one the industry is still actively designing.

The new inference memory hierarchy

HBM capacity scaling cannot keep pace with context window growth. The runtime must treat local accelerator memory as a cache for a massive, distributed, multi-tiered memory fabric.

| Tier | Characteristics | Role |
| --- | --- | --- |
| Tensor SRAM / Registers | On-chip, nanosecond access; the hot path | fastest |
| HBM (High Bandwidth Memory) | 2–3 TB/s bandwidth, but capacity-limited (~80 GB/chip) | bandwidth |
| Host DRAM | Slower but 4–8× more capacity per accelerator node | capacity |
| CXL Pooled Memory | Shared fabric across nodes; scale beyond local memory | fabric scale |
| NVMe / SSD Spill | Cold KV cache, prefix archives, evicted sessions | cold KV |

What systems will need at this scale

| Technique | What it does |
| --- | --- |
| GQA / MQA | Reduce KV head duplication; 4–8× memory savings |
| PagedAttention | Virtualize KV-cache allocation like OS paging |
| Prefix caching | Reuse shared prompt state across requests |
| KV quantization | INT8/FP8 compression; 2–4× footprint reduction |
| CXL pooling | Scale KV capacity beyond per-node HBM |
| Semantic eviction | Keep reusable state; drop ephemeral state intelligently |
| Disaggregated serving | Split Prefill and Decode by resource profile |
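The paging idea behind PagedAttention can be illustrated with a toy allocator: KV memory is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks. This is a simplified sketch for intuition, not the implementation used by any serving framework; the class and method names are invented for the example.

```python
class PagedKVAllocator:
    """Toy block allocator: sequences grow one token at a time, one block at a time."""

    def __init__(self, num_blocks: int, block_tokens: int = 16) -> None:
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))     # physical block ids not in use
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> physical blocks
        self.seq_lens: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve room for one more token; return the physical block that holds it."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_tokens == 0:
            # The last block is full (or the sequence is new): take a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must evict or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_tokens=2)
for _ in range(3):
    allocator.append_token("session-a")
print(allocator.block_tables["session-a"], "free:", len(allocator.free_blocks))
allocator.free("session-a")
print("free after release:", len(allocator.free_blocks))
```

The point of the design is that a sequence wastes at most one partially filled block, rather than reserving a contiguous region sized for its maximum possible length.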

What this means if you're serving models in 2026

This analysis is not just about Google. The same physics applies at any scale — and the architectural choices compound faster than the hardware roadmap.

HBM4 and beyond

HBM4 roughly doubles memory bandwidth over HBM3e. But context windows are growing faster. By the time HBM4 ships at scale, 1M-token contexts will be common — and at 128KB/token, that's 128 GB of KV state per session. Bandwidth is necessary but not sufficient. Capacity is the new wall.
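The capacity argument reduces to a short calculation. The sketch below uses the 128 KB/token anchor and the ~80 GB/chip HBM figure quoted earlier, and it counts KV state only; model weights and activations would need additional room, so the chip count is a lower bound under these assumptions.

```python
import math


def kv_gb_per_session(context_tokens: int, kv_kb_per_token: float) -> float:
    """KV state for one session in decimal GB (1 KB = 1,000 bytes)."""
    return context_tokens * kv_kb_per_token * 1_000 / 1e9


def chips_for_kv(kv_gb: float, hbm_gb_per_chip: float) -> int:
    """Minimum chips needed to hold the KV state alone, ignoring weights and activations."""
    return math.ceil(kv_gb / hbm_gb_per_chip)


kv_gb = kv_gb_per_session(1_000_000, 128)
print(f"{kv_gb:.0f} GB of KV state for one 1M-token session")
print(f"at least {chips_for_kv(kv_gb, 80)} chips at 80 GB of HBM each, for KV alone")
```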

Disaggregated memory serving

The correct mental model is no longer "GPU cluster." It's a multi-tier memory fabric with compute attached. Prefill clusters handle prompt processing; Decode clusters manage live KV state; CXL fabrics pool memory across nodes. Designing without this split leaves FLOPs-per-dollar on the table and memory bandwidth as a hard ceiling.

Token warehouses

Long-running agents, persistent memory, and shared system prompts are creating a new primitive: the token warehouse — a persistent, queryable store of KV state that survives across sessions. This is not a cache. It's a database. And its access patterns (hot/warm/cold, semantic eviction, prefix reuse) look more like a storage engine than a GPU kernel.
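One way to picture the prefix-reuse access pattern is a store keyed by hashes of block-aligned token prefixes, so a new request can ask how many of its leading tokens already have KV state available. The sketch below is a toy model of that lookup only: it tracks which prefixes exist, not the KV tensors themselves, and every name in it is hypothetical.

```python
import hashlib
from typing import Sequence


class PrefixStore:
    """Toy index of block-aligned token prefixes whose KV state is assumed to be stored."""

    def __init__(self, block_tokens: int = 4) -> None:
        self.block_tokens = block_tokens
        self._known: set[str] = set()

    def _key(self, prefix: Sequence[int]) -> str:
        return hashlib.sha256(",".join(map(str, prefix)).encode()).hexdigest()

    def _block_ends(self, n_tokens: int) -> range:
        # Prefix lengths that end exactly on a block boundary.
        return range(self.block_tokens, n_tokens + 1, self.block_tokens)

    def put(self, tokens: Sequence[int]) -> None:
        """Record every block-aligned prefix of a processed prompt."""
        for end in self._block_ends(len(tokens)):
            self._known.add(self._key(tokens[:end]))

    def reusable_tokens(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV state could be reused instead of recomputed."""
        matched = 0
        for end in self._block_ends(len(tokens)):
            if self._key(tokens[:end]) not in self._known:
                break
            matched = end
        return matched


store = PrefixStore(block_tokens=4)
store.put(list(range(8)))                            # e.g. a shared system prompt
print(store.reusable_tokens(list(range(10))))        # same prompt plus a new suffix
print(store.reusable_tokens([0, 1, 2, 3, 9, 9, 9, 9]))
```

A real warehouse would attach tier placement and eviction metadata to each entry, which is where the resemblance to a storage engine comes from.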

Extrapolation: if 7× yearly growth continued

Not a forecast — a stress test to surface what must be solved before the numbers arrive.

| Year | Tokens/month | Avg tokens/sec | Live KV @ 128KB, 60s window | What it forces |
| --- | --- | --- | --- | --- |
| 2026 | 3.2Q | 1.23B/s | ~9.4 PB | Multi-node HBM pooling |
| 2027 | 22.4Q | 8.64B/s | ~66 PB | CXL fabric at datacenter scale |
| 2028 | 156.8Q | 60.5B/s | ~465 PB | Token warehouses, semantic eviction, new memory media |

At the 2028 row, memory placement, compression, DMA scheduling, and workload-aware memory fabrics are not optimization targets. They are the critical path.
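The stress test can be regenerated from its inputs, which makes it easy to try a different growth factor or residency window. This sketch assumes decimal units and a 30-day month; because it uses the unrounded token rate, its output differs slightly from the table (for example, roughly 9.5 PB rather than ~9.4 PB for 2026).

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def stress_test(
    base_tokens_per_month: float,
    growth: float,
    years: int,
    start_year: int = 2026,
    kv_bytes_per_token: int = 128_000,  # 128 KB planning anchor, decimal units
    window_s: int = 60,
) -> list[tuple[int, float, float, float]]:
    """Return (year, tokens/month, avg tokens/sec, live KV in PB) for each year."""
    rows = []
    for i in range(years):
        tokens = base_tokens_per_month * growth**i
        rate = tokens / SECONDS_PER_MONTH
        live_pb = rate * window_s * kv_bytes_per_token / 1e15
        rows.append((start_year + i, tokens, rate, live_pb))
    return rows


for year, tokens, rate, live_pb in stress_test(3.2e15, growth=7, years=3):
    print(f"{year}: {tokens / 1e15:6.1f}Q tokens/month, {rate / 1e9:5.2f}B tok/s, {live_pb:6.1f} PB live KV")
```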

Compiler-controlled execution

Static scheduling reduces runtime chaos by preplanning memory movement and buffer reuse — treating memory traffic as a first-class compilation target, not an afterthought added in the optimizer.

KV-aware memory controllers

Next-generation controllers may need to understand inference state semantics — distinguishing hot reusable state, streaming decode state, and ephemeral one-shot state — to make intelligent placement decisions in hardware.

OS / runtime co-design

The boundary between OS, accelerator runtime, interconnect, and memory fabric is blurring. The systems software stack for a trillion-parameter inference cluster in 2028 will look nothing like today's CUDA + PyTorch + Linux.

Conclusion

Google's 3.2 quadrillion tokens/month milestone is not an AI adoption metric. It is a systems warning. Inference at this scale is no longer about feeding tensor cores — it is about deciding which memory state should stay close, which should move, which should be compressed, which should be evicted, and which should be reused across sessions.

Prefill/Decode disaggregation, the emergence of token warehouses, and the rise of CXL memory fabrics are not incremental improvements to the GPU-centric architecture of 2020. They are a different architecture entirely — one the industry is building in real time.

The AI era is shifting from compute-centric scaling to memory-orchestration-centric scaling.

The next 10× won't come from bigger models.

It will come from killing the memory wall.

The teams who treat memory as the first-order constraint — not an optimization to revisit later — will define what production AI infrastructure looks like in 2028.

Sources and notes

  • Google I/O 2026 official post — Sundar Pichai reporting 3.2Q tokens/month, with historical milestones of 9.7T (2024) and ~480T (2025): Google Blog.
  • Business Insider coverage of Google I/O 2026 AI usage numbers: Business Insider.
  • Gemini Developer API pricing used as a public market proxy for token economics, not as Google internal infrastructure cost: Gemini API Pricing.
  • KV-cache calculations are architecture estimates based on the standard transformer KV-cache formula. Actual internal systems vary by model, precision, batching, prefix caching, compression, eviction policy, and hardware topology.
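  • Math verification: 1.23 × 10⁹ × 60 × 128,000 ≈ 9.45 × 10¹⁵ bytes ≈ 9.4 PB in decimal units (1 KB = 1,000 bytes). With 128 KiB = 131,072 bytes the product is ≈ 9.67 × 10¹⁵ bytes, which is 9.67 PB or about 8.6 PiB. The difference is a units convention only.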
  • Prefill/Decode disaggregation references: Zhong et al. "DistServe" (2024), Patel et al. "Splitwise" (2024), and ongoing production deployments across major inference providers.