AI Infrastructure · Storage Architecture · Capacity Planning

The Storage Geometry of a 100,000-GPU Cluster


How much NAND does AI actually need? A full accounting — from per-node E1.S drives to petabyte-scale checkpoint reservoirs, dataset lakes, and the NAND supply pressure that hyperscale AI is creating.

By Manish KL · ~15 min read · Technical Essay

Abstract

When Meta, Google, or xAI builds a 100,000-GPU training cluster, the GPU count makes headlines. The storage behind it does not. But these clusters consume staggering quantities of NAND flash — 1.5+ exabytes of local NVMe alone, plus multi-exabyte networked flash for checkpoints, datasets, and model repositories. This essay provides a bottom-up accounting of flash capacity in a representative hyperscale AI deployment, traces where the capacity actually goes, and examines how AI-driven NAND demand is reshaping the global flash supply chain.

1.5 EB · local NVMe in a 100K-GPU cluster (up to ~60 TB of local flash per compute tray × ~25K trays)
~50 PB · checkpoint storage for a single large training run
$73B+ · global NAND market projected for 2025, with AI as the primary growth driver
321 · NAND layers in SK Hynix's latest QLC dies for high-density AI storage

The bottom-up accounting

Let's build the storage budget for a hypothetical 100,000-GPU cluster based on the DGX GB200 architecture. This is a simplified but representative model of what hyperscale AI deployments look like in 2026.

| Storage Tier | Per Node | Per 1,000 GPUs | Per 100,000 GPUs | NAND Type |
|---|---|---|---|---|
| Local NVMe (hot) | 15.4 TB | 3.9 PB | 385 PB | TLC (high endurance) |
| Local NVMe (warm staging) | ~30 TB (optional) | 7.5 PB | 750 PB | TLC/QLC mixed |
| Networked flash (checkpoints) | shared | ~5 PB | ~500 PB | QLC (capacity-optimized) |
| Dataset lake (flash tier) | shared | ~10 PB | ~1 EB | QLC (read-optimized) |
| Model repository | shared | ~0.5 PB | ~50 PB | QLC |
Total NAND footprint: A 100,000-GPU cluster consumes approximately 2.5–3 exabytes of NAND flash across all tiers. At current enterprise SSD pricing ($0.08–0.15/GB), the flash bill alone is $200–450 million — a figure that is rarely discussed alongside the GPU cost.
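The budget above can be reproduced with a short back-of-envelope script. This is a sketch, not a sizing tool: the tray count assumes four GPUs per GB200 compute tray, and the networked tiers are taken directly from the table rather than derived.

```python
# Back-of-envelope NAND budget for a 100K-GPU cluster.
# Assumption for this sketch: 4 GPUs per GB200 compute tray.

GPUS = 100_000
GPUS_PER_TRAY = 4
TRAYS = GPUS // GPUS_PER_TRAY        # 25,000 compute trays

TB = 1                               # work in terabytes
PB = 1_000 * TB
EB = 1_000 * PB

tiers_tb = {
    "local_nvme_hot":  TRAYS * 15.4,  # 4x 3.84 TB E1.S per tray
    "local_nvme_warm": TRAYS * 30.0,  # optional 4x 7.68 TB staging
    "checkpoints":     500 * PB,
    "dataset_lake":    1 * EB,
    "model_repo":      50 * PB,
}

total_tb = sum(tiers_tb.values())
print(f"total NAND: {total_tb / EB:.2f} EB")

# Flash bill at current enterprise SSD pricing ($/GB)
for price in (0.08, 0.15):
    print(f"${price}/GB -> ${total_tb * 1_000 * price / 1e6:.0f}M")
```

Running this lands in the ~2.7 EB and $215–400M range, consistent with the 2.5–3 EB and $200–450M figures quoted above.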

Where the capacity goes

Local NVMe: the latency-bounded staging layer

Each DGX GB200 compute tray ships with 4× 3.84 TB E1.S NVMe drives — approximately 15.4 TB of local flash. In a 100,000-GPU cluster with ~25,000 compute trays (four GPUs per tray), this totals 385 PB of local NVMe. This capacity serves the latency-sensitive roles: checkpoint absorption, KV-cache offload, weight staging, and dataset hydration.

At scale, operators increasingly add a second tier of warm staging — 4× 7.68 TB or 8× 7.68 TB drives per node — for model swapping and multi-model inference. This can push per-node local flash to 30–60 TB, and total cluster local flash toward 750 PB–1.5 EB.

Checkpoint storage: the write-heavy reservoir

A 405B-parameter model checkpoint with full optimizer state is ~3.2 TB. During a training run that checkpoints every 15 minutes, the cluster generates approximately 13 TB of checkpoint data per hour. Over a 90-day run, that is ~28 PB of cumulative checkpoint writes. Older checkpoints are pruned, however, so the steady-state reservoir for a single run is typically the 5–10 most recent checkpoints (16–32 TB); the multi-hundred-petabyte checkpoint tier is sized for many concurrent runs and longer retention policies.
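The arithmetic behind these figures, as a quick sketch using the checkpoint size and cadence stated above:

```python
# Checkpoint reservoir arithmetic for the 405B run described above.
CKPT_TB = 3.2                 # full optimizer-state checkpoint
INTERVAL_MIN = 15             # checkpoint cadence
RUN_DAYS = 90

per_hour_tb = CKPT_TB * (60 / INTERVAL_MIN)          # ~13 TB/hour
cumulative_pb = per_hour_tb * 24 * RUN_DAYS / 1_000  # ~28 PB over the run

print(f"{per_hour_tb:.1f} TB/hour of checkpoint writes")
print(f"{cumulative_pb:.1f} PB cumulative over {RUN_DAYS} days")
print(f"steady-state reservoir: {5 * CKPT_TB:.0f}-{10 * CKPT_TB:.0f} TB "
      f"(5-10 retained checkpoints)")
```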

Checkpoint storage is typically served by a high-performance parallel file system (VAST, Weka, GPFS) backed by all-flash NVMe arrays. These systems require high sequential write bandwidth (to absorb checkpoint bursts) and high sequential read bandwidth (to recover from failures). The NAND underneath is typically QLC for cost efficiency, with TLC caching for write burst absorption.
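One way to make "high sequential write bandwidth" concrete is to size the aggregate write rate needed to absorb a checkpoint burst. The 60-second absorption budget and the 2 GB/s per-storage-node figure below are illustrative assumptions for this sketch, not vendor specs.

```python
# Sizing aggregate write bandwidth for checkpoint burst absorption.
# Assumption: the 3.2 TB checkpoint must land within 60 s so GPUs
# spend well under 10% of each 15-minute interval stalled on I/O.
CKPT_TB = 3.2
BUDGET_S = 60
PER_NODE_GB_S = 2.0                         # assumed per storage node

needed_gb_s = CKPT_TB * 1_000 / BUDGET_S    # aggregate write rate
nodes = -(-needed_gb_s // PER_NODE_GB_S)    # ceiling division

print(f"{needed_gb_s:.0f} GB/s aggregate write bandwidth")
print(f">= {nodes:.0f} storage nodes at {PER_NODE_GB_S} GB/s each")
```

A TLC write cache sized for one or two checkpoint bursts lets the QLC capacity tier drain at a much lower sustained rate, which is exactly the TLC-cache-over-QLC split described above.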

The dataset lake

Training datasets for frontier models range from 10–100+ TB of tokenized data. Multi-modal training (text + images + video + code) pushes this toward petabyte scale. The flash tier of the dataset lake serves hot partitions — the data being actively consumed by the current training epoch — while cold partitions reside on object storage (S3, GCS) or tape.

[Figure: storage stack of a 100K-GPU cluster (total NAND footprint ~2.5–3 EB) — Local NVMe (per-node): 385 PB–1.5 EB, TLC, E1.S/E3.S, 15–60 TB/node · Checkpoint reservoir: ~500 PB, QLC + TLC cache, parallel FS · Dataset lake (flash tier): ~1 EB, QLC, read-optimized, NVMe-oF · Model repo: ~50 PB, versioned weights]

The numbers at scale:
- Total NAND: ~2.5–3 EB
- Flash cost: $200–450M
- Daily checkpoint writes (training): ~300 TB/day
- Drive replacements: ~2,000–5,000/year
- Storage power draw: ~2–5 MW
- NAND type mix: ~30% TLC / ~70% QLC
- Flash as % of total TCO: ~3–8%
Figure 1. Storage stack of a 100K-GPU AI cluster. Local NVMe dominates capacity by count; the dataset lake dominates by volume. Total NAND footprint approaches 3 exabytes.

The NAND supply pressure

A single 100,000-GPU cluster consuming 2.5–3 EB of NAND represents a significant fraction of global NAND production. The industry produced approximately 350–400 EB of raw capacity in 2025, with enterprise SSDs accounting for about 30–40% of shipments. Three or four hyperscale AI clusters of this size would consume 7.5–12 EB — roughly 5–11% of global enterprise NAND supply, depending on where in those ranges the actual figures fall.
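A quick sketch of that supply-share arithmetic, using the ranges above (endpoint-to-endpoint, so the resulting share range is deliberately wide):

```python
# Share of global enterprise NAND consumed by hyperscale AI clusters.
cluster_eb = (2.5, 3.0)            # NAND per 100K-GPU cluster
clusters = (3, 4)                  # number of such clusters
global_eb = (350, 400)             # global raw NAND output, 2025
enterprise_share = (0.30, 0.40)    # enterprise SSD share of output

low = clusters[0] * cluster_eb[0]            # 7.5 EB
high = clusters[1] * cluster_eb[1]           # 12 EB
ent_low = global_eb[0] * enterprise_share[0]   # ~105 EB
ent_high = global_eb[1] * enterprise_share[1]  # ~160 EB

print(f"AI cluster demand: {low:.1f}-{high:.1f} EB")
print(f"share of enterprise NAND: "
      f"{low / ent_high * 100:.0f}%-{high / ent_low * 100:.0f}%")
```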

This concentration of demand is already reshaping the NAND market. SK Hynix, Samsung, and Kioxia have shifted manufacturing priorities toward high-density enterprise QLC (321-layer and beyond) specifically to serve AI capacity needs. The enterprise SSD product lines are bifurcating into "AI-optimized" (read-heavy, high-IOPS, moderate endurance) and "general enterprise" (balanced read/write, standard endurance) — a segmentation driven entirely by the distinctive access patterns of AI workloads.

The form factor evolution

AI cluster storage is driving a form factor transition from the legacy 2.5" U.2 and M.2 drives toward the E1.S and E3.S EDSFF (Enterprise and Data Center SSD Form Factor) standards:

| Form Factor | Typical Capacity | Thermal Design | Key Advantage for AI |
|---|---|---|---|
| M.2 2280 | 1–4 TB | Passive (board-mounted) | Boot drives only; too slow for data path |
| U.2 (2.5") | 4–30 TB | Air-cooled chassis | Legacy; used in DGX H100. High capacity but poor thermal density |
| E1.S | 4–16 TB | Hot-swap, direct airflow | DGX GB200 standard. Excellent thermal management, hot-swappable |
| E3.S | 8–64 TB | Direct-attach, liquid-cooling compatible | Next-gen AI: higher capacity, designed for 50–100 kW rack densities |

The transition to E1.S and E3.S is driven by thermal necessity: AI racks at 50–100 kW density cannot provide sufficient airflow for 2.5" drives deep inside the chassis. E1.S and E3.S drives are designed with direct-airflow or conduction-cooling paths that maintain NAND operating temperatures below 85°C even in high-density, liquid-cooled configurations.

The 256 TB drive and what it means

Samsung's PM1763 — a PCIe Gen 5 E3.S drive shipping in 2026 — offers up to 256 TB in a single drive, with 512 TB Gen 6 variants planned for 2027. SK Hynix is demonstrating 245 TB eSSD prototypes using 321-layer QLC NAND. These capacities were unthinkable five years ago and are being driven almost entirely by AI demand.

A single 256 TB drive can hold the fp16 weights of a 70B model (~140 GB) more than 1,800 times over, or roughly 80 full 3.2 TB checkpoints from a 405B training run. At scale, this means a checkpoint reservoir that previously required 100+ drives can be served by a handful. The implications for rack density, cabling, and failure domains are significant.

The concentration risk: a single 256 TB drive failure removes 256 TB of redundancy at a stroke. Rebuilding a 256 TB drive at NVMe speeds (~7 GB/s) takes approximately 10 hours — during which the array is exposed to a second failure. AI storage architects must balance capacity per drive against rebuild time and failure-domain isolation.
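The rebuild-time exposure generalizes across drive capacities. A minimal sketch, assuming a sustained 7 GB/s rebuild rate and that rebuild bandwidth is the bottleneck:

```python
# Rebuild-window exposure vs. drive capacity (sketch; assumes a
# sustained 7 GB/s rebuild rate is the limiting factor).
def rebuild_hours(capacity_tb: float, rate_gb_s: float = 7.0) -> float:
    """Hours to fully rewrite one drive's capacity at the given rate."""
    return capacity_tb * 1_000 / rate_gb_s / 3_600

for cap in (30, 64, 256, 512):
    print(f"{cap:>4} TB drive: ~{rebuild_hours(cap):.1f} h rebuild window")
```

The exposure window scales linearly with capacity unless rebuild bandwidth scales with it, which is why declustered parity and smaller failure domains become more attractive as per-drive capacity climbs.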

Where this is heading

By 2028, a 500,000-GPU "frontier cluster" will likely require 10–15 EB of total NAND capacity — approaching 3–4% of projected global annual NAND production. This creates genuine supply-chain exposure: a single hyperscaler's cluster build can measurably tighten the global NAND market, driving price increases that ripple through consumer electronics, enterprise IT, and automotive sectors.

The storage industry's response is threefold: higher-density NAND (300+ layer V-NAND with QLC and PLC/5-bit-per-cell), AI-optimized SSD controllers (100M+ IOPS, GPUDirect Storage support, direct GPU-to-flash data paths), and new memory tiers (HBF, CXL-attached flash) that blur the line between storage and memory. The GPU gets the headlines, but the NAND gets the purchase orders.

AI clusters are the largest concentrated consumers of NAND flash in history. The GPU determines what work gets done. The flash determines whether the GPU stays fed. Increasingly, the binding constraint is the SSD.