The five roles of NAND in an AI cluster
NAND flash is not "storage" in the traditional sense of a place to park files. In an AI cluster, it plays five distinct roles — each with different performance requirements, access patterns, and failure consequences.
Role 1: Checkpoint absorption
Training large models requires frequent checkpoints — saving the full model state (weights, optimizer state, learning rate schedules) to durable storage. For a 405B parameter model at fp16, a single checkpoint is approximately 810 GB. At fp32 optimizer state, the full checkpoint can exceed 3.2 TB.
The critical metric is not capacity — it is absorption bandwidth. The checkpoint must be written fast enough that the training loop does not stall waiting for storage. On a DGX GB200 compute tray with 4× 3.84 TB E1.S NVMe drives in RAID-0, the sustained sequential write bandwidth is approximately 14 GB/s. A 3.2 TB checkpoint takes ~230 seconds to dump to local flash — during which the GPUs can continue training asynchronously if the checkpoint pipeline is properly staged.
If local flash is too slow, the checkpoint blocks the training loop, and every second of GPU idle time at scale costs real money. On a 1,024-GPU H100 cluster billed at typical on-demand rates of $2.50–$10 per GPU-hour, one minute of checkpoint stall wastes roughly $40–170 of compute. The SSD's write bandwidth directly determines this checkpoint tax.
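The arithmetic behind the checkpoint tax is simple bandwidth division. A back-of-envelope sketch, using the figures above; the GPU price is an illustrative assumption, not a quoted rate:

```python
# Back-of-envelope checkpoint economics. Rates and prices are
# illustrative assumptions, not measured or quoted values.
TB, GB = 1e12, 1e9

def checkpoint_seconds(checkpoint_bytes: float, write_bw_bytes_per_s: float) -> float:
    """Time to dump a checkpoint at sustained sequential write bandwidth."""
    return checkpoint_bytes / write_bw_bytes_per_s

t = checkpoint_seconds(3.2 * TB, 14 * GB)  # 405B fp32-optimizer checkpoint, RAID-0 E1.S
print(f"dump time: {t:.0f} s")             # dump time: 229 s

# If the dump were fully synchronous, the stall cost scales with fleet size.
gpus, usd_per_gpu_hour = 1024, 2.50        # assumed on-demand H100 rate
print(f"synchronous stall cost: ${t / 3600 * gpus * usd_per_gpu_hour:,.0f}")
```

Asynchronous staging (snapshot to host memory, then drain to flash in the background) hides most of this time, but only if the drain finishes before the next checkpoint interval begins.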
Role 2: KV-cache offloading
During inference, the KV cache grows linearly with sequence length. A single 1M-token context at fp16 can require approximately 15 GB of KV memory per request, and that memory competes with the model weights already resident in HBM, so a handful of concurrent long-context requests exhausts the accelerator's capacity. When HBM is exhausted, KV pages must be offloaded to a slower tier.
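The per-token KV footprint follows directly from the model's attention shape. A sketch of the arithmetic (the model dimensions are hypothetical; the single-KV-head MQA-style configuration is chosen to roughly match the ~15 GB figure above):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # K and V each store kv_heads * head_dim values per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical 32-layer MQA model: 1 shared KV head, head_dim 128, fp16.
per_token = kv_bytes_per_token(layers=32, kv_heads=1, head_dim=128)
print(f"{per_token} B/token -> {per_token * 1_000_000 / 1e9:.0f} GB per 1M-token context")
# 16384 B/token -> 16 GB per 1M-token context
```

A GQA model with 8 KV heads needs 8× as much per token, which is why the attention architecture, not just the context length, determines offload pressure.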
Host DRAM is the first offload target (~64 GB/s per direction over a PCIe 5.0 x16 link). But host DRAM is limited to 0.5–2 TB per node and is shared with the CPU's own workload. Local NVMe becomes the second tier: slower (~7 GB/s sequential, lower for random), but with 15–60 TB of capacity — enough to hold KV caches for hundreds of concurrent long-context requests.
The access pattern for KV offload is fundamentally different from checkpointing. Checkpoints are large sequential writes. KV offload is page-granularity random I/O: individual 4K–64K pages promoted on demand, demoted when capacity pressure exceeds a threshold. This is an IOPS-limited workload, not a bandwidth-limited one. The SSD's random 4K read IOPS — not its sequential bandwidth — determines KV-offload latency.
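To see why IOPS dominates, compare the time to promote the same KV bytes under an IOPS bound versus a pure bandwidth bound. A hedged sketch; the drive figures (1M random-4K read IOPS, 14 GB/s sequential) are illustrative, not a specific product's spec:

```python
KB, GB = 1024, 1e9

def promote_time_s(kv_bytes: int, page_bytes: int,
                   rand_read_iops: float, seq_bw_bytes_per_s: float):
    # IOPS bound: each page is an independent random read.
    pages = -(-kv_bytes // page_bytes)  # ceiling division
    iops_bound = pages / rand_read_iops
    # Bandwidth bound: the same bytes as one ideal sequential read.
    bw_bound = kv_bytes / seq_bw_bytes_per_s
    return iops_bound, bw_bound

# Promote a 64 MiB KV segment as 4K pages (assumed drive figures).
iops_t, bw_t = promote_time_s(64 * 1024 * KB, 4 * KB, 1_000_000, 14 * GB)
print(f"IOPS-bound: {iops_t * 1e3:.1f} ms vs bandwidth-bound: {bw_t * 1e3:.1f} ms")
# IOPS-bound: 16.4 ms vs bandwidth-bound: 4.8 ms
```

The same bytes take several times longer when fetched as random pages, so the drive's random-read IOPS rating is the number that matters on the spec sheet.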
Role 3: Weight staging and model switching
Inference fleets serve multiple models. Switching between models requires loading new weights into HBM — a process that takes seconds to minutes depending on model size and storage bandwidth. Local NVMe serves as a weight staging area: the next model's weights are pre-fetched from networked storage to local flash, so that the model switch can proceed at NVMe speed rather than network speed.
For a 70B model at fp16 (~140 GB), loading from local NVMe at 7 GB/s takes ~20 seconds. Loading from a 25 Gbps network share would take ~45 seconds. The NVMe staging layer cuts model-switch latency by more than half — and for LoRA adapter hot-swapping (where the adapter is 0.5–2 GB), the switch is nearly instantaneous from local flash.
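The switch-latency comparison above reduces to bandwidth division. A sketch using the text's rates, ignoring protocol and filesystem overheads:

```python
GB = 1e9
GBIT = 1e9 / 8  # bytes per second per Gbps

def load_seconds(model_bytes: float, bw_bytes_per_s: float) -> float:
    """Time to stage a weight blob at a given read bandwidth."""
    return model_bytes / bw_bytes_per_s

weights = 140 * GB                            # 70B model at fp16
from_nvme = load_seconds(weights, 7 * GB)     # local flash
from_net = load_seconds(weights, 25 * GBIT)   # 25 Gbps network share
print(f"NVMe: {from_nvme:.0f} s, network: {from_net:.0f} s")
# NVMe: 20 s, network: 45 s
```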
Role 4: Dataset hydration
Training clusters begin each job by loading datasets from networked storage. For large training runs (multi-TB datasets), this "hydration" phase can take hours if the data must be streamed over the network. Local NVMe serves as a dataset cache: the first epoch reads from the network and writes to local flash simultaneously, so subsequent epochs read entirely from local NVMe at sequential bandwidth.
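The hydration pattern is a read-through cache. A minimal sketch; the function name, paths, and layout are illustrative, not any framework's actual API:

```python
import os
import shutil

def hydrated_open(remote_path: str, cache_dir: str):
    """Read-through dataset cache: the first access copies the shard from
    networked storage to local NVMe; every later access reads local flash.
    Illustrative sketch, not a specific data loader's API."""
    local_path = os.path.join(cache_dir, os.path.basename(remote_path))
    if not os.path.exists(local_path):
        # Epoch 1: one network read that simultaneously populates the cache.
        shutil.copyfile(remote_path, local_path)
    # Epochs 2+: sequential reads served entirely from local NVMe.
    return open(local_path, "rb")
```

Real implementations add eviction under capacity pressure and checksum validation, but the core contract is the same: pay the network cost once per shard, not once per epoch.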
What current hardware looks like
| Platform | Local NVMe Config | Total Capacity | Form Factor | Sequential Write BW |
|---|---|---|---|---|
| DGX GB200 (compute tray) | 4× 3.84 TB E1.S | 15.4 TB | E1.S (hot-swap) | ~14 GB/s (RAID-0) |
| DGX H100 | 8× 3.84 TB U.2 | 30.7 TB | U.2 NVMe | ~28 GB/s (RAID-0) |
| Custom cloud (2026) | 4–8× 7.68 TB E3.S | 30–60 TB | E3.S (liquid-cooled) | ~28–56 GB/s |
| Samsung PM1763 (2026) | Single drive | Up to 256 TB (Gen 5) | E3.S | ~14 GB/s |
The endurance problem
AI workloads are unusually harsh on NAND endurance. Checkpoint saves generate sustained sequential writes at maximum bandwidth. KV offloading generates random writes at page granularity. Activation spilling creates high-frequency small writes. A single DGX node running continuous training can write 50–100 TB/day to its local flash — hitting the drive's rated endurance (typically 1–3 DWPD for enterprise TLC) within months rather than years.
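The arithmetic behind "months rather than years," as a sketch; the capacity, rating, and write-rate figures are assumptions drawn from the ranges above:

```python
def days_to_rated_endurance(capacity_tb: float, dwpd: float,
                            warranty_years: float, daily_writes_tb: float) -> float:
    # Rated total writes (TBW) = capacity * DWPD * warranty days.
    rated_tbw = capacity_tb * dwpd * warranty_years * 365
    return rated_tbw / daily_writes_tb

# Assumed: a GB200-tray-like 15.4 TB of flash rated 1 DWPD over 5 years,
# absorbing 100 TB/day of checkpoint and offload traffic.
days = days_to_rated_endurance(15.4, 1.0, 5, 100)
print(f"{days:.0f} days (~{days / 30:.0f} months) to exhaust rated endurance")
# 281 days (~9 months) to exhaust rated endurance
```

A drive warranted for five years of "normal" enterprise use burns through its rated writes in under a year at this duty cycle.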
This is why AI-optimized SSDs increasingly use high-endurance TLC or even SLC-mode caching for the write-hot partition, while using standard TLC or QLC for the read-heavy weight-staging and dataset partitions. The workload is not "enterprise mixed" — it is a bimodal mix of extremely write-heavy (checkpoints) and extremely read-heavy (KV offload, weight loading) access patterns that conventional SSD firmware is not optimized for.
Why "just use the network" doesn't work
The obvious question is: why not skip local flash entirely and use a high-performance networked file system (VAST, Weka, GPFS) for everything? The answer is latency variance. Networked storage adds switch hops, congestion, and tail-latency uncertainty that local NVMe does not. For checkpoint absorption, network storage must compete with training data traffic on the same fabric. For KV offload, the 200–500 μs round-trip to networked flash (vs. 50–100 μs to local NVMe) directly increases token generation latency.
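The effect of that round-trip gap on decode latency can be sketched directly. The model below assumes each promoted page costs one serialized round trip to the KV tier; the decode time and page count are hypothetical, and the RTTs are midpoints of the ranges above:

```python
def decode_step_ms(base_ms: float, pages_fetched: int, rtt_us: float) -> float:
    # Assumes each promoted KV page adds one serialized round trip
    # to the decode step (illustrative worst case, no prefetch overlap).
    return base_ms + pages_fetched * rtt_us / 1000

# Assumed: 25 ms/token decode, 4 KV pages promoted per step.
local = decode_step_ms(25, 4, 75)     # local NVMe, ~75 us RTT
remote = decode_step_ms(25, 4, 350)   # networked flash, ~350 us RTT
print(f"local: {local:.1f} ms/token, networked: {remote:.1f} ms/token")
# local: 25.3 ms/token, networked: 26.4 ms/token
```

The mean gap is modest, but the tail is not: a single congested switch hop can multiply the networked RTT, and token generation stalls on the slowest fetch.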
Local flash is not a replacement for networked storage — it is a latency-bounded staging layer that absorbs the spiky, latency-sensitive I/O that the network cannot reliably serve under contention.
The GPU cluster's effective utilization is not determined by peak FLOPS. It is determined by the slowest data path — and increasingly, that path runs through the SSD.