Memory Architecture · NAND · AI Inference · Emerging Technology

High Bandwidth Flash: The Missing Tier


HBM is fast but small and expensive. NVMe is large but slow. HBF (High Bandwidth Flash) is a NAND-based memory tier designed to sit between them — offering 8–16× the capacity of HBM while staying within 2.2% of its read performance on inference workloads.

By Manish KL · Technical Essay · ~18 min read
Abstract

The "memory wall" in AI inference is not just about bandwidth — it is about capacity. HBM3e on an H100 provides 80 GB at 3.35 TB/s. But a single 405B model at fp16 requires ~810 GB of weight storage alone — before any KV cache is allocated. The industry has been forced to distribute models across multiple GPUs not because of compute limits, but because of memory limits. High Bandwidth Flash (HBF) aims to close this gap by bringing NAND flash physically close to the GPU, using TSV stacking and parallel sub-array architectures to deliver HBM-competitive read bandwidth at NAND cost per bit. First samples arrive H2 2026; commercial deployment targets 2027. This essay examines the physics, the performance, the economics, and the systems implications.

8–16×
capacity advantage over HBM at comparable cost
2.2%
performance gap vs unlimited HBM on inference (simulated)
H2 2026
first HBF samples from SanDisk
2027
commercial AI-inference devices with HBF expected

The gap in the memory hierarchy

The current AI memory hierarchy has a structural gap. HBM provides extreme bandwidth (~3.35 TB/s on H100, ~4.8 TB/s on B200) but is limited to 80–192 GB per GPU and costs $10–25 per GB. NVMe SSDs provide massive capacity (up to 256 TB per drive in 2026) but at only 7–14 GB/s sequential bandwidth — roughly 300× slower than HBM. Host DDR5 sits between them in bandwidth (~200 GB/s per socket) but is limited to 1–2 TB and requires PCIe traversal from the GPU.

The result is a 300× bandwidth cliff between the GPU's fast memory and the first storage tier. For inference workloads that are fundamentally read-heavy (weight loading, KV-cache access), this cliff forces expensive multi-GPU parallelism even when a single GPU has more than enough compute.
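The cliff is easiest to feel as time-to-stream: how long one full pass over a 405B model's fp16 weights would take at each tier's bandwidth. The figures below are the ones quoted in this essay; the HBF number is the middle of the projected range.

```python
# Time to stream the 810 GB of fp16 weights for a 405B model from each tier.
WEIGHT_BYTES = 810e9

tiers_gbps = {          # sustained read bandwidth, GB/s
    "HBM3e": 3350,
    "HBF":   1500,      # projected mid-range
    "DDR5":   200,
    "NVMe":    10,
}

for tier, bw in tiers_gbps.items():
    seconds = WEIGHT_BYTES / (bw * 1e9)
    print(f"{tier:>6}: {seconds:8.2f} s per full weight pass")
```

HBM streams the full model in about a quarter of a second, projected HBF in about half a second, and NVMe in over a minute, which is why NVMe cannot serve as a working tier for inference today.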

[Figure 1 diagram: the memory hierarchy ordered from smaller/faster/costlier to larger/cheaper/slower — HBM3e (80–192 GB, 3.35–4.8 TB/s, $10–25/GB, on-package TSV), HBF ★ (0.5–3 TB, ~1–2 TB/s, $1–3/GB projected, near-package TSV, samples H2 2026), host DDR5 (0.5–2 TB, ~200 GB/s, $2–5/GB, CPU-attached via PCIe), NVMe SSD (4–256 TB, 7–14 GB/s, $0.08–0.15/GB, PCIe block interface).]
Figure 1. The memory hierarchy gap. Between HBM (3.35 TB/s, 80–192 GB) and NVMe (7–14 GB/s, up to 256 TB), there is a 300× bandwidth cliff. HBF is designed to fill this gap with NAND-based capacity at near-HBM read bandwidth.

What HBF actually is

HBF — High Bandwidth Flash — is not a marketing rebrand of faster SSDs. It is a fundamentally different packaging and interface architecture for NAND flash: the dies are TSV-stacked and placed near-package beside the GPU, as HBM is, and the internal arrays are divided into many parallel sub-arrays so reads can be striped across them. The goal is HBM-competitive read bandwidth at NAND cost per bit.

The performance numbers

Published simulations from SanDisk show HBF-equipped systems achieving within 2.2% of the performance of an unlimited-capacity HBM system on read-heavy inference workloads. This is remarkable: it means that for the specific access patterns of inference (sequential weight reads, KV-cache lookups), NAND can approach DRAM performance when the packaging and interface are redesigned.
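A rough feel for why parallel sub-array access is the enabling trick: a single NAND plane reads slowly, so HBM-class bandwidth has to come from thousands of concurrent reads. The page size and array read time below are generic 3D NAND ballpark figures assumed for illustration, not SanDisk's published parameters.

```python
# Back-of-envelope: how much NAND parallelism does ~1.5 TB/s of read
# bandwidth imply?  Per-plane figures are illustrative assumptions.
PAGE_BYTES = 16 * 1024      # typical 3D NAND page size
T_READ_S   = 50e-6          # assumed array read time (tR) per page

per_plane_bw  = PAGE_BYTES / T_READ_S      # ~0.33 GB/s per plane
target_bw     = 1.5e12                     # 1.5 TB/s, mid projection
planes_needed = target_bw / per_plane_bw

print(f"{per_plane_bw / 1e9:.2f} GB/s per plane -> "
      f"~{planes_needed:,.0f} planes reading concurrently")
```

Thousands of concurrently readable planes is exactly what TSV stacking plus sub-array partitioning buys; no conventional single die comes close.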

| Memory Tier | Read Bandwidth | Capacity | Cost/GB | Interface | Latency |
|---|---|---|---|---|---|
| HBM3e | 3.35–4.8 TB/s | 80–192 GB | $10–25 | On-package TSV | ~100 ns |
| HBF (projected) | ~1–2 TB/s | 0.5–3 TB | $1–3 | Near-package TSV | ~1–5 μs |
| CXL DDR5 | 36–72 GB/s | 128 GB–1 TB | $3–6 | CXL 2.0/3.0 | ~200–400 ns |
| NVMe Gen 5 | 7–14 GB/s | 4–256 TB | $0.08–0.15 | PCIe Gen 5 | ~50–100 μs |

Why inference, not training

HBF is explicitly positioned for inference, not training. The reason is write endurance. Training workloads generate enormous write volumes — checkpoint saves, gradient accumulation, optimizer state updates — that would exhaust NAND endurance quickly. Inference workloads are overwhelmingly read-heavy: weight loading is a one-time sequential read, and KV-cache access is read-dominant with limited writes (only when new tokens extend the cache).

This read/write asymmetry is the key insight. NAND flash reads are fundamentally faster than NAND writes — and the performance gap widens with advanced 3D NAND architectures. HBF exploits this by optimizing the entire packaging, controller, and interface for read throughput, accepting lower write performance as a design tradeoff.
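To see the scale of the endurance problem, here is a rough drive-life calculation under loudly assumed numbers: a 1 TB module, 3,000 program/erase cycles, and illustrative sustained write rates for each workload. None of these figures are HBF specifications.

```python
# Why write endurance rules out training: rough drive-life arithmetic.
CAPACITY_B = 1e12        # assumed 1 TB HBF module
PE_CYCLES  = 3_000       # assumed NAND program/erase endurance
lifetime_writes = CAPACITY_B * PE_CYCLES   # total bytes writable over life

workloads_bps = {
    "training (checkpoints + optimizer state)": 50e9,    # ~50 GB/s assumed
    "inference (KV-cache appends only)":        0.01e9,  # ~10 MB/s assumed
}

for name, rate in workloads_bps.items():
    days = lifetime_writes / rate / 86_400
    print(f"{name}: worn out in ~{days:,.1f} days")
```

Under these assumptions a training workload burns through the module's entire write budget in under a day, while read-dominant inference stretches the same budget across years.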

What HBF changes for inference architecture

Model parallelism becomes optional for capacity

Today, a 405B model at fp16 requires at least six H200-class GPUs purely for weight capacity (810 GB ÷ ~141 GB of HBM per GPU), and closer to eleven on 80 GB H100s. With 1–3 TB of HBF per GPU, the same model could fit on a single GPU for weight storage — with tensor parallelism needed only for compute throughput, not memory capacity. This decouples the compute-scaling decision from the memory-capacity decision.
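The capacity arithmetic as a sketch. The 141 GB figure assumes an H200-class part; the 1 TB figure is the essay's HBF projection.

```python
import math

# GPUs needed purely for weight capacity, per tier (compute set aside).
PARAMS     = 405e9
BYTES_FP16 = 2
weight_gb  = PARAMS * BYTES_FP16 / 1e9        # ~810 GB

def gpus_for_capacity(capacity_gb: float) -> int:
    """Minimum device count to hold the weights in one tier."""
    return math.ceil(weight_gb / capacity_gb)

print(gpus_for_capacity(141))    # H200-class HBM -> 6
print(gpus_for_capacity(1000))   # 1 TB of HBF per GPU -> 1
```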

KV-cache pressure drops dramatically

The KV cache for a 1M-token context at fp16 is ~15 GB. With 80 GB HBM, that leaves 65 GB for weights and other state — tight for large models. With 1 TB of HBF, the KV cache can spill into a tier that is ~100× faster than NVMe and sits far closer to the GPU than host DRAM. The latency penalty for a KV-cache miss to HBF (~1–5 μs) is roughly two orders of magnitude lower than a miss to host DRAM via PCIe (~200 μs).
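The per-token footprint behind numbers like this is just 2 (K and V) × layers × kv_heads × head_dim × bytes per element. The model shape below is a hypothetical multi-query-attention configuration chosen to land near the ~15 GB figure, not any specific model.

```python
# KV-cache footprint: 2 (K and V) x tokens x layers x kv_heads x head_dim
# x bytes per element.  Shape is a hypothetical MQA config, fp16 storage.
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes

gb = kv_cache_bytes(tokens=1_000_000, layers=32, kv_heads=1,
                    head_dim=128) / 1e9
print(f"{gb:.0f} GB for a 1M-token context")   # ~16 GB
```

Grouped- or full multi-head attention multiplies the kv_heads term, so large models with long contexts blow well past this figure — which is exactly the pressure HBF relieves.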

The hybrid memory controller

HBF requires a new kind of memory controller — one that manages both DRAM (HBM) and NAND (HBF) tiers transparently. Data placement decisions (which weights stay in HBM, which spill to HBF) must be made by the controller based on access frequency, latency sensitivity, and the current-step's compute requirements. This is essentially the same eviction-and-residency problem that software-level KV-cache policies solve today — but pushed down into hardware.

The architectural shift: HBF doesn't just add capacity — it moves the residency decision from software (vLLM, TensorRT-LLM) to hardware (memory controller). The controller must decide, at microsecond granularity, whether a weight tensor or KV page should be served from HBM or HBF. This is a hardware-level version of the eviction policy problem.
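A software toy of that residency decision, as a minimal sketch: a fixed-size set of hot pages lives in "HBM" under a simple LRU promotion policy, and everything else is served from "HBF". A real hybrid controller would be far more sophisticated (frequency- and latency-aware, in hardware), so this only illustrates the shape of the eviction problem.

```python
from collections import OrderedDict

class TwoTierController:
    """Toy HBM/HBF residency policy: LRU-promoted hot set, capped in size."""

    def __init__(self, hbm_pages: int):
        self.hbm_pages = hbm_pages
        self.hbm = OrderedDict()          # page_id -> None, in LRU order

    def access(self, page_id) -> str:
        """Return the tier that serves this access."""
        if page_id in self.hbm:
            self.hbm.move_to_end(page_id)      # refresh recency on a hit
            return "HBM"
        # Miss: serve from HBF, promote the page, evict the LRU page.
        if len(self.hbm) >= self.hbm_pages:
            self.hbm.popitem(last=False)
        self.hbm[page_id] = None
        return "HBF"

ctrl = TwoTierController(hbm_pages=2)
print([ctrl.access(p) for p in ["w0", "w1", "w0", "w2", "w1"]])
# -> ['HBF', 'HBF', 'HBM', 'HBF', 'HBF']
```

The interesting design question is precisely the one the essay raises: today this loop runs in inference frameworks at page granularity; HBF moves it into the memory controller at microsecond granularity.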

The industry timeline

SanDisk has announced first HBF samples for H2 2026, with commercial AI-inference devices expected in 2027.
What remains uncertain

HBF is not faster HBM. It is cheaper, larger, and optimized for reads. The question is whether "cheaper, larger, and read-optimized" is enough to change the economics of inference — and the early numbers suggest it is.