Memory Architecture · NAND · AI Inference · Emerging Technology

High Bandwidth Flash: The Missing Tier


HBM is fast but small and expensive. NVMe is large but slow. HBF (High Bandwidth Flash) is a NAND-based memory tier designed to sit between them — offering 8–16× the capacity of HBM while staying within 2.2% of its read performance on inference workloads.

By Manish KL · Technical Essay · ~18 min read
Abstract

The "memory wall" in AI inference is not just about bandwidth — it is about capacity. HBM3e on an H100 provides 80 GB at 3.35 TB/s. But a single 405B model at fp16 requires ~810 GB of weight storage alone — before any KV cache is allocated. The industry has been forced to distribute models across multiple GPUs not because of compute limits, but because of memory limits. High Bandwidth Flash (HBF) aims to close this gap by bringing NAND flash physically close to the GPU, using TSV stacking and parallel sub-array architectures to deliver HBM-competitive read bandwidth at NAND cost per bit. First samples arrive H2 2026; commercial deployment targets 2027. This essay examines the physics, the performance, the economics, and the systems implications.

8–16×
capacity advantage over HBM at comparable cost
2.2%
performance gap vs unlimited HBM on inference (simulated)
H2 2026
first HBF samples from SanDisk
2027
commercial AI-inference devices with HBF expected

The gap in the memory hierarchy

The current AI memory hierarchy has a structural gap. HBM provides extreme bandwidth (~3.35 TB/s on H100, ~4.8 TB/s on B200) but is limited to 80–192 GB per GPU and costs $10–25 per GB. NVMe SSDs provide massive capacity (up to 256 TB per drive in 2026) but at only 7–14 GB/s sequential bandwidth — roughly 300× slower than HBM. Host DDR5 sits between them in bandwidth (~200 GB/s per socket) but is limited to 1–2 TB and requires PCIe traversal from the GPU.

The result is a 300× bandwidth cliff between the GPU's fast memory and the first storage tier. For inference workloads that are fundamentally read-heavy (weight loading, KV-cache access), this cliff forces expensive multi-GPU parallelism even when a single GPU has more than enough compute.
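The cliff is easiest to feel as time-to-stream: how long one full pass over a 405B model's fp16 weights would take at each tier's bandwidth. The figures below are the ones quoted in this essay; the HBF number is the middle of the projected range.

```python
# Time to stream the 810 GB of fp16 weights for a 405B model from each tier.
WEIGHT_BYTES = 810e9

tiers_gbps = {          # sustained read bandwidth, GB/s
    "HBM3e": 3350,
    "HBF":   1500,      # projected mid-range
    "DDR5":   200,
    "NVMe":    10,
}

for tier, bw in tiers_gbps.items():
    seconds = WEIGHT_BYTES / (bw * 1e9)
    print(f"{tier:>6}: {seconds:8.2f} s per full weight pass")
```

HBM streams the full model in about a quarter of a second, projected HBF in about half a second, and NVMe in over a minute, which is why NVMe cannot serve as a working tier for inference today.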

[Figure 1 diagram: the memory hierarchy ordered from smaller/faster/costlier to larger/cheaper/slower — HBM3e (80–192 GB, 3.35–4.8 TB/s, $10–25/GB, on-package TSV), HBF ★ (0.5–3 TB, ~1–2 TB/s, $1–3/GB projected, near-package TSV, samples H2 2026), host DDR5 (0.5–2 TB, ~200 GB/s, $2–5/GB, CPU-attached via PCIe), NVMe SSD (4–256 TB, 7–14 GB/s, $0.08–0.15/GB, PCIe block interface).]
Figure 1. The memory hierarchy gap. Between HBM (3.35 TB/s, 80–192 GB) and NVMe (7–14 GB/s, up to 256 TB), there is a 300× bandwidth cliff. HBF is designed to fill this gap with NAND-based capacity at near-HBM read bandwidth.

What HBF actually is

HBF — High Bandwidth Flash — is not a marketing rebrand of faster SSDs. It is a fundamentally different packaging and interface architecture for NAND flash: the dies are TSV-stacked and placed near-package beside the GPU, as HBM is, and the internal arrays are divided into many parallel sub-arrays so reads can be striped across them. The goal is HBM-competitive read bandwidth at NAND cost per bit.

The performance numbers

Published simulations from SanDisk show HBF-equipped systems achieving within 2.2% of the performance of an unlimited-capacity HBM system on read-heavy inference workloads. This is remarkable: it means that for the specific access patterns of inference (sequential weight reads, KV-cache lookups), NAND can approach DRAM performance when the packaging and interface are redesigned.
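A rough feel for why parallel sub-array access is the enabling trick: a single NAND plane reads slowly, so HBM-class bandwidth has to come from thousands of concurrent reads. The page size and array read time below are generic 3D NAND ballpark figures assumed for illustration, not SanDisk's published parameters.

```python
# Back-of-envelope: how much NAND parallelism does ~1.5 TB/s of read
# bandwidth imply?  Per-plane figures are illustrative assumptions.
PAGE_BYTES = 16 * 1024      # typical 3D NAND page size
T_READ_S   = 50e-6          # assumed array read time (tR) per page

per_plane_bw  = PAGE_BYTES / T_READ_S      # ~0.33 GB/s per plane
target_bw     = 1.5e12                     # 1.5 TB/s, mid projection
planes_needed = target_bw / per_plane_bw

print(f"{per_plane_bw / 1e9:.2f} GB/s per plane -> "
      f"~{planes_needed:,.0f} planes reading concurrently")
```

Thousands of concurrently readable planes is exactly what TSV stacking plus sub-array partitioning buys; no conventional single die comes close.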

| Memory Tier | Read Bandwidth | Capacity | Cost/GB | Interface | Latency |
|---|---|---|---|---|---|
| HBM3e | 3.35–4.8 TB/s | 80–192 GB | $10–25 | On-package TSV | ~100 ns |
| HBF (projected) | ~1–2 TB/s | 0.5–3 TB | $1–3 | Near-package TSV | ~1–5 μs |
| CXL DDR5 | 36–72 GB/s | 128 GB–1 TB | $3–6 | CXL 2.0/3.0 | ~200–400 ns |
| NVMe Gen 5 | 7–14 GB/s | 4–256 TB | $0.08–0.15 | PCIe Gen 5 | ~50–100 μs |

Why inference, not training

HBF is explicitly positioned for inference, not training. The reason is write endurance. Training workloads generate enormous write volumes — checkpoint saves, gradient accumulation, optimizer state updates — that would exhaust NAND endurance quickly. Inference workloads are overwhelmingly read-heavy: weight loading is a one-time sequential read, and KV-cache access is read-dominant with limited writes (only when new tokens extend the cache).

This read/write asymmetry is the key insight. NAND flash reads are fundamentally faster than NAND writes — and the performance gap widens with advanced 3D NAND architectures. HBF exploits this by optimizing the entire packaging, controller, and interface for read throughput, accepting lower write performance as a design tradeoff.
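To see the scale of the endurance problem, here is a rough drive-life calculation under loudly assumed numbers: a 1 TB module, 3,000 program/erase cycles, and illustrative sustained write rates for each workload. None of these figures are HBF specifications.

```python
# Why write endurance rules out training: rough drive-life arithmetic.
CAPACITY_B = 1e12        # assumed 1 TB HBF module
PE_CYCLES  = 3_000       # assumed NAND program/erase endurance
lifetime_writes = CAPACITY_B * PE_CYCLES   # total bytes writable over life

workloads_bps = {
    "training (checkpoints + optimizer state)": 50e9,    # ~50 GB/s assumed
    "inference (KV-cache appends only)":        0.01e9,  # ~10 MB/s assumed
}

for name, rate in workloads_bps.items():
    days = lifetime_writes / rate / 86_400
    print(f"{name}: worn out in ~{days:,.1f} days")
```

Under these assumptions a training workload burns through the module's entire write budget in under a day, while read-dominant inference stretches the same budget across years.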

What HBF changes for inference architecture

Model parallelism becomes optional for capacity

Today, a 405B model at fp16 requires at least six H200-class GPUs purely for weight capacity (810 GB ÷ ~141 GB of HBM per GPU), and closer to eleven on 80 GB H100s. With 1–3 TB of HBF per GPU, the same model could fit on a single GPU for weight storage — with tensor parallelism needed only for compute throughput, not memory capacity. This decouples the compute-scaling decision from the memory-capacity decision.
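The capacity arithmetic as a sketch. The 141 GB figure assumes an H200-class part; the 1 TB figure is the essay's HBF projection.

```python
import math

# GPUs needed purely for weight capacity, per tier (compute set aside).
PARAMS     = 405e9
BYTES_FP16 = 2
weight_gb  = PARAMS * BYTES_FP16 / 1e9        # ~810 GB

def gpus_for_capacity(capacity_gb: float) -> int:
    """Minimum device count to hold the weights in one tier."""
    return math.ceil(weight_gb / capacity_gb)

print(gpus_for_capacity(141))    # H200-class HBM -> 6
print(gpus_for_capacity(1000))   # 1 TB of HBF per GPU -> 1
```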

KV-cache pressure drops dramatically

The KV cache for a 1M-token context at fp16 is ~15 GB. With 80 GB HBM, that leaves 65 GB for weights and other state — tight for large models. With 1 TB of HBF, the KV cache can spill into a tier that is ~100× faster than NVMe and sits far closer to the GPU than host DRAM. The latency penalty for a KV-cache miss to HBF (~1–5 μs) is roughly two orders of magnitude lower than a miss to host DRAM via PCIe (~200 μs).
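The per-token footprint behind numbers like this is just 2 (K and V) × layers × kv_heads × head_dim × bytes per element. The model shape below is a hypothetical multi-query-attention configuration chosen to land near the ~15 GB figure, not any specific model.

```python
# KV-cache footprint: 2 (K and V) x tokens x layers x kv_heads x head_dim
# x bytes per element.  Shape is a hypothetical MQA config, fp16 storage.
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes

gb = kv_cache_bytes(tokens=1_000_000, layers=32, kv_heads=1,
                    head_dim=128) / 1e9
print(f"{gb:.0f} GB for a 1M-token context")   # ~16 GB
```

Grouped- or full multi-head attention multiplies the kv_heads term, so large models with long contexts blow well past this figure — which is exactly the pressure HBF relieves.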

The hybrid memory controller

HBF requires a new kind of memory controller — one that manages both DRAM (HBM) and NAND (HBF) tiers transparently. Data placement decisions (which weights stay in HBM, which spill to HBF) must be made by the controller based on access frequency, latency sensitivity, and the current-step's compute requirements. This is essentially the same eviction-and-residency problem that software-level KV-cache policies solve today — but pushed down into hardware.

The architectural shift: HBF doesn't just add capacity — it moves the residency decision from software (vLLM, TensorRT-LLM) to hardware (memory controller). The controller must decide, at microsecond granularity, whether a weight tensor or KV page should be served from HBM or HBF. This is a hardware-level version of the eviction policy problem.
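A software toy of that residency decision, as a minimal sketch: a fixed-size set of hot pages lives in "HBM" under a simple LRU promotion policy, and everything else is served from "HBF". A real hybrid controller would be far more sophisticated (frequency- and latency-aware, in hardware), so this only illustrates the shape of the eviction problem.

```python
from collections import OrderedDict

class TwoTierController:
    """Toy HBM/HBF residency policy: LRU-promoted hot set, capped in size."""

    def __init__(self, hbm_pages: int):
        self.hbm_pages = hbm_pages
        self.hbm = OrderedDict()          # page_id -> None, in LRU order

    def access(self, page_id) -> str:
        """Return the tier that serves this access."""
        if page_id in self.hbm:
            self.hbm.move_to_end(page_id)      # refresh recency on a hit
            return "HBM"
        # Miss: serve from HBF, promote the page, evict the LRU page.
        if len(self.hbm) >= self.hbm_pages:
            self.hbm.popitem(last=False)
        self.hbm[page_id] = None
        return "HBF"

ctrl = TwoTierController(hbm_pages=2)
print([ctrl.access(p) for p in ["w0", "w1", "w0", "w2", "w1"]])
# -> ['HBF', 'HBF', 'HBM', 'HBF', 'HBF']
```

The interesting design question is precisely the one the essay raises: today this loop runs in inference frameworks at page granularity; HBF moves it into the memory controller at microsecond granularity.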

The industry timeline

SanDisk has announced first HBF samples for H2 2026, with commercial AI-inference devices expected in 2027.
What remains uncertain

HBF is not faster HBM. It is cheaper, larger, and optimized for reads. The question is whether "cheaper, larger, and read-optimized" is enough to change the economics of inference — and the early numbers suggest it is.