The gap in the memory hierarchy
The current AI memory hierarchy has a structural gap. HBM provides extreme bandwidth (~3.35 TB/s on H100, ~8 TB/s on B200) but is limited to 80–192 GB per GPU and costs $10–25 per GB. NVMe SSDs provide massive capacity (up to 256 TB per drive in 2026) but only 7–14 GB/s of sequential bandwidth — roughly 300× slower than HBM. Host DDR5 sits between them in bandwidth (~200 GB/s per socket) but is limited to 1–2 TB and requires PCIe traversal from the GPU.
The result is a 300× bandwidth cliff between the GPU's fast memory and the first storage tier. For inference workloads that are fundamentally read-heavy (weight loading, KV-cache access), this cliff forces expensive multi-GPU parallelism even when a single GPU has more than enough compute.
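To make the cliff concrete, here is a back-of-envelope sketch of how long one full pass over a large model's weights would take at each tier, using the bandwidth figures quoted above (the 810 GB weight size, i.e. a 405B-parameter model at fp16, is an illustrative assumption):

```python
# Time to stream a model's weights once through each memory tier.
# Bandwidths are the figures quoted in the text (GB/s); NVMe uses the
# midpoint of the 7-14 GB/s range.
WEIGHTS_GB = 810  # e.g. a 405B-parameter model at fp16

tiers_gb_per_s = {
    "HBM3 (H100)": 3350,
    "Host DDR5 (per socket)": 200,
    "NVMe Gen 5": 10,
}

for name, bw in tiers_gb_per_s.items():
    print(f"{name:24s} {WEIGHTS_GB / bw:8.2f} s per full weight pass")
```

At HBM speed a full weight pass takes a fraction of a second; at NVMe speed the same pass takes over a minute, which is the cliff the rest of this section is about.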
What HBF actually is
HBF — High Bandwidth Flash — is not a marketing rebrand of faster SSDs. It is a fundamentally different packaging and interface architecture for NAND flash:
- 3D NAND die stacking with TSVs: Similar to how HBM stacks DRAM dies with through-silicon vias, HBF stacks 3D NAND dies into a high-density package. The vertical interconnects eliminate the narrow I/O bottleneck of conventional NAND packages.
- Parallel sub-array architecture: Instead of accessing one page at a time through a narrow bus (as in conventional NAND), HBF activates multiple sub-arrays simultaneously, delivering wide parallel reads that approach DRAM-class bandwidth.
- Near-package or on-interposer placement: Like HBM, HBF is designed to sit physically close to the GPU — on the same interposer or package substrate — rather than across a PCIe link. This eliminates the PCIe serialization overhead that limits conventional NVMe bandwidth.
- Read-optimized: HBF is explicitly designed for read-heavy AI inference workloads. Write endurance and latency are secondary to read bandwidth and read latency — a fundamentally different design tradeoff than general-purpose enterprise SSDs.
The performance numbers
Published simulations from SanDisk show HBF-equipped systems achieving within 2.2% of the performance of an unlimited-capacity HBM system on read-heavy inference workloads. This is remarkable: it means that for the specific access patterns of inference (sequential weight reads, KV-cache lookups), NAND can approach DRAM performance when the packaging and interface are redesigned.
| Memory Tier | Read Bandwidth | Capacity | Cost/GB | Interface | Latency |
|---|---|---|---|---|---|
| HBM3/HBM3e | 3.35–8 TB/s | 80–192 GB | $10–25 | On-package TSV | ~100 ns |
| HBF (projected) | ~1–2 TB/s | 0.5–3 TB | $1–3 | Near-package TSV | ~1–5 μs |
| CXL DDR5 | 36–72 GB/s | 128 GB–1 TB | $3–6 | CXL 2.0/3.0 | ~200–400 ns |
| NVMe Gen 5 | 7–14 GB/s | 4–256 TB | $0.08–0.15 | PCIe Gen 5 | ~50–100 μs |
Why inference, not training
HBF is explicitly positioned for inference, not training. The reason is write endurance. Training workloads generate enormous write volumes — checkpoint saves, gradient accumulation, optimizer state updates — that would exhaust NAND endurance quickly. Inference workloads are overwhelmingly read-heavy: weight loading is a one-time sequential read, and KV-cache access is read-dominant with limited writes (only when new tokens extend the cache).
This read/write asymmetry is the key insight. NAND flash reads are fundamentally faster than NAND writes — and the performance gap widens with advanced 3D NAND architectures. HBF exploits this by optimizing the entire packaging, controller, and interface for read throughput, accepting lower write performance as a design tradeoff.
What HBF changes for inference architecture
Model parallelism becomes optional for capacity
Today, a 405B model at fp16 requires at least 5–6 H200-class GPUs purely for weight capacity (810 GB ÷ ~140 GB usable HBM per GPU). With 1–3 TB of HBF per GPU, the same model could fit on a single GPU for weight storage — with tensor parallelism needed only for compute throughput, not memory capacity. This decouples the compute-scaling decision from the memory-capacity decision.
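The capacity arithmetic above can be sketched directly (the ~140 GB usable-HBM figure and the 1 TB HBF figure are this section's assumptions, not measured values):

```python
# GPUs needed purely to *hold* the weights, before any compute
# considerations. All capacity figures are the assumptions from the text.
import math

params_b = 405           # parameters, in billions
bytes_per_param = 2      # fp16
weights_gb = params_b * bytes_per_param   # 810 GB

usable_hbm_gb = 140      # assumed usable HBM per GPU
hbf_per_gpu_gb = 1000    # low end of the projected 1-3 TB HBF range

gpus_hbm = math.ceil(weights_gb / usable_hbm_gb)   # -> 6
gpus_hbf = math.ceil(weights_gb / hbf_per_gpu_gb)  # -> 1
print(gpus_hbm, gpus_hbf)
```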
KV-cache pressure drops dramatically
The KV cache for a 1M-token context at fp16 is ~15 GB for a model with aggressive KV-head sharing (multi-query attention); models with more KV heads need proportionally more. With 80 GB HBM, that leaves 65 GB for weights and other state — tight for large models. With 1 TB of HBF, the KV cache can spill into a tier that is roughly 100× faster than NVMe and sits on-package rather than across a PCIe link. The latency penalty for a KV-cache miss to HBF (~1–5 μs) is one to two orders of magnitude lower than a page-fault-driven spill to host DRAM over PCIe (~200 μs once driver and migration overheads are included).
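The KV-cache footprint is highly model-dependent. Here is the standard sizing formula with hypothetical model shapes: the multi-query configuration below lands near the ~15 GB figure, while a grouped-query model with 8 KV heads needs 8× more.

```python
# Standard KV-cache sizing: 2 (K and V) x layers x kv_heads x head_dim
# x bytes per element, per token. The model shapes below are
# hypothetical examples, not any specific model's configuration.
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token_bytes / 1e9

# Multi-query attention (1 shared KV head): ~16 GB for 1M tokens.
print(f"{kv_cache_gb(1_000_000, 32, 1, 128):.1f} GB (MQA)")
# Grouped-query attention (8 KV heads): 8x larger.
print(f"{kv_cache_gb(1_000_000, 32, 8, 128):.1f} GB (GQA)")
```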
The hybrid memory controller
HBF requires a new kind of memory controller — one that manages both DRAM (HBM) and NAND (HBF) tiers transparently. Data placement decisions (which weights stay in HBM, which spill to HBF) must be made by the controller based on access frequency, latency sensitivity, and the current-step's compute requirements. This is essentially the same eviction-and-residency problem that software-level KV-cache policies solve today — but pushed down into hardware.
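As a software analogue of that residency problem, here is a minimal frequency-based placement sketch. It is illustrative only: the class and method names are invented, and a real hardware controller would operate at different granularity with richer heuristics.

```python
# Hot blocks stay in the small fast tier (HBM); cold blocks live in the
# large slow tier (HBF). An LFU-style toy version of the controller's
# eviction-and-residency decision. All names are illustrative.
from collections import Counter

class HybridPlacer:
    def __init__(self, hbm_slots):
        self.hbm_slots = hbm_slots   # how many blocks fit in HBM
        self.freq = Counter()        # access counts per block id
        self.hbm = set()             # blocks currently HBM-resident

    def access(self, block_id):
        """Record an access; return the tier that served it."""
        served = "HBM" if block_id in self.hbm else "HBF"
        self.freq[block_id] += 1
        self._rebalance(block_id)
        return served

    def _rebalance(self, block_id):
        if block_id in self.hbm:
            return
        if len(self.hbm) < self.hbm_slots:
            self.hbm.add(block_id)
            return
        # Evict the coldest resident block only if the new one is hotter.
        coldest = min(self.hbm, key=lambda b: self.freq[b])
        if self.freq[block_id] > self.freq[coldest]:
            self.hbm.discard(coldest)
            self.hbm.add(block_id)

placer = HybridPlacer(hbm_slots=2)
for b in ["w0", "w1", "w0", "w2", "w0", "w1"]:
    placer.access(b)
print(sorted(placer.hbm))  # -> ['w0', 'w1']: the two hottest blocks
```

The hardware version of this problem is harder than the software one precisely because, as noted below under uncertainties, a misplacement cannot be patched after the fact.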
The industry timeline
- H2 2026: SanDisk delivers first HBF samples to hyperscaler customers for evaluation.
- 2026–2027: SK Hynix and Samsung advance competing approaches — SK Hynix through its AI-N (AI-optimized NAND) product line, Samsung through its 7th-generation Z-NAND "Memory Class Storage."
- 2027: First commercial AI-inference devices with HBF expected. Integration with next-generation GPU architectures (post-Blackwell) likely requires co-design with NVIDIA or AMD.
- 2028+: Potential standardization of HBF interfaces, enabling a competitive supplier ecosystem similar to HBM's JEDEC standardization.
What remains uncertain
- Write endurance in practice. Even inference workloads generate some writes (KV-cache insertion, weight updates for online learning). Whether HBF's NAND endurance is sufficient for mixed inference workloads over a 3–5 year server life remains unvalidated at scale.
- Controller complexity. A hybrid HBM+HBF memory controller is a new design challenge. Incorrect placement decisions made in hardware cannot be overridden by software, so every access to a misplaced block pays the full cross-tier latency penalty.
- Thermal constraints. Placing NAND dies near a 700W GPU creates thermal challenges. NAND performance degrades above 85°C, and GPU packages routinely exceed 100°C at the die surface. Thermal isolation between the NAND stack and the GPU die requires careful interposer design.
- Cost trajectory. HBF's value proposition depends on maintaining a 5–10× cost-per-GB advantage over HBM. If HBM costs decline faster than expected (through yield improvements or competition), HBF's economic case weakens.
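The endurance question can at least be bounded with a back-of-envelope budget. All figures below are assumptions for illustration: a 1 TB device, a TLC-class rating of 3,000 program/erase cycles, and a 5-year service life.

```python
# Rough NAND endurance budget: total writable bytes over the device's
# life is capacity x rated P/E cycles. Figures are assumptions, not
# HBF specifications (which have not been published).
capacity_tb = 1.0
pe_cycles = 3000                     # assumed TLC-class rating
lifetime_writes_tb = capacity_tb * pe_cycles   # 3000 TB total

service_years = 5
budget_tb_per_day = lifetime_writes_tb / (service_years * 365)
print(f"{budget_tb_per_day:.2f} TB/day sustained write budget")
```

Under these assumptions the device tolerates roughly 1.6 TB of writes per day, so the open question is whether KV-cache insertion and any online weight updates stay comfortably under that line in production.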
HBF is not faster HBM. It is cheaper, larger, and optimized for reads. The question is whether "cheaper, larger, and read-optimized" is enough to change the economics of inference — and the early numbers suggest it is.