## 01. Why the Next Bottleneck Is Inevitable
Every hardware era produces a bottleneck that constrains the era that follows it. In the PC era, CPUs starved for cache. In mobile, power walls constrained clock speeds. In the deep learning era, the GPU's on-die compute ran ahead of the memory bandwidth that could feed it. HBM — High Bandwidth Memory — was the answer to that: DRAM dies stacked vertically, connected through silicon interposers, delivering terabytes-per-second to hungry accelerators.
HBM solved the bandwidth problem for training. But training is only half the story. The other half — inference — has a fundamentally different workload shape. When a model reasons over long contexts, manages multi-turn memory, or runs agentic chains of tasks, it generates enormous quantities of intermediate state: KV caches, conversation histories, chain-of-thought traces, tool outputs. This data isn't small. At 200K token contexts with a 70B-parameter model, the KV cache alone can consume dozens of gigabytes per request.
"What determines speed is HBM. What determines capacity is HBF."
— Prof. Kim Jung-ho, KAIST School of Electrical and Electronic Engineering, Feb 2026

HBM's capacity ceiling — typically 80–160 GB on the largest accelerator packages today — is insufficient to hold this explosion of context state. Worse, HBM is priced at a premium that makes using it for bulk context storage economically indefensible. Enter HBF: a new memory tier built from NAND flash, stacked with near-memory bandwidth ambitions, designed to hold what HBM cannot.
## 02. Anatomy of HBF: How It's Built
HBF takes the packaging philosophy of HBM and applies it to NAND flash instead of DRAM. The result is a fundamentally different device with different physics, different access patterns, and different failure modes — but packaged in a way that can sit on the same silicon substrate as an AI accelerator.
### Key structural differences from HBM
NAND flash is fundamentally non-volatile: data survives power loss. It stores charge in floating-gate or charge-trap transistors, with cells holding 1 (SLC), 2 (MLC), 3 (TLC), or 4 (QLC) bits. More bits per cell = more density but higher latency and lower endurance. HBF for AI inference will likely target TLC or QLC to maximize capacity, with controller intelligence compensating for the latency penalty.
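The density/endurance trade-off can be made concrete with a rough calculation. A minimal sketch, where the TLC and QLC program/erase figures are the approximate values quoted elsewhere in this article, and the SLC and MLC figures are commonly cited ballpark numbers (assumptions, not spec values):

```python
# Illustrative cell-type trade-off: more bits per cell multiplies density,
# while endurance (rated P/E cycles) drops sharply.
CELL_TYPES = {
    # name: (bits_per_cell, approx_pe_cycles)
    "SLC": (1, 100_000),  # ballpark assumption
    "MLC": (2, 10_000),   # ballpark assumption
    "TLC": (3, 1_000),    # figure used in this article
    "QLC": (4, 300),      # figure used in this article
}

def die_capacity_gib(num_cells: int, cell_type: str) -> float:
    """Raw capacity of a die with `num_cells` cells, in GiB."""
    bits, _ = CELL_TYPES[cell_type]
    return num_cells * bits / 8 / 2**30

cells = 2**40  # hypothetical 1-Tcell die
for name, (_, pe) in CELL_TYPES.items():
    print(f"{name}: {die_capacity_gib(cells, name):5.0f} GiB, ~{pe} P/E cycles")
```

The same die quadruples in capacity from SLC to QLC while its write budget shrinks by two to three orders of magnitude — which is why QLC-based HBF leans so heavily on controller intelligence.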
The latency gap is the central engineering challenge. DRAM reads in ~10–50 nanoseconds. NAND reads in ~10–100 microseconds — three to four orders of magnitude slower. For HBF to be useful in a memory hierarchy, the controller must use aggressive prefetching, large read-ahead caches, and predictive scheduling to hide this latency behind compute.
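How much prefetching is "aggressive"? Little's law gives the answer: to keep the bus saturated, the controller must keep bandwidth × latency bytes in flight. A sketch using the bandwidth target quoted later in this article; the 16 KiB NAND page size is an illustrative assumption:

```python
import math

def inflight_bytes(bandwidth_bps: float, latency_s: float) -> float:
    """Little's law: bytes that must be in flight to keep the bus busy."""
    return bandwidth_bps * latency_s

def outstanding_reads(bandwidth_bps: float, latency_s: float,
                      page_bytes: int) -> int:
    """Concurrent page reads the prefetcher must sustain to hide latency."""
    return math.ceil(inflight_bytes(bandwidth_bps, latency_s) / page_bytes)

# ~500 GB/s HBF bandwidth target, ~50 us NAND read latency, 16 KiB pages
print(outstanding_reads(500e9, 50e-6, 16 * 1024))  # -> 1526
```

Roughly 25 MB, or more than 1,500 page reads, must be in flight at any instant — which is why prediction, not raw parallelism alone, is the core of the design.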
## 03. HBF vs HBM: Division of Labor
The correct mental model is not "HBF replaces HBM" but rather "HBF and HBM co-exist in a tiered architecture, each doing what it does best." The distinction maps almost exactly onto the difference between DRAM and SSD in today's servers — except both tiers are now near-memory rather than across a PCIe bus.
| Attribute | HBM (DRAM stacks) | HBF (NAND stacks) |
|---|---|---|
| Technology | DRAM (volatile) | NAND flash (non-volatile) |
| Capacity | 80–160 GB | 1–8 TB |
| Bandwidth | 3–4 TB/s | ~500 GB/s target |
| Read latency | ~50 ns | ~10–100 µs (raw); ~1–10 µs w/ ctrl |
| Endurance | Unlimited writes | Limited P/E cycles (TLC ~1000, QLC ~300) |
| $/GB (est.) | ~$10–$15/GB | ~$0.10–$0.30/GB target |
| Primary AI role | Weights, activations, hot compute | KV cache, conversation state, context buffers |
| Power at idle | High (refresh cycles) | Very low (non-volatile, no refresh) |
| Interface | Wide 1024-bit+ HBM bus | JEDEC HBF standard (in progress) |
| Packaging | HBM + GPU on interposer | NAND stack + controller on same substrate |
"HBM is optimized for handling data used immediately; its capacity itself is inherently limited. Processing enormous amounts of context data using HBM alone would impose significant cost burdens."
— Article synthesis, April 30, 2026

## 04. The Makers: Who's Building HBF
As of 2026, the HBF landscape is early but moving fast. A handful of companies are leading the technical and standardization race.
### SK hynix
- Announced HBF standardization partnership with SanDisk, March 2026
- Leading push for JEDEC HBF standard alongside DRAM TSV leadership
- Leverages existing 3D NAND (176L+) and TSV packaging know-how
- HBM3E dominance gives them package integration expertise
- David Patterson cited SK hynix as a key HBF development player
### SanDisk
- Co-leading global HBF standardization with SK hynix
- Deep NAND controller IP (NVMe, ZNS, UFS backgrounds)
- BiCS NAND architecture adaptable to stacked configurations
- Experience with high-density QLC at enterprise scale
### Samsung Electronics
- V-NAND technology leader (200+ layer process nodes)
- HBM2E/HBM3 production with Aquabolt packaging
- Building toward CXL-attached NAND and near-memory flash
- CXL memory module research intersects with HBF concepts
### Micron Technology
- 232-layer NAND process with leading density
- CXL DRAM modules bridge toward programmable near-memory
- Computational storage research relevant to HBF controller arch
- HBM3E production ramping for AI customers
### Kioxia
- Jointly developed BiCS NAND with WD/SanDisk
- Participating in standardization as NAND IP holder
- Strong in QLC density optimization for capacity-heavy workloads
### JEDEC standardization
- HBF standardization effort in progress (early-stage as of 2026)
- Interface spec will define pin-out, bus width, command protocol
- Open standard critical to ecosystem adoption by NVIDIA, AMD, Intel
- Timeline: formal spec expected 2026–2028 window
## 05. The Controller: The Hardest Engineering Problem
If NAND is the medium, the HBF controller is the intelligence that makes it behave like memory rather than storage. This is arguably the hardest part of the HBF design problem — the controller must compensate for NAND's fundamental physics in real time.
### Flash Translation Layer (FTL)
NAND cannot overwrite data in-place — pages must be erased (in blocks) before being re-written. The FTL maintains a logical-to-physical address mapping table, handles out-of-place writes, and tracks which physical pages are valid. For HBF, the FTL mapping table itself can consume gigabytes of SRAM and must be optimized for near-memory access patterns, which look very different from NVMe storage patterns.
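A toy page-level FTL makes the out-of-place write rule concrete. This is a minimal sketch of the mapping logic only — no garbage collection, wear leveling, or persistence, and all names are invented for illustration:

```python
class PageFTL:
    """Toy page-level FTL: every logical write lands on a fresh physical
    page, and the previous copy is marked invalid (to be erased later)."""

    def __init__(self, num_phys_pages: int):
        self.l2p = {}                        # logical page -> physical page
        self.valid = set()                   # physical pages with live data
        self.free = list(range(num_phys_pages))

    def write(self, lpn: int) -> int:
        if not self.free:
            raise RuntimeError("no free pages: garbage collection needed")
        old = self.l2p.get(lpn)
        if old is not None:
            self.valid.discard(old)          # out-of-place: old copy dies
        ppn = self.free.pop(0)
        self.l2p[lpn] = ppn
        self.valid.add(ppn)
        return ppn

    def read(self, lpn: int) -> int:
        return self.l2p[lpn]                 # table lookup on every access
```

Rewriting the same logical page twice consumes two physical pages and leaves one invalid — the seed of both garbage collection and write amplification.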
### Prefetch and prediction
The critical latency-hiding mechanism for HBF is intelligent prefetching. Because an AI model's memory access pattern during inference is semi-predictable — attention heads access KV cache in structured strides, conversation state is accessed sequentially — the HBF controller can use small learned models or heuristic engines to predict which pages will be needed and load them into the on-die read cache before the GPU asks. This is the difference between raw NAND latency (~50 µs) and effective latency as seen by the GPU (~1–5 µs).
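The simplest such heuristic engine is a stride detector. The sketch below is purely illustrative — not how any shipping controller works — and predicts the next few page addresses whenever two consecutive accesses differ by a constant stride:

```python
class StridePrefetcher:
    """Toy stride predictor: if the last two accesses differ by a constant
    stride, speculatively issue reads for the next `depth` pages."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self.history = []          # last two page addresses seen

    def access(self, page: int) -> list[int]:
        self.history = (self.history + [page])[-2:]
        if len(self.history) == 2:
            stride = self.history[1] - self.history[0]
            if stride != 0:
                # Predicted pages to pull into the read cache ahead of demand
                return [page + stride * i for i in range(1, self.depth + 1)]
        return []
```

Strided KV-cache reads during attention are exactly the pattern this catches; real controllers would layer confidence counters and learned models on top.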
### ECC and endurance
NAND flash has finite write endurance: TLC NAND sustains roughly 1,000 program/erase (P/E) cycles per cell, and QLC roughly 300. For an AI inference workload with heavy KV cache writes, wear leveling is not optional. The controller must distribute writes across all cells evenly, over-provision capacity for spare-area writes, and use LDPC (Low-Density Parity Check) codes to correct bit errors that accumulate as cells age.
Write amplification (WA) measures how many actual NAND writes result from one logical write. WA = 1 is ideal; GC and over-provisioning push WA to 1.5–3× in practice. For HBF in inference, minimizing WA extends device life and reduces background GC interference with latency-sensitive reads.
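Endurance, WA, and write rate combine into a back-of-envelope device lifetime. A sketch using the QLC P/E figure from the table above; the capacity and sustained write rate are illustrative assumptions:

```python
def lifetime_days(capacity_bytes: float, pe_cycles: int,
                  write_rate_bps: float, wa: float) -> float:
    """Days until the rated P/E budget is exhausted at a sustained write rate."""
    nand_write_budget = capacity_bytes * pe_cycles  # total physical bytes writable
    nand_write_rate = write_rate_bps * wa           # logical writes, amplified
    return nand_write_budget / nand_write_rate / 86_400

# 4 TB QLC stack, ~300 P/E cycles, 100 MB/s sustained KV writes, WA = 2
print(f"{lifetime_days(4e12, 300, 100e6, 2.0):.0f} days")  # -> 69 days
```

Numbers like this — barely months at a modest sustained write rate — are why keeping WA near 1 and scheduling GC away from hot periods is existential for HBF, not an optimization.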
## 06. Kernel Changes: What Linux Needs for HBF
HBF is not an NVMe SSD and not DRAM — it's a new memory class that falls between them. Current Linux kernels (5.x/6.x) have partial infrastructure for this via PMEM and CXL, but HBF will require targeted work across several subsystems.
### 1. NUMA-aware tiered memory management
HBF should appear to the kernel as a NUMA node with a distinct memory latency class — distinct from HBM (fast) and NVMe (very slow). The kernel's NUMA balancing daemon needs to understand "medium-latency, high-capacity" as a valid migration target.
### 2. DAMON-based hot/warm/cold page tracking
DAMON (Data Access MONitor) was merged in Linux 5.15 and provides lightweight access frequency tracking. For HBF, DAMON policies need extending to demote warm KV-cache pages from HBM → HBF and promote hot pages back, using async migration threads with HBF-specific latency budgets.
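The demote/promote decision itself can be sketched as a threshold policy over per-page access counts from a sampling window. Everything here — tier labels, thresholds, and data layout — is a hypothetical model of the policy, not real DAMON or kernel API:

```python
def plan_migrations(pages: dict, hot_threshold: int, cold_threshold: int):
    """pages maps page_id -> (current_tier, access_count_this_window).
    Returns (promote_to_hbm, demote_to_hbf) lists of page ids."""
    promote = [p for p, (tier, n) in pages.items()
               if tier == "HBF" and n >= hot_threshold]   # hot pages move up
    demote = [p for p, (tier, n) in pages.items()
              if tier == "HBM" and n <= cold_threshold]   # cold pages move down
    return promote, demote

window = {1: ("HBM", 900), 2: ("HBM", 3), 3: ("HBF", 120), 4: ("HBF", 0)}
print(plan_migrations(window, hot_threshold=100, cold_threshold=5))
# -> ([3], [2])
```

A real policy would also rate-limit migrations against the HBF latency budget, since demotion traffic competes with latency-sensitive inference reads.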
### 3. DAX / mmap() with HBF latency annotations
AI runtimes like vLLM directly manage their KV-cache memory. They need a zero-copy mmap interface to HBF, similar to the PMEM DAX path, but with semantics that tolerate asynchronous reads. The MAP_HBF hint or a new memfd_create() flag could signal to the kernel that this mapping is HBF-backed and should use read-ahead caching differently from DRAM.
### 4. io_uring passthrough for sub-microsecond I/O
For HBF accesses that go through the block layer (not via DAX mmap), the io_uring passthrough path (merged in 6.0) avoids the overhead of the full block layer queue. HBF drivers should expose a passthrough interface that lets the GPU driver submit read SQEs directly, bypassing VFS overhead and enabling polling-mode completions instead of interrupt-driven I/O.
### 5. Huge page and THP support for large KV allocations
KV cache allocations are large and contiguous. Transparent Huge Pages (THP) reduce TLB pressure and improve NAND page-boundary alignment. The HBF driver should request 2MB huge pages from the allocator, mapping them to 2MB NAND erase-block-aligned regions to minimize write amplification from partial-block updates.
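The alignment itself is simple arithmetic. A sketch assuming 2 MiB erase blocks (real HBF block sizes are not yet specified):

```python
ERASE_BLOCK = 2 * 1024 * 1024  # assumed 2 MiB, matching x86 huge pages

def align_to_erase_blocks(offset: int, length: int,
                          block: int = ERASE_BLOCK) -> tuple[int, int]:
    """Expand [offset, offset+length) so it starts and ends on erase-block
    boundaries; an unaligned write forces read-modify-write of a block."""
    start = offset - offset % block
    end = -(-(offset + length) // block) * block  # ceil to block boundary
    return start, end

# A 1 KiB update landing at the 3 MiB mark touches the 2-4 MiB erase block:
print(align_to_erase_blocks(3 * 1024 * 1024, 1024))  # -> (2097152, 4194304)
```

Allocating KV regions so `start` and `end` already coincide with the request keeps every flush within whole erase blocks and holds write amplification down.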
### 6. Wear-aware write scheduling in the block layer
The block layer's I/O scheduler (mq-deadline, kyber) does not understand NAND wear. A new HBF-aware scheduler or elevator plugin should: batch writes to minimize GC overhead, co-schedule GC with idle inference periods, and export wear telemetry via sysfs so the AI orchestration layer can shift write load between HBF devices.
## 07. Why Inference Is HBF's Perfect Workload
Not every compute workload benefits equally from HBF. Training — which continuously reads and writes activations and gradients with extremely high bandwidth demand — is squarely in HBM's territory. But inference has a different access profile that plays to HBF's strengths.
### The KV cache explosion
Modern LLMs use attention mechanisms that require caching key-value pairs for every token in the context window. For a 70B-parameter model with 128 attention heads running a 200K-token context, the KV cache can reach 50–100 GB per concurrent session. Multiply this by tens or hundreds of concurrent users, and the KV cache demand vastly exceeds HBM capacity. Without an overflow tier, the system must evict and recompute — wasting seconds of compute per request.
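The arithmetic behind figures like these is straightforward. A sketch with a hypothetical 70B-class configuration — the layer count, KV-head count, and head dimension are illustrative assumptions, not any specific model's published architecture:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: 2 tensors (K and V) per layer, each
    kv_heads x head_dim per token, stored at bytes_per_elem (fp16 = 2)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 2**30

# 80 layers, grouped-query attention with 8 KV heads of dim 128,
# fp16 values, 200K-token context:
print(f"{kv_cache_gib(80, 8, 128, 200_000):.1f} GiB")  # -> 61.0 GiB
```

Full multi-head attention (128 KV heads instead of 8) would multiply this by 16, which is why per-session figures span such a wide range across model families.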
### Access pattern predictability
KV cache access in autoregressive generation is highly predictable: the model accesses token positions in sequence, with attention concentrating on recent and salient tokens. This is exactly the kind of pattern that an HBF prefetch engine can exploit — predicting which cache pages will be needed for the next attention layer and preloading them from NAND into the on-die SRAM buffer before the GPU asks.
### Agentic and multi-turn workloads
As AI shifts from single-turn chat to long-running agentic workflows — where a model maintains context across hundreds of tool calls, searches, and sub-tasks — the state that must persist grows dramatically. Between turns, this state can be "parked" on HBF (non-volatile!) and retrieved without re-loading from slower storage or recomputing. This is HBF's killer advantage over DRAM: persistence eliminates cold-start latency for long-horizon tasks.
## 08. HBF Development Timeline
## 09. The Verdict: Why This Matters
Every decade of computing has had a memory story. The 1990s were about adding L2 cache so CPUs didn't starve. The 2000s were about DDR scaling to meet multi-core demand. The 2010s were about NVMe replacing SATA and DRAM pressure from in-memory databases. The 2020s are about HBM feeding tensor cores.
The 2030s will be about HBF enabling AI to remember.
The fundamental shift is this: AI is transitioning from a stateless compute problem (training: ingest data, update weights, discard activations) to a stateful inference problem (maintain context, accumulate memory, reason over history). Stateful AI needs a memory tier that is large, relatively fast, low-power, and persistent. DRAM is too small and too expensive. NVMe is too slow. HBF is built to be exactly what sits between them.
The technical challenges are real — latency hiding, wear management, standardization, kernel integration, write amplification, FTL overhead. None of them are insurmountable. The physics of NAND flash and the packaging innovation of TSV stacking are both mature enough to support HBF. The remaining work is controller intelligence, standards alignment, and software ecosystem development.
David Patterson helped drive two of the most important architectural shifts in computing history — RISC in the 1980s and, decades later, the RISC-V open ISA. When he says HBF is the next bottleneck, the industry listens. This time, the bottleneck is not about processing power. It is about giving AI the memory it needs to think at depth.
"If the CPU was the core in the PC era and low-power technology was the core in the smartphone era, memory will be the core of the AI era."
— Prof. Kim Jung-ho, KAIST, February 2026