## 01. Why the Next Bottleneck Is Inevitable
Every hardware era produces a bottleneck that constrains the era that follows it. In the PC era, CPUs starved for cache. In mobile, power walls constrained clock speeds. In the deep learning era, the GPU's on-die compute ran ahead of the memory bandwidth that could feed it. HBM — High Bandwidth Memory — was the answer to that: DRAM dies stacked vertically, connected through silicon interposers, delivering terabytes-per-second to hungry accelerators.
HBM solved the bandwidth problem for training. But training is only half the story. The other half — inference — has a fundamentally different workload shape. When a model reasons over long contexts, manages multi-turn memory, or runs agentic chains of tasks, it generates enormous quantities of intermediate state: KV caches, conversation histories, chain-of-thought traces, tool outputs. This data isn't small. At 200K token contexts with a 70B-parameter model, the KV cache alone can consume dozens of gigabytes per request.
"What determines speed is HBM. What determines capacity is HBF."
— Prof. Kim Jung-ho, KAIST School of Electrical and Electronic Engineering, Feb 2026

HBM's capacity ceiling — typically 80–160 GB on the largest accelerator packages today — is insufficient to hold this explosion of context state. Worse, HBM is priced at a premium that makes using it for bulk context storage economically indefensible. Enter HBF: a new memory tier built from NAND flash, stacked with near-memory bandwidth ambitions, designed to hold what HBM cannot.
## 02. Anatomy of HBF: How It's Built
HBF takes the packaging philosophy of HBM and applies it to NAND flash instead of DRAM. The result is a fundamentally different device with different physics, different access patterns, and different failure modes — but packaged in a way that can sit on the same silicon substrate as an AI accelerator.
### Key structural differences from HBM
NAND flash is fundamentally non-volatile: data survives power loss. It stores charge in floating-gate or charge-trap transistors, with cells holding 1 (SLC), 2 (MLC), 3 (TLC), or 4 (QLC) bits. More bits per cell = more density but higher latency and lower endurance. HBF for AI inference will likely target TLC or QLC to maximize capacity, with controller intelligence compensating for the latency penalty.
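The density/endurance trade-off can be made concrete with a rough calculation. A minimal sketch, where the TLC and QLC program/erase figures are the approximate values quoted elsewhere in this article, and the SLC and MLC figures are commonly cited ballpark numbers (assumptions, not spec values):

```python
# Illustrative cell-type trade-off: more bits per cell multiplies density,
# while endurance (rated P/E cycles) drops sharply.
CELL_TYPES = {
    # name: (bits_per_cell, approx_pe_cycles)
    "SLC": (1, 100_000),  # ballpark assumption
    "MLC": (2, 10_000),   # ballpark assumption
    "TLC": (3, 1_000),    # figure used in this article
    "QLC": (4, 300),      # figure used in this article
}

def die_capacity_gib(num_cells: int, cell_type: str) -> float:
    """Raw capacity of a die with `num_cells` cells, in GiB."""
    bits, _ = CELL_TYPES[cell_type]
    return num_cells * bits / 8 / 2**30

cells = 2**40  # hypothetical 1-Tcell die
for name, (_, pe) in CELL_TYPES.items():
    print(f"{name}: {die_capacity_gib(cells, name):5.0f} GiB, ~{pe} P/E cycles")
```

The same die quadruples in capacity from SLC to QLC while its write budget shrinks by two to three orders of magnitude — which is why QLC-based HBF leans so heavily on controller intelligence.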
The latency gap is the central engineering challenge. DRAM reads in ~10–50 nanoseconds. NAND reads in ~10–100 microseconds — three to four orders of magnitude slower. For HBF to be useful in a memory hierarchy, the controller must use aggressive prefetching, large read-ahead caches, and predictive scheduling to hide this latency behind compute.
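How much prefetching is "aggressive"? Little's law gives the answer: to keep the bus saturated, the controller must keep bandwidth × latency bytes in flight. A sketch using the bandwidth target quoted later in this article; the 16 KiB NAND page size is an illustrative assumption:

```python
import math

def inflight_bytes(bandwidth_bps: float, latency_s: float) -> float:
    """Little's law: bytes that must be in flight to keep the bus busy."""
    return bandwidth_bps * latency_s

def outstanding_reads(bandwidth_bps: float, latency_s: float,
                      page_bytes: int) -> int:
    """Concurrent page reads the prefetcher must sustain to hide latency."""
    return math.ceil(inflight_bytes(bandwidth_bps, latency_s) / page_bytes)

# ~500 GB/s HBF bandwidth target, ~50 us NAND read latency, 16 KiB pages
print(outstanding_reads(500e9, 50e-6, 16 * 1024))  # -> 1526
```

Roughly 25 MB, or more than 1,500 page reads, must be in flight at any instant — which is why prediction, not raw parallelism alone, is the core of the design.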
## 03. HBF vs HBM: Division of Labor
The correct mental model is not "HBF replaces HBM" but rather "HBF and HBM co-exist in a tiered architecture, each doing what it does best." The distinction maps almost exactly onto the difference between DRAM and SSD in today's servers — except both tiers are now near-memory rather than across a PCIe bus.
| Attribute | HBM (DRAM stacks) | HBF (NAND stacks) |
|---|---|---|
| Technology | DRAM (volatile) | NAND flash (non-volatile) |
| Capacity | 80–160 GB | 1–8 TB |
| Bandwidth | 3–4 TB/s | ~500 GB/s target |
| Read latency | ~50 ns | ~10–100 µs (raw); ~1–10 µs w/ ctrl |
| Endurance | Unlimited writes | Limited P/E cycles (TLC ~1000, QLC ~300) |
| $/GB (est.) | ~$10–$15/GB | ~$0.10–$0.30/GB target |
| Primary AI role | Weights, activations, hot compute | KV cache, conversation state, context buffers |
| Power at idle | High (refresh cycles) | Very low (non-volatile, no refresh) |
| Interface | Wide 1024-bit+ HBM bus | JEDEC HBF standard (in progress) |
| Packaging | HBM + GPU on interposer | NAND stack + controller on same substrate |
"HBM is optimized for handling data used immediately; its capacity itself is inherently limited. Processing enormous amounts of context data using HBM alone would impose significant cost burdens."
— Article synthesis, April 30, 2026

## 04. The Makers: Who's Building HBF
As of 2026, the HBF landscape is early but moving fast. A handful of companies are leading the technical and standardization race.
### SK hynix
- Announced HBF standardization partnership with SanDisk, March 2026
- Leading push for JEDEC HBF standard alongside DRAM TSV leadership
- Leverages existing 3D NAND (176L+) and TSV packaging know-how
- HBM3E dominance gives them package integration expertise
- David Patterson cited SK hynix as a key HBF development player
### SanDisk
- Co-leading global HBF standardization with SK hynix
- Deep NAND controller IP (NVMe, ZNS, UFS backgrounds)
- BiCS NAND architecture adaptable to stacked configurations
- Experience with high-density QLC at enterprise scale
### Samsung Electronics
- V-NAND technology leader (200+ layer process nodes)
- HBM2E/HBM3 production with Aquabolt packaging
- Building toward CXL-attached NAND and near-memory flash
- CXL memory module research intersects with HBF concepts
### Micron Technology
- 232-layer NAND process with leading density
- CXL DRAM modules bridge toward programmable near-memory
- Computational storage research relevant to HBF controller arch
- HBM3E production ramping for AI customers
### Kioxia
- Jointly developed BiCS NAND with WD/SanDisk
- Participating in standardization as NAND IP holder
- Strong in QLC density optimization for capacity-heavy workloads
### JEDEC standardization
- HBF standardization effort in progress (early-stage as of 2026)
- Interface spec will define pin-out, bus width, command protocol
- Open standard critical to ecosystem adoption by NVIDIA, AMD, Intel
- Timeline: formal spec expected 2026–2028 window
## 05. The Controller: The Hardest Engineering Problem
If NAND is the medium, the HBF controller is the intelligence that makes it behave like memory rather than storage. This is arguably the hardest part of the HBF design problem — the controller must compensate for NAND's fundamental physics in real time.
### Flash Translation Layer (FTL)
NAND cannot overwrite data in-place — pages must be erased (in blocks) before being re-written. The FTL maintains a logical-to-physical address mapping table, handles out-of-place writes, and tracks which physical pages are valid. For HBF, the FTL mapping table itself can consume gigabytes of SRAM and must be optimized for near-memory access patterns, which look very different from NVMe storage patterns.
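A toy page-level FTL makes the out-of-place write rule concrete. This is a minimal sketch of the mapping logic only — no garbage collection, wear leveling, or persistence, and all names are invented for illustration:

```python
class PageFTL:
    """Toy page-level FTL: every logical write lands on a fresh physical
    page, and the previous copy is marked invalid (to be erased later)."""

    def __init__(self, num_phys_pages: int):
        self.l2p = {}                        # logical page -> physical page
        self.valid = set()                   # physical pages with live data
        self.free = list(range(num_phys_pages))

    def write(self, lpn: int) -> int:
        if not self.free:
            raise RuntimeError("no free pages: garbage collection needed")
        old = self.l2p.get(lpn)
        if old is not None:
            self.valid.discard(old)          # out-of-place: old copy dies
        ppn = self.free.pop(0)
        self.l2p[lpn] = ppn
        self.valid.add(ppn)
        return ppn

    def read(self, lpn: int) -> int:
        return self.l2p[lpn]                 # table lookup on every access
```

Rewriting the same logical page twice consumes two physical pages and leaves one invalid — the seed of both garbage collection and write amplification.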
### Prefetch and prediction
The critical latency-hiding mechanism for HBF is intelligent prefetching. Because an AI model's memory access pattern during inference is semi-predictable — attention heads access KV cache in structured strides, conversation state is accessed sequentially — the HBF controller can use small learned models or heuristic engines to predict which pages will be needed and load them into the on-die read cache before the GPU asks. This is the difference between raw NAND latency (~50 µs) and effective latency as seen by the GPU (~1–5 µs).
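The simplest such heuristic engine is a stride detector. The sketch below is purely illustrative — not how any shipping controller works — and predicts the next few page addresses whenever two consecutive accesses differ by a constant stride:

```python
class StridePrefetcher:
    """Toy stride predictor: if the last two accesses differ by a constant
    stride, speculatively issue reads for the next `depth` pages."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self.history = []          # last two page addresses seen

    def access(self, page: int) -> list[int]:
        self.history = (self.history + [page])[-2:]
        if len(self.history) == 2:
            stride = self.history[1] - self.history[0]
            if stride != 0:
                # Predicted pages to pull into the read cache ahead of demand
                return [page + stride * i for i in range(1, self.depth + 1)]
        return []
```

Strided KV-cache reads during attention are exactly the pattern this catches; real controllers would layer confidence counters and learned models on top.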
### ECC and endurance
NAND flash has finite write endurance: TLC NAND sustains roughly 1,000 program/erase (P/E) cycles per cell, and QLC roughly 300. For an AI inference workload with heavy KV cache writes, wear leveling is not optional. The controller must distribute writes across all cells evenly, over-provision capacity for spare-area writes, and use LDPC (Low-Density Parity Check) codes to correct bit errors that accumulate as cells age.
Write amplification (WA) measures how many actual NAND writes result from one logical write. WA = 1 is ideal; GC and over-provisioning push WA to 1.5–3× in practice. For HBF in inference, minimizing WA extends device life and reduces background GC interference with latency-sensitive reads.
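Endurance, WA, and write rate combine into a back-of-envelope device lifetime. A sketch using the QLC P/E figure from the table above; the capacity and sustained write rate are illustrative assumptions:

```python
def lifetime_days(capacity_bytes: float, pe_cycles: int,
                  write_rate_bps: float, wa: float) -> float:
    """Days until the rated P/E budget is exhausted at a sustained write rate."""
    nand_write_budget = capacity_bytes * pe_cycles  # total physical bytes writable
    nand_write_rate = write_rate_bps * wa           # logical writes, amplified
    return nand_write_budget / nand_write_rate / 86_400

# 4 TB QLC stack, ~300 P/E cycles, 100 MB/s sustained KV writes, WA = 2
print(f"{lifetime_days(4e12, 300, 100e6, 2.0):.0f} days")  # -> 69 days
```

Numbers like this — barely months at a modest sustained write rate — are why keeping WA near 1 and scheduling GC away from hot periods is existential for HBF, not an optimization.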
## 06. Kernel Changes: What Linux Needs for HBF
HBF is not an NVMe SSD and not DRAM — it's a new memory class that falls between them. Current Linux kernels (5.x/6.x) have partial infrastructure for this via PMEM and CXL, but HBF will require targeted work across several subsystems.
### 1. NUMA-aware tiered memory management
HBF should appear to the kernel as a NUMA node with a distinct memory latency class — distinct from HBM (fast) and NVMe (very slow). The kernel's NUMA balancing daemon needs to understand "medium-latency, high-capacity" as a valid migration target.
### 2. DAMON-based hot/warm/cold page tracking
DAMON (Data Access MONitor) was merged in Linux 5.15 and provides lightweight access frequency tracking. For HBF, DAMON policies need extending to demote warm KV-cache pages from HBM → HBF and promote hot pages back, using async migration threads with HBF-specific latency budgets.
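The demote/promote decision itself can be sketched as a threshold policy over per-page access counts from a sampling window. Everything here — tier labels, thresholds, and data layout — is a hypothetical model of the policy, not real DAMON or kernel API:

```python
def plan_migrations(pages: dict, hot_threshold: int, cold_threshold: int):
    """pages maps page_id -> (current_tier, access_count_this_window).
    Returns (promote_to_hbm, demote_to_hbf) lists of page ids."""
    promote = [p for p, (tier, n) in pages.items()
               if tier == "HBF" and n >= hot_threshold]   # hot pages move up
    demote = [p for p, (tier, n) in pages.items()
              if tier == "HBM" and n <= cold_threshold]   # cold pages move down
    return promote, demote

window = {1: ("HBM", 900), 2: ("HBM", 3), 3: ("HBF", 120), 4: ("HBF", 0)}
print(plan_migrations(window, hot_threshold=100, cold_threshold=5))
# -> ([3], [2])
```

A real policy would also rate-limit migrations against the HBF latency budget, since demotion traffic competes with latency-sensitive inference reads.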
### 3. DAX / mmap() with HBF latency annotations
AI runtimes like vLLM directly manage their KV-cache memory. They need a zero-copy mmap interface to HBF, similar to the PMEM DAX path, but with semantics that tolerate asynchronous reads. The MAP_HBF hint or a new memfd_create() flag could signal to the kernel that this mapping is HBF-backed and should use read-ahead caching differently from DRAM.
### 4. io_uring passthrough for sub-microsecond I/O
For HBF accesses that go through the block layer (not via DAX mmap), the io_uring passthrough path (merged in 6.0) avoids the overhead of the full block layer queue. HBF drivers should expose a passthrough interface that lets the GPU driver submit read SQEs directly, bypassing VFS overhead and enabling polling-mode completions instead of interrupt-driven I/O.
### 5. Huge page and THP support for large KV allocations
KV cache allocations are large and contiguous. Transparent Huge Pages (THP) reduce TLB pressure and improve NAND page-boundary alignment. The HBF driver should request 2MB huge pages from the allocator, mapping them to 2MB NAND erase-block-aligned regions to minimize write amplification from partial-block updates.
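The alignment itself is simple arithmetic. A sketch assuming 2 MiB erase blocks (real HBF block sizes are not yet specified):

```python
ERASE_BLOCK = 2 * 1024 * 1024  # assumed 2 MiB, matching x86 huge pages

def align_to_erase_blocks(offset: int, length: int,
                          block: int = ERASE_BLOCK) -> tuple[int, int]:
    """Expand [offset, offset+length) so it starts and ends on erase-block
    boundaries; an unaligned write forces read-modify-write of a block."""
    start = offset - offset % block
    end = -(-(offset + length) // block) * block  # ceil to block boundary
    return start, end

# A 1 KiB update landing at the 3 MiB mark touches the 2-4 MiB erase block:
print(align_to_erase_blocks(3 * 1024 * 1024, 1024))  # -> (2097152, 4194304)
```

Allocating KV regions so `start` and `end` already coincide with the request keeps every flush within whole erase blocks and holds write amplification down.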
### 6. Wear-aware write scheduling in the block layer
The block layer's I/O scheduler (mq-deadline, kyber) does not understand NAND wear. A new HBF-aware scheduler or elevator plugin should: batch writes to minimize GC overhead, co-schedule GC with idle inference periods, and export wear telemetry via sysfs so the AI orchestration layer can shift write load between HBF devices.
## 07. Why Inference Is HBF's Perfect Workload
Not every compute workload benefits equally from HBF. Training — which continuously reads and writes activations and gradients with extremely high bandwidth demand — is squarely in HBM's territory. But inference has a different access profile that plays to HBF's strengths.
### The KV cache explosion
Modern LLMs use attention mechanisms that require caching key-value pairs for every token in the context window. For a 70B-parameter model with 128 attention heads running a 200K-token context, the KV cache can reach 50–100 GB per concurrent session. Multiply this by tens or hundreds of concurrent users, and the KV cache demand vastly exceeds HBM capacity. Without an overflow tier, the system must evict and recompute — wasting seconds of compute per request.
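The arithmetic behind figures like these is straightforward. A sketch with a hypothetical 70B-class configuration — the layer count, KV-head count, and head dimension are illustrative assumptions, not any specific model's published architecture:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: 2 tensors (K and V) per layer, each
    kv_heads x head_dim per token, stored at bytes_per_elem (fp16 = 2)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 2**30

# 80 layers, grouped-query attention with 8 KV heads of dim 128,
# fp16 values, 200K-token context:
print(f"{kv_cache_gib(80, 8, 128, 200_000):.1f} GiB")  # -> 61.0 GiB
```

Full multi-head attention (128 KV heads instead of 8) would multiply this by 16, which is why per-session figures span such a wide range across model families.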
### Access pattern predictability
KV cache access in autoregressive generation is highly predictable: the model accesses token positions in sequence, with attention concentrating on recent and salient tokens. This is exactly the kind of pattern that an HBF prefetch engine can exploit — predicting which cache pages will be needed for the next attention layer and preloading them from NAND into the on-die SRAM buffer before the GPU asks.
### Agentic and multi-turn workloads
As AI shifts from single-turn chat to long-running agentic workflows — where a model maintains context across hundreds of tool calls, searches, and sub-tasks — the state that must persist grows dramatically. Between turns, this state can be "parked" on HBF (non-volatile!) and retrieved without re-loading from slower storage or recomputing. This is HBF's killer advantage over DRAM: persistence eliminates cold-start latency for long-horizon tasks.
## 08. HBF Development Timeline
## 09. The Verdict: Why This Matters
Every decade of computing has had a memory story. The 1990s were about adding L2 cache so CPUs didn't starve. The 2000s were about DDR scaling to meet multi-core demand. The 2010s were about NVMe replacing SATA and DRAM pressure from in-memory databases. The 2020s are about HBM feeding tensor cores.
The 2030s will be about HBF enabling AI to remember.
The fundamental shift is this: AI is transitioning from a stateless compute problem (training: ingest data, update weights, discard activations) to a stateful inference problem (maintain context, accumulate memory, reason over history). Stateful AI needs a memory tier that is large, relatively fast, low-power, and persistent. DRAM is too small and too expensive. NVMe is too slow. HBF is built to be exactly what sits between them.
The technical challenges are real — latency hiding, wear management, standardization, kernel integration, write amplification, FTL overhead. None of them are insurmountable. The physics of NAND flash and the packaging innovation of TSV stacking are both mature enough to support HBF. The remaining work is controller intelligence, standards alignment, and software ecosystem development.
David Patterson helped drive two of the most important architectural shifts in computing history — RISC in the 1980s and, decades later, the RISC-V open ISA. When he says HBF is the next bottleneck, the industry listens. This time, the bottleneck is not about processing power. It is about giving AI the memory it needs to think at depth.
"If the CPU was the core in the PC era and low-power technology was the core in the smartphone era, memory will be the core of the AI era."
— Prof. Kim Jung-ho, KAIST, February 2026