MANISH AI
Deep Dive  ·  Memory Architecture  ·  May 2026

The Rise of
High Bandwidth Flash

After HBM redefined AI training, the inference era demands a new memory tier — one that scales capacity, not just speed. HBF is that tier. This is the full technical story.


01 · Why the Next Bottleneck Is Inevitable

Every hardware era produces a bottleneck that constrains the era that follows it. In the PC era, CPUs starved for cache. In mobile, power walls constrained clock speeds. In the deep learning era, the GPU's on-die compute ran ahead of the memory bandwidth that could feed it. HBM — High Bandwidth Memory — was the answer to that: DRAM dies stacked vertically, connected through silicon interposers, delivering terabytes-per-second to hungry accelerators.

HBM solved the bandwidth problem for training. But training is only half the story. The other half — inference — has a fundamentally different workload shape. When a model reasons over long contexts, manages multi-turn memory, or runs agentic chains of tasks, it generates enormous quantities of intermediate state: KV caches, conversation histories, chain-of-thought traces, tool outputs. This data isn't small. At 200K token contexts with a 70B-parameter model, the KV cache alone can consume dozens of gigabytes per request.

"What determines speed is HBM. What determines capacity is HBF."

— Prof. Kim Jung-ho, KAIST School of Electrical and Electronic Engineering, Feb 2026

HBM's capacity ceiling — typically 80–160 GB on the largest accelerator packages today — is insufficient to hold this explosion of context state. Worse, HBM is priced at a premium that makes using it for bulk context storage economically indefensible. Enter HBF: a new memory tier built from NAND flash, stacked with near-memory bandwidth ambitions, designed to hold what HBM cannot.

Fig 1 — Memory hierarchy in the AI inference stack
[Diagram: four tiers, fastest to largest: SRAM/registers (on-die, ~50 MB, 50 TB/s) → HBM (stacked DRAM, 80–160 GB, 3–4 TB/s) → HBF (stacked NAND flash, 1–8 TB, ~500 GB/s target; holds KV cache, conversation state, agent context buffers) → NVMe SSD (enterprise flash, 1–30 TB, ~14 GB/s)]
▲ HBF occupies a new tier between DRAM and SSD — targeting the "large capacity, near-memory bandwidth" gap that no existing technology fills.

02 · Anatomy of HBF: How It's Built

HBF takes the packaging philosophy of HBM and applies it to NAND flash instead of DRAM. The result is a fundamentally different device with different physics, different access patterns, and different failure modes — but packaged in a way that can sit on the same silicon substrate as an AI accelerator.

Fig 2 — HBF die stack cross-section (schematic)
[Diagram: package base of silicon interposer / organic substrate; an HBM stack (base logic die plus 4–12 DRAM dies) shown alongside an HBF stack (in-package NAND controller die plus 8–16+ stacked QLC NAND dies connected via TSV)]
▲ HBF uses Through-Silicon Vias (TSVs) to connect stacked NAND dies to an in-package controller, just as HBM stacks DRAM with a base logic die. More layers = more capacity.

Key structural differences from HBM

NAND flash is fundamentally non-volatile: data survives power loss. It stores charge in floating-gate or charge-trap transistors, with cells holding 1 (SLC), 2 (MLC), 3 (TLC), or 4 (QLC) bits. More bits per cell = more density but higher latency and lower endurance. HBF for AI inference will likely target TLC or QLC to maximize capacity, with controller intelligence compensating for the latency penalty.

The latency gap is the central engineering challenge. DRAM reads in ~10–50 nanoseconds. NAND reads in ~10–100 microseconds — three to four orders of magnitude slower. For HBF to be useful in a memory hierarchy, the controller must use aggressive prefetching, large read-ahead caches, and predictive scheduling to hide this latency behind compute.
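A quick bandwidth-delay calculation shows why latency hiding dominates the design: to keep a ~500 GB/s bus busy while each read takes ~50 µs, roughly 25 MB of reads must be in flight at all times. A minimal sketch of the arithmetic, using the targets cited above (the 16 KB NAND page size is an assumption for illustration):

#include <stdio.h>

/* Back-of-the-envelope: data that must be in flight to keep a 500 GB/s
 * HBF bus busy when each NAND read takes ~50 us.
 * bandwidth_delay = bandwidth * latency                                */
int main(void)
{
    const double bandwidth_Bps = 500e9;   /* ~500 GB/s target bus        */
    const double latency_s     = 50e-6;   /* ~50 us typical NAND read    */
    const double page_bytes    = 16384;   /* assumed 16 KB NAND page     */

    double in_flight_bytes = bandwidth_Bps * latency_s;     /* ~25 MB    */
    double in_flight_pages = in_flight_bytes / page_bytes;  /* ~1,500    */

    printf("Data in flight: %.1f MB (~%.0f outstanding page reads)\n",
           in_flight_bytes / 1e6, in_flight_pages);
    return 0;
}

In other words, the controller needs queues roughly 1,500 page-reads deep and a read cache large enough to hold them, which is why the on-die SRAM/eDRAM buffer and prefetch engine are first-class blocks in the controller architecture described below.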

Capacity target: 1–8 TB per HBF package (10–50× HBM)
Bandwidth target: ~500 GB/s via wide bus + parallelism
NAND read latency: 10–100 µs (vs ~50 ns for DRAM; hidden by the controller)
Power vs DRAM: ~3–5× lower at rest (NAND is non-volatile)

03 · HBF vs HBM: Division of Labor

The correct mental model is not "HBF replaces HBM" but rather "HBF and HBM co-exist in a tiered architecture, each doing what it does best." The distinction maps almost exactly onto the difference between DRAM and SSD in today's servers — except both tiers are now near-memory rather than across a PCIe bus.

Attribute | HBM (DRAM stacks) | HBF (NAND stacks)
Technology | DRAM (volatile) | NAND flash (non-volatile)
Capacity | 80–160 GB | 1–8 TB
Bandwidth | 3–4 TB/s | ~500 GB/s target
Read latency | ~50 ns | ~10–100 µs raw; ~1–10 µs with controller
Endurance | Unlimited writes | Limited P/E cycles (TLC ~1,000, QLC ~300)
$/GB (est.) | ~$10–$15/GB | ~$0.10–$0.30/GB target
Primary AI role | Weights, activations, hot compute | KV cache, conversation state, context buffers
Power at idle | High (refresh cycles) | Very low (non-volatile, no refresh)
Interface | Wide 1024-bit+ HBM bus | Standardized — JEDEC HBF in progress
Packaging | HBM + GPU on interposer | NAND stack + controller on same substrate

"HBM is optimized for handling data used immediately; its capacity itself is inherently limited. Processing enormous amounts of context data using HBM alone would impose significant cost burdens."

— Article synthesis, April 30 2026
Fig 3 — Inference request: data flow across memory tiers
[Diagram: a new request plus conversation context enters HBF (KV-cache store, conversation history, tool-call results; 1–8 TB); pages are prefetched into HBM (hot model weights, active KV cache, activations; 80–160 GB) and consumed by the GPU/TPU (SRAM, tensor cores) to generate tokens; evicted KV pages are written back to HBF]
▲ Inference loop: new context is prefetched from HBF → promoted to HBM → consumed by accelerator. Evicted pages flow back to HBF. Neither tier can serve the full workflow alone.

04 · The Makers: Who's Building HBF

As of 2026, the HBF landscape is early but moving fast. A handful of companies are leading the technical and standardization race.

SK hynix
NAND + DRAM vertically integrated
  • Announced HBF standardization partnership with SanDisk, March 2026
  • Leading push for JEDEC HBF standard alongside DRAM TSV leadership
  • Leverages existing 3D NAND (176L+) and TSV packaging know-how
  • HBM3E dominance gives them package integration expertise
  • David Patterson cited SK hynix as a key HBF development player
SanDisk (WD)
NAND specialist, HBF standards co-lead
  • Co-leading global HBF standardization with SK hynix
  • Deep NAND controller IP (NVMe, ZNS, UFS backgrounds)
  • BiCS NAND architecture adaptable to stacked configurations
  • Experience with high-density QLC at enterprise scale
Samsung
NAND + DRAM + logic — full-stack
  • V-NAND technology leader (200+ layer process nodes)
  • HBM2E/HBM3 production with Aquabolt packaging
  • Building toward CXL-attached NAND and near-memory flash
  • CXL memory module research intersects with HBF concepts
Micron
NAND + DRAM, CXL and near-memory focus
  • 232-layer NAND process with leading density
  • CXL DRAM modules bridge toward programmable near-memory
  • Computational storage research relevant to HBF controller arch
  • HBM3E production ramping for AI customers
Kioxia
BiCS NAND, SanDisk's JV partner
  • Jointly developed BiCS NAND with WD/SanDisk
  • Participating in standardization as NAND IP holder
  • Strong in QLC density optimization for capacity-heavy workloads
JEDEC / Standards
Standardization body
  • HBF standardization effort in progress (early-stage as of 2026)
  • Interface spec will define pin-out, bus width, command protocol
  • Open standard critical to ecosystem adoption by NVIDIA, AMD, Intel
  • Timeline: formal spec expected 2026–2028 window

05 · The Controller: The Hardest Engineering Problem

If NAND is the medium, the HBF controller is the intelligence that makes it behave like memory rather than storage. This is arguably the hardest part of the HBF design problem — the controller must compensate for NAND's fundamental physics in real time.

Fig 4 — HBF controller architecture block diagram
[Diagram: the host GPU/accelerator connects over an ~500 GB/s HBF bus interface to the in-package HBF controller die, which contains an FTL engine (logical→physical mapping), read cache (SRAM/eDRAM buffer), prefetch engine (AI-driven page prediction), ECC/LDPC (error correction and retry), wear leveling (P/E cycle management), garbage collector (background GC / write amplification control), and power & thermal management, all feeding the NAND die stack (8–16 dies over parallel I/O channels)]
▲ The HBF controller must solve FTL mapping, latency hiding, ECC, wear leveling, and GC — all in-package, with sub-microsecond response budgets for prefetched data.

Flash Translation Layer (FTL)

NAND cannot overwrite data in-place — pages must be erased (in blocks) before being re-written. The FTL maintains a logical-to-physical address mapping table, handles out-of-place writes, and tracks which physical pages are valid. For HBF, the FTL mapping table itself can consume gigabytes of SRAM and must be optimized for near-memory access patterns, which look very different from NVMe storage patterns.
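To make the out-of-place write concrete, here is a minimal page-level FTL sketch. The structure and function names (hbf_ftl, ftl_write) are hypothetical, and garbage collection, the multi-gigabyte mapping-table problem, and multi-channel parallelism are all omitted:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define INVALID_PPA UINT32_MAX

/* Minimal page-level FTL: every logical write goes to a fresh physical
 * page; the old physical page is only marked invalid (erased later by GC). */
struct hbf_ftl {
    uint32_t *l2p;        /* logical page -> physical page               */
    uint8_t  *valid;      /* does this physical page hold live data?     */
    uint32_t  n_pages;    /* physical pages in the device                */
    uint32_t  next_free;  /* naive append-only free-page cursor          */
};

int ftl_write(struct hbf_ftl *ftl, uint32_t lpa)
{
    if (ftl->next_free == ftl->n_pages)
        return -1;                          /* out of space: GC needed    */

    uint32_t old = ftl->l2p[lpa];
    if (old != INVALID_PPA)
        ftl->valid[old] = 0;                /* out-of-place: invalidate   */

    uint32_t ppa = ftl->next_free++;        /* program a fresh page       */
    ftl->l2p[lpa]   = ppa;
    ftl->valid[ppa] = 1;
    return (int)ppa;
}

uint32_t ftl_read(const struct hbf_ftl *ftl, uint32_t lpa)
{
    return ftl->l2p[lpa];                   /* physical page to fetch     */
}

int main(void)
{
    struct hbf_ftl ftl = {
        .l2p = malloc(1024 * sizeof(uint32_t)),
        .valid = calloc(1024, 1),
        .n_pages = 1024,
        .next_free = 0,
    };
    for (int i = 0; i < 1024; i++)
        ftl.l2p[i] = INVALID_PPA;

    ftl_write(&ftl, 7);     /* first write of logical page 7             */
    ftl_write(&ftl, 7);     /* overwrite: old page invalidated, new page
                               programmed elsewhere                       */
    printf("logical 7 -> physical %u\n", ftl_read(&ftl, 7));
    return 0;
}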

Prefetch and prediction

The critical latency-hiding mechanism for HBF is intelligent prefetching. Because an AI model's memory access pattern during inference is semi-predictable — attention heads access KV cache in structured strides, conversation state is accessed sequentially — the HBF controller can use small learned models or heuristic engines to predict which pages will be needed and load them into the on-die read cache before the GPU asks. This is the difference between raw NAND latency (~50 µs) and effective latency as seen by the GPU (~1–5 µs).
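To illustrate how simple the prediction core can be, here is a toy stride detector of the kind the paragraph describes. hbf_cache_fill() is a hypothetical stand-in for the controller's "load this NAND page into on-die SRAM" primitive, and a real controller would combine several such predictors (and possibly a small learned model):

#include <stdint.h>

#define PREFETCH_DEPTH 8

struct stride_prefetcher {
    uint64_t last_page;
    int64_t  last_stride;
    int      confidence;      /* consecutive hits of the same stride      */
};

/* Hypothetical controller primitive: issue a NAND read into the read cache. */
void hbf_cache_fill(uint64_t page)
{
    (void)page;               /* stub for the sketch                      */
}

/* Called on every demand read: if the last few KV-cache page requests
 * advance by a constant stride (as sequential attention reads do),
 * speculatively fill the next pages before the GPU asks for them.        */
void on_demand_read(struct stride_prefetcher *p, uint64_t page)
{
    int64_t stride = (int64_t)(page - p->last_page);

    if (stride != 0 && stride == p->last_stride)
        p->confidence++;
    else
        p->confidence = 0;

    if (p->confidence >= 2) {                  /* stable stride observed  */
        for (int i = 1; i <= PREFETCH_DEPTH; i++)
            hbf_cache_fill(page + (uint64_t)(stride * i));
    }

    p->last_page   = page;
    p->last_stride = stride;
}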

ECC and endurance

NAND flash has finite write endurance: QLC NAND can sustain roughly 300–1000 program/erase (P/E) cycles per cell. For an AI inference workload with heavy KV cache writes, wear leveling is not optional. The controller must distribute writes across all cells evenly, over-provision capacity for spare-area writes, and use LDPC (Low-Density Parity Check) codes to correct bit errors that accumulate as cells age.
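At its simplest, wear leveling means steering each new write toward the healthiest free block so that P/E cycles accumulate evenly. A toy allocation policy, with hypothetical structure names, makes the idea concrete:

#include <stdint.h>

/* Toy wear-leveling allocator: always program into the free block with
 * the lowest erase count, retiring blocks that hit the P/E limit.        */
struct hbf_block {
    uint32_t erase_count;     /* lifetime P/E cycles on this block        */
    uint8_t  is_free;
};

int pick_block_for_write(struct hbf_block *blocks, uint32_t n_blocks,
                         uint32_t pe_limit /* e.g. ~300 for QLC */)
{
    int best = -1;

    for (uint32_t i = 0; i < n_blocks; i++) {
        if (!blocks[i].is_free)
            continue;
        if (blocks[i].erase_count >= pe_limit)
            continue;                          /* retire worn-out block   */
        if (best < 0 || blocks[i].erase_count < blocks[best].erase_count)
            best = (int)i;
    }
    return best;               /* -1 means no healthy free block: run GC  */
}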

Key metric: write amplification

Write amplification (WA) measures how many actual NAND writes result from one logical write. WA = 1 is ideal; GC and over-provisioning push WA to 1.5–3× in practice. For HBF in inference, minimizing WA extends device life and reduces background GC interference with latency-sensitive reads.
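Concretely: if the runtime logically writes 1 TB of KV pages but GC relocations cause 1.8 TB of physical NAND programs, WA = 1.8. A trivial helper mirroring the live waf counter that the hypothetical sysfs interface later in this article exposes:

#include <stdio.h>

/* Write amplification factor: physical NAND bytes programmed per byte
 * the host logically wrote. WA = 1.0 is ideal; GC pushes it higher.      */
double write_amplification(double host_bytes_written,
                           double nand_bytes_programmed)
{
    return nand_bytes_programmed / host_bytes_written;
}

int main(void)
{
    /* Example: 1 TB of logical KV-cache writes, 1.8 TB actually programmed
     * after garbage-collection relocations.                              */
    double wa = write_amplification(1.0e12, 1.8e12);
    printf("WA = %.2f\n", wa);   /* prints WA = 1.80 */
    return 0;
}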

06 · Kernel Changes: What Linux Needs for HBF

HBF is not an NVMe SSD and not DRAM — it's a new memory class that falls between them. Current Linux kernels (5.x/6.x) have partial infrastructure for this via PMEM and CXL, but HBF will require targeted work across several subsystems.

Fig 5 — Linux kernel subsystem changes for HBF support
[Diagram: the user-space AI runtime (PyTorch, CUDA, ROCm, vLLM) crosses the syscall boundary / io_uring passthrough into the kernel, where the MM subsystem (NUMA node for HBF, mmap() on HBF regions, huge page support), the CXL/PMEM layer (DAX device interface, CXL.mem protocol, persistent memory model), the block + NVMe layer (multi-queue blk-mq, ZNS/ZBD zones, io_uring passthrough), and a new HBF driver (HBF bus protocol, latency hint API) are coordinated by a tiered memory manager (DAMON + NUMA balancing extended for HBF: hot pages → HBM, warm pages → HBF, cold pages → NVMe), sitting on the HBF hardware (NAND stack + in-package controller)]
▲ Linux will need changes across MM, CXL/PMEM, block layer, and a new HBF-specific driver. The tiered memory manager is the central coordination layer.

1. NUMA-aware tiered memory management

linux/mm

HBF should appear to the kernel as a NUMA node with a distinct memory latency class — distinct from HBM (fast) and NVMe (very slow). The kernel's NUMA balancing daemon needs to understand "medium-latency, high-capacity" as a valid migration target.

/* NUMA node config for HBF tier */
node_set_state(hbf_nid, N_MEMORY);
set_node_tier(hbf_nid, MEMORY_TIER_SLOW);   /* new tier enum */

2. DAMON-based hot/warm/cold page tracking

mm/damon

DAMON (Data Access MONitor) was merged in Linux 5.15 and provides lightweight access frequency tracking. For HBF, DAMON policies need extending to demote warm KV-cache pages from HBM → HBF and promote hot pages back, using async migration threads with HBF-specific latency budgets.

/* Extend DAMON for three-tier demotion */
damos_action = DAMOS_MIGRATE_HOT   → HBM
damos_action = DAMOS_MIGRATE_WARM  → HBF          /* new */
damos_action = DAMOS_MIGRATE_COLD  → NVMe / swap

3. DAX / mmap() with HBF latency annotations

fs/dax

AI runtimes like vLLM directly manage their KV-cache memory. They need a zero-copy mmap interface to HBF, similar to the PMEM DAX path, but with semantics that tolerate asynchronous reads. The MAP_HBF hint or a new memfd_create() flag could signal to the kernel that this mapping is HBF-backed and should use read-ahead caching differently from DRAM.

/* Proposed interface for runtime KV-cache on HBF */
fd = memfd_create("kv_cache", MFD_HBF | MFD_HUGETLB);
addr = mmap(NULL, sz, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_POPULATE, fd, 0);
madvise(addr, sz, MADV_HBF_PREFETCH);   /* new advice flag */

4. io_uring passthrough for sub-microsecond I/O

io_uring

For HBF accesses that go through the block layer (not via DAX mmap), the io_uring passthrough path (merged in 6.0) avoids the overhead of the full block layer queue. HBF drivers should expose a passthrough interface that lets the GPU driver submit read SQEs directly, bypassing VFS overhead and enabling polling-mode completions instead of interrupt-driven I/O.

/* io_uring HBF passthrough — polling mode */
struct io_uring_sqe sqe = {
    .opcode = IORING_OP_URING_CMD,
    .flags  = IOSQE_FIXED_FILE,
    .cmd_op = HBF_CMD_READ_PREFETCH,   /* vendor-defined */
};

5. Huge page and THP support for large KV allocations

mm/huge_memory

KV cache allocations are large and contiguous. Transparent Huge Pages (THP) reduce TLB pressure and improve NAND page-boundary alignment. The HBF driver should request 2MB huge pages from the allocator, mapping them to 2MB NAND erase-block-aligned regions to minimize write amplification from partial-block updates.
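A user-space sketch of the allocation pattern this implies, using the existing hugetlbfs and mbind() interfaces (link with -lnuma). The HBF NUMA node id is an assumption; a real runtime would discover it from the tiered-memory topology:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <numaif.h>          /* mbind(); link with -lnuma */
#include <stddef.h>

#define HBF_NODE   2                /* assumed NUMA node id of the HBF tier */
#define HUGE_2MB   (2UL << 20)      /* 2 MB huge page / aligned flash region */

/* Allocate a KV-cache arena from 2 MB huge pages and bind it to the HBF
 * NUMA node, so each huge page lands on one aligned flash region and
 * partial-block updates (and their write amplification) are avoided.     */
void *alloc_kv_arena(size_t bytes)
{
    size_t sz = (bytes + HUGE_2MB - 1) & ~(HUGE_2MB - 1);    /* round up */

    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    unsigned long nodemask = 1UL << HBF_NODE;
    if (mbind(p, sz, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        munmap(p, sz);
        return NULL;
    }
    return p;
}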

6. Wear-aware write scheduling in the block layer

block/

The block layer's I/O scheduler (mq-deadline, kyber) does not understand NAND wear. A new HBF-aware scheduler or elevator plugin should: batch writes to minimize GC overhead, co-schedule GC with idle inference periods, and export wear telemetry via sysfs so the AI orchestration layer can shift write load between HBF devices.

/* Hypothetical HBF wear telemetry sysfs */
/sys/class/hbf/hbf0/wear_indicator   # 0–100 wear percentage
/sys/class/hbf/hbf0/gc_pressure      # background GC intensity
/sys/class/hbf/hbf0/waf              # write amplification factor (live)
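On the orchestration side, a sketch of how a scheduler could consume that telemetry to steer new KV-cache writes toward the least-worn device. The sysfs paths are the hypothetical ones listed above, so treat this purely as an illustration of the intended control loop:

#include <stdio.h>

/* Read one integer attribute from the hypothetical HBF sysfs tree.       */
int hbf_read_attr(int dev, const char *attr)
{
    char path[128];
    int val = -1;

    snprintf(path, sizeof(path), "/sys/class/hbf/hbf%d/%s", dev, attr);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

/* Pick the HBF device with the lowest wear to absorb new KV-cache writes. */
int pick_least_worn_device(int n_devices)
{
    int best = -1, best_wear = 101;

    for (int d = 0; d < n_devices; d++) {
        int wear = hbf_read_attr(d, "wear_indicator");   /* 0–100 */
        if (wear >= 0 && wear < best_wear) {
            best_wear = wear;
            best = d;
        }
    }
    return best;
}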

07 · Why Inference Is HBF's Perfect Workload

Not every compute workload benefits equally from HBF. Training — which continuously reads and writes activations and gradients with extremely high bandwidth demand — is squarely in HBM's territory. But inference has a different access profile that plays to HBF's strengths.

The KV cache explosion

Modern LLMs use attention mechanisms that require caching key-value pairs for every token in the context window. For a 70B-parameter model with 128 attention heads running a 200K-token context, the KV cache can reach 50–100 GB per concurrent session. Multiply this by tens or hundreds of concurrent users, and the KV cache demand vastly exceeds HBM capacity. Without an overflow tier, the system must evict and recompute — wasting seconds of compute per request.
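For intuition about where that figure comes from, the per-session KV footprint is roughly 2 (K and V) × layers × KV heads × head dim × bytes per element × tokens. A back-of-the-envelope sketch with assumed values for a generic 70B-class model using grouped-query attention (not the configuration of any specific published model; caching all 128 heads without GQA would land several times higher):

#include <stdio.h>

/* KV-cache bytes = 2 (K and V) * layers * kv_heads * head_dim
 *                  * bytes_per_element * context_tokens                  */
int main(void)
{
    const double layers   = 80;       /* assumed decoder layers           */
    const double kv_heads = 8;        /* assumed grouped-query KV heads   */
    const double head_dim = 128;
    const double elem     = 2;        /* fp16 / bf16 bytes per element    */
    const double tokens   = 200000;   /* 200K-token context               */

    double per_token = 2 * layers * kv_heads * head_dim * elem;  /* ~328 KB */
    double total     = per_token * tokens;                       /* ~65 GB  */

    printf("KV cache per token: %.0f KB, per session: %.1f GB\n",
           per_token / 1e3, total / 1e9);
    return 0;
}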

Access pattern predictability

KV cache access in autoregressive generation is highly predictable: the model accesses token positions in sequence, with attention concentrating on recent and salient tokens. This is exactly the kind of pattern that an HBF prefetch engine can exploit — predicting which cache pages will be needed for the next attention layer and preloading them from NAND into the on-die SRAM buffer before the GPU asks.

Agentic and multi-turn workloads

As AI shifts from single-turn chat to long-running agentic workflows — where a model maintains context across hundreds of tool calls, searches, and sub-tasks — the state that must persist grows dramatically. Between turns, this state can be "parked" on HBF (non-volatile!) and retrieved without re-loading from slower storage or recomputing. This is HBF's killer advantage over DRAM: persistence eliminates cold-start latency for long-horizon tasks.

KV cache per session (70B, 200K ctx): 50–100 GB (HBM alone cannot hold 10+ concurrent sessions)
Context recompute cost saved: ~10× latency (prefetch from HBF vs full KV recompute)
HBF capacity advantage: 10–50× over HBM for the same package area
Demand crossover year (projected): 2038 (HBF demand to exceed HBM, per KAIST forecast)

08 · HBF Development Timeline

2022–2023
Academic groundwork + CXL memory research
Early papers on near-memory flash; CXL 2.0 spec enables memory-semantic access to non-DRAM; computational storage research explores in-storage processing concepts relevant to HBF controller design.
2024
AI inference bottleneck becomes visible at scale
HBM supply crunch forces hyperscalers to explore alternatives; vLLM and PagedAttention demonstrate KV cache pressure in production; industry interest in "memory-bandwidth flash" accelerates internally.
Feb 2026
KAIST HBF research briefing
Prof. Kim Jung-ho publicly frames HBF as the capacity layer of the AI memory stack; predicts HBF demand overtaking HBM from 2038 onward.
Mar 2026
SK hynix + SanDisk announce HBF standardization partnership
First public industry alliance specifically targeting global HBF standardization. Signal that JEDEC engagement is imminent.
Apr 2026
David Patterson endorsement at Dreamy Next
Turing laureate and RISC pioneer publicly frames memory as the next AI bottleneck and HBF as a key response; confirms active collaboration with semiconductor companies on HBF development.
2027–2028 (projected)
JEDEC HBF specification finalized
Standardized interface enables second-source competition and AI accelerator integration. Linux HBF driver upstreaming begins.
2029–2031 (projected)
First HBF-equipped AI accelerators
Next-generation GPU/NPU platforms include HBF alongside HBM; AI frameworks gain native HBF-aware KV cache management.
2038+ (projected)
HBF demand exceeds HBM
As inference workloads dominate AI compute spend and context lengths continue growing, HBF TAM surpasses HBM. Memory defines the AI era.

09 · The Verdict: Why This Matters

Every decade of computing has had a memory story. The 1990s were about adding L2 cache so CPUs didn't starve. The 2000s were about DDR scaling to meet multi-core demand. The 2010s were about NVMe replacing SATA and about in-memory databases straining DRAM. The 2020s are about HBM feeding tensor cores.

The 2030s will be about HBF enabling AI to remember.

The fundamental shift is this: AI is transitioning from a stateless compute problem (training: ingest data, update weights, discard activations) to a stateful inference problem (maintain context, accumulate memory, reason over history). Stateful AI needs a memory tier that is large, relatively fast, low-power, and persistent. DRAM is too small and too expensive. NVMe is too slow. HBF is built to be exactly what sits between them.

The technical challenges are real — latency hiding, wear management, standardization, kernel integration, write amplification, FTL overhead. None of them are insurmountable. The physics of NAND flash and the packaging innovation of TSV stacking are both mature enough to support HBF. The remaining work is controller intelligence, standards alignment, and software ecosystem development.

David Patterson has championed two of the most important architectural shifts in computing history: RISC, developed with John Hennessy, and the RISC-V open ISA. When he says memory is the next bottleneck and points to HBF as the answer, the industry listens. This time, the bottleneck is not about processing power. It is about giving AI the memory it needs to think at depth.

"If the CPU was the core in the PC era and low-power technology was the core in the smartphone era, memory will be the core of the AI era."

— Prof. Kim Jung-ho, KAIST, February 2026