AI Hardware · HBM · Memory Architecture · Chip Physics

HBM Explained: How High Bandwidth Memory Is Actually Built

Every essay in this corpus mentions HBM. None of them explains what it is at the physical level — how DRAM dies are stacked, what a TSV actually does, why the interposer exists, and why stacking is the only path to bandwidth density that silicon physics allows.

By Manish KL · April 2026 · ~18 min read · Hardware Essay
Abstract. HBM (High Bandwidth Memory) is not a faster version of DDR. It is a fundamentally different packaging architecture — one that solves the bandwidth problem by stacking multiple DRAM dies vertically, connecting them with thousands of Through-Silicon Vias (TSVs), and placing the entire stack millimetres from the GPU die on a silicon interposer. This essay explains the physics and engineering behind that construction: why planar DRAM hit a bandwidth wall, what TSVs are and how they work, how the interposer replaces the PCB for high-speed signals, what limits HBM stack height, and why HBM3e at 4.8 TB/s is still not enough for the workloads it serves.
- 4.8 TB/s — aggregate HBM3e bandwidth on an NVIDIA H200, vs. ~89 GB/s for dual-channel DDR5
- ~10,000 — TSVs per DRAM die in an HBM stack, vs. 288 pins on a DDR5 DIMM
- 12 — maximum DRAM die layers in an HBM3e stack (12-Hi)
- ~3.9 pJ — energy per bit transferred over HBM3e, vs. ~15 pJ/bit for DDR5
Contents
  1. The planar DRAM bandwidth wall
  2. Through-Silicon Vias: the vertical wire that changes everything
  3. The HBM stack: how dies are assembled
  4. The silicon interposer: why HBM cannot sit on a PCB
  5. 2.5D integration: what it means and why "2.5D" rather than 3D
  6. The channel architecture: 1,024-bit buses and memory controllers
  7. HBM generations: HBM1 to HBM3e, by the numbers
  8. What limits HBM: stack height, thermal, yield, and cost
  9. HBM vs. DDR5 vs. LPDDR5: the right tool for each job
  10. What comes after HBM3e
  11. HBM is not the destination

1. The planar DRAM bandwidth wall

To understand why HBM exists, you need to understand what was wrong with DDR DRAM and why more speed could not be extracted from it.

Conventional DRAM — DDR4, DDR5, LPDDR5 — connects to the CPU or GPU through a memory bus on the PCB. The bus is a set of signal traces etched into the circuit board, running from the processor's memory controller pins to the DRAM module's edge connector. The bandwidth of this bus is determined by two numbers: the bus width (how many bits are transferred in parallel) and the transfer frequency (how fast each bit toggles).

DDR5 runs at up to 6,400 MT/s on a 64-bit channel, giving approximately 51.2 GB/s per channel. A CPU with two DDR5 channels delivers ~102 GB/s. Even four channels — the maximum for workstation CPUs — gives ~205 GB/s. Compare this to the 4.8 TB/s of an H200's HBM3e: even the best DDR5 configuration is more than 20× slower.
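The arithmetic in this section reduces to a single formula — bus width times transfer rate. A minimal sketch (the function name is mine; the figures are from the text). Note that at the same 6.4 GT/s transfer rate, the entire per-stack advantage of HBM3e comes from bus width:

```python
# Peak bus bandwidth in GB/s: width (bits) x transfer rate (MT/s) / 8 bits/byte.
def peak_bw_gb_s(width_bits: int, rate_mt_s: float) -> float:
    return width_bits * rate_mt_s / 8 / 1000

ddr5_channel = peak_bw_gb_s(64, 6400)    # one DDR5-6400 channel
hbm3e_stack  = peak_bw_gb_s(1024, 6400)  # one HBM3e stack at 6.4 GT/s

print(f"DDR5 channel: {ddr5_channel:.1f} GB/s")    # 51.2
print(f"HBM3e stack:  {hbm3e_stack:.1f} GB/s")     # 819.2
print(f"ratio: {hbm3e_stack / ddr5_channel:.0f}x") # 16x, from width alone
```

Same signalling speed, 16× the width — which is exactly why the rest of this essay is about how to physically build a 1,024-bit bus.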

Could DDR5 just run faster? This is where physics intervenes. The signal traces on a PCB behave as transmission lines. At high frequencies:

1. Impedance mismatch. PCB traces have parasitic capacitance and inductance. As frequency rises, signal integrity degrades — reflections, crosstalk, and inter-symbol interference corrupt data. This is manageable at 6.4 GT/s but becomes intractable beyond ~12.8 GT/s without extremely exotic board materials and signal conditioning.

2. Pin count constraint. A DDR5 DIMM has 288 pins. Adding more bandwidth means either more pins or faster signalling — and adding pins means a larger physical connector and longer trace lengths, which worsen signal integrity further. There is a physical ceiling.

3. Power per bit. Long PCB traces must drive significant capacitive loads. Each bit transition dissipates energy in those traces. At DDR5 speeds, IO power can exceed memory array power. Increasing speed further increases power non-linearly. The energy budget runs out before the signalling budget does.

The conclusion was reached in the early 2010s: the PCB-based memory bus had reached its practical bandwidth ceiling. Getting 10× more bandwidth required a fundamentally different packaging approach — not faster signalling, but more parallel signalling through a much shorter, much wider connection. The answer was vertical stacking.

2. Through-Silicon Vias: the vertical wire that changes everything

A Through-Silicon Via (TSV) is exactly what the name says: a vertical electrical conductor that passes through the body of a silicon die from top to bottom. It is a hole drilled through silicon and filled with a conducting material (typically copper or tungsten) that creates a direct electrical connection between the top and bottom surfaces of the die.

Fig 1. Cross-section of a DRAM die showing Through-Silicon Vias (TSVs). Each TSV is a ~5 µm copper conductor passing through the ~75 µm silicon bulk; micro-bumps at the bottom surface connect to the TSV landing pads of the die below. An HBM3e die has approximately 10,000 TSVs, providing a 1,024-bit-wide parallel connection between stacked dies.

The critical numbers: a TSV is approximately 5 µm in diameter, carries one signal, and can be pitched at 40–50 µm centre-to-centre. On a die edge of ~10 mm, you can fit 200+ TSVs in a row. Across the full die area, thousands of TSVs are possible. HBM3e uses approximately 10,000 TSVs per die — creating a 1,024-bit parallel data bus through the stack, plus power, ground, and control signals.

Compare this to a DDR5 DIMM's 288 pins across the entire module. HBM has over 30× more signal paths, each of which is shorter (through ~75 µm of silicon, not across centimetres of PCB), lower impedance, and lower capacitance. The result is 819 GB/s to 1.2 TB/s from a single stack whose footprint is roughly 7.75 mm × 11.87 mm.
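The pitch and die-size figures above are enough to check that ~10,000 TSVs fit comfortably. A quick sketch — the square-grid layout is an idealisation (real designs cluster TSVs in dedicated columns), and the function names are mine:

```python
# TSV budget from geometry: vias at a given centre-to-centre pitch on a die
# of given size (figures from the text: 40-50 um pitch, ~7.75 x 11.87 mm die).
def tsvs_per_row(edge_mm: float, pitch_um: float) -> int:
    return int(edge_mm * 1000 / pitch_um)

row  = tsvs_per_row(10.0, 50.0)   # 200 along a 10 mm edge, as stated above
grid = tsvs_per_row(7.75, 50.0) * tsvs_per_row(11.87, 50.0)

print(row)    # 200
print(grid)   # ~36,000 grid sites -- ample headroom for ~10,000 actual TSVs
```

The point of the calculation: TSV count is not the binding constraint — a full-area grid at 50 µm pitch offers several times more sites than HBM3e actually uses.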

3. The HBM stack: how dies are assembled

An HBM stack is a vertical assembly of DRAM dies — up to 12 in HBM3e — plus a base logic die at the bottom. Each DRAM die contains the memory arrays and some local circuitry. The base logic die contains the memory controller, PHY interface, power management, and the interposer-facing connections.

Fig 2. HBM3e stack in 12-Hi configuration: twelve 24 Gb DRAM dies (12 × 24 Gb = 36 GB per stack) stacked above a base logic die (memory controller, PHY, power management, interposer I/O), all connected in parallel through shared TSV columns (~10,000 per die). Total stack height is roughly 720 µm — under 1 mm. Each pair of adjacent dies is bonded with micro-bumps (~40 µm pitch) at the TSV landing pads; the base logic die presents a single 1,024-bit interface to the silicon interposer below.

The assembly process is precise to the micron. Each die is thinned from its as-fabricated thickness of ~700 µm down to ~50–75 µm using a process called back-grinding — essentially sanding the back of the die with a diamond-tipped grinder until the TSVs are exposed on the back surface. The thinned dies are then picked and placed on top of each other using thermocompression bonding, applying heat (~250–300°C) and pressure to form reliable micro-bump connections at each TSV landing pad.

4. The silicon interposer: why HBM cannot sit on a PCB

The HBM stack is assembled. The GPU die is fabricated. The next problem: how do you connect them? The micro-bump pitch inside the HBM stack is ~40 µm. The signal density required to carry a 1,024-bit bus between the GPU and HBM at full bandwidth requires interconnects at similar pitch. A standard PCB cannot achieve this — the finest-pitch PCB traces are ~50 µm, but the routing density and signal integrity at that pitch over PCB-length distances are inadequate.

The solution is a silicon interposer — a passive silicon substrate that sits beneath both the GPU die and the HBM stacks, providing ultra-fine-pitch routing between them. Because the interposer is made of silicon (the same material as the dies), it can be lithographically patterned with trace widths and spacings of 1–5 µm — the same processes used to make chips — providing routing density 10–50× higher than the best organic PCB.
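The density claim can be put in numbers. A minimal sketch, assuming illustrative pitches within the ranges quoted above (~2 µm line / 2 µm space for lithographic interposer routing, ~50 µm line / 50 µm space for fine-pitch PCB):

```python
# Routing density: parallel traces per mm of routing channel at a given
# line pitch (line width + space). Pitch values are illustrative assumptions
# within the ranges stated in the text.
def traces_per_mm(pitch_um: float) -> float:
    return 1000.0 / pitch_um

interposer = traces_per_mm(4.0)    # ~2 um line / 2 um space, silicon litho
pcb        = traces_per_mm(100.0)  # ~50 um line / 50 um space, fine PCB

print(f"{interposer:.0f} vs {pcb:.0f} traces/mm -> {interposer / pcb:.0f}x")
```

A 25× density advantage at these assumed pitches sits squarely in the 10–50× range the text cites — and it is what makes routing several 1,024-bit buses across a few millimetres feasible at all.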

Fig 3. 2.5D integration: GPU die and HBM3e stacks co-mounted on a silicon interposer (diagram simplified to three stacks). The interposer provides the ultra-fine-pitch routing (1–5 µm traces) that carries a 1,024-bit channel from each HBM base logic die to the GPU's memory controllers; C4 solder bumps connect the interposer down to the organic substrate. The H200 SXM5 uses TSMC's CoWoS (Chip-on-Wafer-on-Substrate) process with 6 HBM3e stacks of 24 GB each, for 141 GB of usable capacity and 4.8 TB/s aggregate bandwidth.

5. 2.5D integration: what it means and why "2.5D" rather than 3D

The packaging architecture described above — GPU and HBM side-by-side on an interposer — is called 2.5D integration. The name reflects the geometry: dies are not fully stacked on top of each other (which would be true 3D), but they are connected at much shorter distances than a PCB allows, on a shared substrate that provides dense routing between them.

True 3D integration — stacking the GPU directly on top of HBM — is technically possible but practically difficult for a die the size of a GPU. The thermal density would be unmanageable: a GPU dissipating 700W on top of DRAM that is heat-sensitive (DRAM reliability degrades above ~85°C) would destroy the memory. The 2.5D approach keeps them thermally separate while achieving near-3D interconnect density through the interposer routing.

6. The channel architecture: 1,024-bit buses and memory controllers

The bandwidth of an HBM stack is determined by the product of bus width, transfer frequency, and data encoding. HBM3e uses:

HBM3e bandwidth calculation — per stack
Bus width: 1,024 bits (128 bytes per transfer)
Transfer rate: 6.4 GT/s (gigatransfers per second)
DDR encoding: data transferred on both rising and falling clock edges

Bandwidth/stack = 1,024 bits × 6.4 GT/s / 8 bits/byte
= 1,024 × 6.4 / 8 GB/s
= 819.2 GB/s per stack

H200 SXM5: 6 stacks × ~800 GB/s ≈ 4.8 TB/s aggregate
(NVIDIA's spec — the stacks run slightly below the 6.4 GT/s maximum,
and 6 × 24 GB of raw capacity is exposed as 141 GB)

Each HBM3-class stack divides its 1,024-bit interface into 16 independent 64-bit channels, each of which can operate as two 32-bit pseudo-channels with their own row/column address state. This allows the GPU's memory controllers to issue up to 32 independent memory transactions per stack in parallel — reducing effective access latency by overlapping address setup and data transfer across channels.
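Why independent channels help is easiest to see with a toy interleaving model. This is illustrative, not the JEDEC address map — the channel count, block granularity, and modulo mapping are all assumptions for the sketch:

```python
# Illustrative (not JEDEC-accurate) block interleaving across pseudo-channels:
# consecutive blocks of the address space rotate across channels, so a
# streaming read keeps every pseudo-channel busy simultaneously.
from collections import Counter

N_PSEUDO_CHANNELS = 32   # HBM3-era figure; varies by generation
BLOCK_BYTES = 256        # assumed interleave granularity

def pseudo_channel(addr: int) -> int:
    return (addr // BLOCK_BYTES) % N_PSEUDO_CHANNELS

# A 32 KiB sequential read spreads perfectly evenly: 4 blocks per channel.
hits = Counter(pseudo_channel(a) for a in range(0, 32 * 1024, BLOCK_BYTES))
print(set(hits.values()))
```

With every channel holding its own address state, the controller can have 32 row activations and data bursts in flight at once instead of serialising them behind one address bus.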

7. HBM generations: HBM1 to HBM3e, by the numbers

| Generation | Released | Bus width | Speed (GT/s) | BW/stack | Max capacity | Notable use |
|---|---|---|---|---|---|---|
| HBM1 | 2015 | 1,024-bit | 1.0 | 128 GB/s | 4 GB (4-Hi) | AMD Fury X |
| HBM2 | 2016 | 1,024-bit | 2.0 | 256 GB/s | 8 GB (4-Hi) | NVIDIA V100 |
| HBM2e | 2019 | 1,024-bit | 3.6 | 461 GB/s | 16 GB (8-Hi) | NVIDIA A100, AMD MI250 |
| HBM3 | 2022 | 1,024-bit | 6.4 | 819 GB/s | 24 GB (8-Hi) | NVIDIA H100 SXM, AMD MI300X |
| HBM3e | 2024 | 1,024-bit | 6.4–9.6 | 819 GB/s–1.2 TB/s | 36 GB (12-Hi) | NVIDIA H200, AMD MI325X |
| HBM4 (projected) | 2026 | 2,048-bit | 8.0–12.0 | ~2 TB/s | 64+ GB | NVIDIA Vera Rubin |

The progression tells a clear story: bus width has been stable at 1,024 bits since HBM1 (limited by interposer routing density and micro-bump count), so bandwidth gains have come from speed increases and stack height increases. HBM4 breaks this pattern by doubling the bus width to 2,048 bits — which requires a new interposer generation and new micro-bump pitch — in addition to speed increases.
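The story in the table is easy to verify: every BW/stack figure is just width × rate ÷ 8. A quick check (HBM3e at the top of its rate range; HBM4 figures are projections, as above):

```python
# Per-stack bandwidth for each generation: bus width (bits) x rate (GT/s) / 8.
GENERATIONS = [
    ("HBM1",  1024, 1.0),
    ("HBM2",  1024, 2.0),
    ("HBM2e", 1024, 3.6),
    ("HBM3",  1024, 6.4),
    ("HBM3e", 1024, 9.6),   # top of the 6.4-9.6 GT/s range
    ("HBM4",  2048, 8.0),   # projected
]
for name, width_bits, gt_s in GENERATIONS:
    print(f"{name:6s} {width_bits * gt_s / 8:7.1f} GB/s per stack")
```

Reading down the rate column shows where each generation's gain came from — pure speed until HBM3e, then the width doubling for HBM4.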

8. What limits HBM: stack height, thermal, yield, and cost

HBM3e supports 12-Hi stacking. Why not 16-Hi or 24-Hi? Four physical constraints cap stack height.

1. Thermal resistance. Heat generated inside the stack must cross multiple die layers and bonded interfaces before reaching the cooling solution above the package. Each silicon die layer adds ~0.5°C/W of thermal resistance. A 12-Hi stack already challenges the thermal budget of its hottest dies; at 16-Hi, parts of the stack would reach temperatures that compromise reliability and retention.

2. Signal integrity through TSVs. TSVs are not perfect conductors. Each introduces a small but non-zero parasitic capacitance (~10–40 fF) and inductance. As signals travel through more die-to-die junctions, these parasitics accumulate. At 12 layers the signal-integrity margin is manageable; beyond that, driver sizing and repeater insertion become necessary and costly.

3. Assembly yield. Each micro-bump junction is a potential failure point. With ~10,000 TSV connections per die interface, the probability of at least one defective joint in a 12-Hi stack is already non-trivial and must be managed through redundancy and post-bond testing. Adding more layers multiplies the defect probability.

4. Total stack height and packaging constraints. A 12-Hi HBM3e stack is approximately 720 µm tall. GPU dies on the same interposer are typically 700–900 µm thick after packaging. The combined heights determine the substrate-to-lid gap and the thermal-interface-material thickness; adding more DRAM layers changes the mechanical fit of the entire package assembly.
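The yield constraint compounds exponentially with layer count, which a two-line model makes vivid. The per-joint success probability here is an illustrative assumption, and real processes rely heavily on redundant TSVs and post-bond repair — this is the raw, pre-repair picture:

```python
# Compounding assembly risk: with per-joint success probability p and
# ~10,000 micro-bump joints per bonded interface, raw (pre-repair) stack
# yield falls exponentially with layer count. p = 0.999999 is illustrative.
def raw_stack_yield(p_joint: float, layers: int, joints: int = 10_000) -> float:
    return p_joint ** (layers * joints)

for layers in (8, 12, 16):
    print(f"{layers}-Hi: {raw_stack_yield(0.999999, layers):.1%}")
```

Even at one defective joint per million, every four additional layers shaves several points off raw yield — one reason each step up in stack height has arrived years apart.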

9. HBM vs. DDR5 vs. LPDDR5: the right tool for each job

| Memory type | BW/device | Capacity/device | Energy (pJ/bit) | Cost/GB | Ideal use case |
|---|---|---|---|---|---|
| HBM3e | 819 GB/s/stack | 36 GB/stack | ~3.9 | $10–15 | AI accelerators needing maximum bandwidth at tight energy budgets |
| GDDR7 | ~192 GB/s/chip | 2–4 GB/chip | ~6 | $4–6 | Gaming GPUs, consumer inference at lower power |
| DDR5 | 51.2 GB/s/channel | 16–128 GB/DIMM | ~15 | $3–5 | CPU main memory — capacity-optimised, latency-tolerant workloads |
| LPDDR5X | ~68 GB/s/pkg | up to 64 GB | ~5 | $4–7 | Mobile AI, Apple Silicon unified memory, edge inference |
| CXL DRAM | ~50–80 GB/s | TBs per module | ~18 | $2–4 | Capacity expansion for large context windows, KV cache offload |

HBM is not universally better than DDR5 — it is specifically optimised for the scenario where bandwidth per watt and bandwidth per unit area matter more than capacity. At $10–15/GB, HBM costs 3–5× more per gigabyte than DDR5. For a CPU managing a database workload where 512 GB of memory is needed and bandwidth is not the bottleneck, DDR5 is the right choice. For an H200 running a 70B-parameter model where 4.8 TB/s of bandwidth is the binding constraint, HBM3e is the only viable architecture.
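The "binding constraint" claim can be made concrete with a bandwidth-roofline sketch. Model sizes below are illustrative round numbers (70B parameters at 16-bit and 8-bit precision), and the bandwidth figure is NVIDIA's published aggregate spec:

```python
# Bandwidth-bound decode roofline: if every generated token must stream all
# model weights from HBM, then tokens/sec <= HBM bandwidth / weight bytes.
def decode_ceiling_tok_s(bw_gb_s: float, weights_gb: float) -> float:
    return bw_gb_s / weights_gb

H200_BW_GB_S = 4800.0   # aggregate HBM3e bandwidth (NVIDIA spec)

print(f"FP16 70B: {decode_ceiling_tok_s(H200_BW_GB_S, 140):.0f} tok/s max")
print(f"FP8  70B: {decode_ceiling_tok_s(H200_BW_GB_S, 70):.0f} tok/s max")
```

This ceiling holds regardless of how much compute the GPU has — which is exactly why memory bandwidth, not FLOPs, is the figure of merit for autoregressive decode.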

10. What comes after HBM3e

HBM4, targeting 2026–2027 deployment, doubles the bus width to 2,048 bits per stack and increases per-pin rates to 8–12 GT/s. The projected bandwidth per stack is approximately 2 TB/s — nearly a 2.5× improvement over HBM3e. Capacity per stack reaches 64+ GB through higher-density DRAM dies (enabled by sub-10 nm DRAM process nodes at SK Hynix and Samsung).

The interposer must also evolve. Routing a 2,048-bit bus from each HBM4 stack to the GPU die requires approximately twice the routing density of the HBM3e generation. TSMC's CoWoS-S (with a full silicon interposer) and Intel's EMIB (Embedded Multi-die Interconnect Bridge, which embeds small silicon bridges in the organic substrate rather than using a full interposer) are the two primary packaging approaches competing for this next generation.

Beyond HBM4, the industry is exploring near-memory compute — placing processing elements inside the base logic die of the HBM stack, so that common operations (attention scoring, activation functions, simple reductions) can be performed at the full 2 TB/s bandwidth of the TSV bus rather than having data travel all the way to the GPU die for computation. This is a meaningful architectural shift: instead of moving all data to compute, move some compute to the data.

11. HBM is not the destination

HBM is the best currently manufacturable answer to the memory bandwidth problem. But it is not the final answer. The 4.8 TB/s of an H200's HBM3e is still roughly two orders of magnitude slower than the aggregate on-chip SRAM bandwidth available to the same GPU's tensor cores. Every generated token still requires dragging 70 GB of weights across that gap.

HBM3e is better than HBM2e, which was better than HBM2. Each generation closes the gap somewhat — HBM4 at 2 TB/s per stack will help. But the arithmetic intensity of decode does not improve with HBM bandwidth. As long as autoregressive inference requires reading all model weights per token, and as long as those weights are measured in tens of gigabytes, the memory bandwidth problem will persist regardless of which HBM generation is installed.

Understanding HBM is not an end in itself — it is the foundation for understanding why architectures like wafer-scale SRAM, near-memory compute, and model quantization are not optimisations but architectural necessities. The physics of silicon packaging, TSVs, and interposers determines the ceiling. Every inference architecture decision is made in the shadow of that ceiling.

What this essay does not cover
This essay focuses on HBM's physical construction and bandwidth characteristics. It does not cover: the manufacturing process for TSV drilling and filling in detail, the specific ECC schemes used in HBM stacks, the memory controller architecture inside the base logic die, or the detailed comparison of CoWoS versus EMIB interposer technologies. The economics of HBM manufacturing — why SK Hynix holds a leading position and what that means for supply constraints — is also a separate story.