Every essay in this corpus mentions HBM. None of them explains what it is at the physical level — how DRAM dies are stacked, what a TSV actually does, why the interposer exists, and why stacking is the only path to bandwidth density that silicon physics allows.
To understand why HBM exists, you need to understand what was wrong with DDR DRAM and why more speed could not be extracted from it.
Conventional DRAM — DDR4, DDR5, LPDDR5 — connects to the CPU or GPU through a memory bus on the PCB. The bus is a set of signal traces etched into the circuit board, running from the processor's memory controller pins to the DRAM module's edge connector. The bandwidth of this bus is determined by two numbers: the bus width (how many bits are transferred in parallel) and the transfer frequency (how fast each bit toggles).
DDR5 runs at up to 6,400 MT/s on a 64-bit channel, giving approximately 51.2 GB/s per channel. A CPU with two DDR5 channels delivers ~102 GB/s. Even four channels — the maximum for workstation CPUs — gives ~204 GB/s. Compare this to the 3.35 TB/s of an H100's HBM3: the four-channel configuration is roughly 16× slower, and a typical two-channel desktop about 33× slower.
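The gap follows from nothing more exotic than width-times-rate arithmetic. A minimal sketch in Python, using only the figures quoted above:

```python
# Back-of-envelope check: peak bandwidth = bus width (bits) x transfer rate / 8.
def channel_bandwidth_gbs(width_bits: int, rate_mts: float) -> float:
    """Peak bandwidth in GB/s for one memory channel."""
    return width_bits * rate_mts * 1e6 / 8 / 1e9

ddr5 = channel_bandwidth_gbs(64, 6_400)   # ~51.2 GB/s per channel
hbm_h100 = 3_350                          # GB/s, aggregate HBM3 on an H100

print(f"DDR5, one channel:   {ddr5:6.1f} GB/s")
print(f"DDR5, two channels:  {2 * ddr5:6.1f} GB/s ({hbm_h100 / (2 * ddr5):.0f}x below HBM)")
print(f"DDR5, four channels: {4 * ddr5:6.1f} GB/s ({hbm_h100 / (4 * ddr5):.0f}x below HBM)")
```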
Could DDR5 just run faster? This is where physics intervenes. The signal traces on a PCB behave as transmission lines, and at multi-gigahertz rates they stop looking like ideal wires: reflections at every via and connector, crosstalk between neighbouring traces, skew across the 64 parallel data lines, and frequency-dependent attenuation all eat into the timing margin. Compensating with stronger drivers, equalisation, and retraining costs power, and the energy spent per bit rises faster than the bandwidth gained.
By the early 2010s the industry had reached a conclusion: the PCB-based memory bus was at its practical bandwidth ceiling. Getting 10× more bandwidth required a fundamentally different packaging approach — not faster signalling, but more parallel signalling through a much shorter, much wider connection. The answer was vertical stacking.
A Through-Silicon Via (TSV) is exactly what the name says: a vertical electrical conductor that passes through the body of a silicon die from top to bottom. It is a hole etched through the silicon and filled with a conducting material (typically copper or tungsten) that creates a direct electrical connection between the top and bottom surfaces of the die.
The critical numbers: a TSV is approximately 5 µm in diameter, carries one signal, and can be pitched at 40–50 µm centre-to-centre. On a die edge of ~10mm, you can fit 200+ TSVs in a row. Across the full die area, thousands of TSVs are possible. HBM3e uses approximately 10,000 TSVs per die — creating a 1,024-bit parallel data bus through the stack, plus power, ground, and control signals.
Compare this to a DDR5 DIMM's 288 pins, of which only 64 carry data. An HBM stack's 1,024-bit interface provides 16× the data paths, each of which is shorter (through ~75 µm of silicon, not across centimetres of PCB), lower impedance, and lower capacitance. The result: an H100 draws its 3.35 TB/s from five such stacks, each occupying a silicon footprint of roughly 7.75mm × 11.87mm.
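The same arithmetic, written out as a sketch. The pitch and pin counts come from the paragraphs above; real floorplans reserve many TSVs for power, ground, and repair, so treat this as illustrative:

```python
# Rough count of vertical signal paths vs a DIMM's data pins.
TSV_PITCH_UM = 50          # centre-to-centre pitch quoted above (40-50 um)
DIE_EDGE_MM  = 10          # ~10 mm die edge

tsvs_per_row = int(DIE_EDGE_MM * 1_000 / TSV_PITCH_UM)   # ~200 in one row

HBM_DATA_BITS  = 1_024     # data bus per stack
DDR5_DATA_BITS = 64        # data bus per DIMM (ECC excluded)

print(f"TSVs per row on a 10 mm edge:  {tsvs_per_row}")
print(f"Data paths, HBM vs DDR5 DIMM:  {HBM_DATA_BITS // DDR5_DATA_BITS}x")
```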
An HBM stack is a vertical assembly of DRAM dies — up to 12 in HBM3e — plus a base logic die at the bottom. Each DRAM die contains the memory arrays and some local circuitry. The base logic die contains the stack's interface logic: the PHY that faces the interposer, built-in test and repair circuitry, and power distribution for the stack. The memory controllers themselves sit on the GPU side of the link.
The assembly process is precise to the micron. Each die is thinned from its as-fabricated thickness of ~700 µm down to ~50–75 µm using a process called back-grinding — essentially sanding the back of the die with a diamond-tipped grinder until the TSVs are exposed on the back surface. The thinned dies are then picked and placed on top of each other using thermocompression bonding, applying heat (~250–300°C) and pressure to form reliable micro-bump connections at each TSV landing pad.
The HBM stack is assembled. The GPU die is fabricated. The next problem: how do you connect them? The micro-bump pitch inside the HBM stack is ~40 µm. The signal density required to carry a 1,024-bit bus between the GPU and HBM at full bandwidth requires interconnects at similar pitch. A standard PCB cannot achieve this — the finest-pitch PCB traces are ~50 µm, but the routing density and signal integrity at that pitch over PCB-length distances are inadequate.
The solution is a silicon interposer — a passive silicon substrate that sits beneath both the GPU die and the HBM stacks, providing ultra-fine-pitch routing between them. Because the interposer is made of silicon (the same material as the dies), it can be lithographically patterned with trace widths and spacings of 1–5 µm — the same processes used to make chips — providing routing density 10–50× higher than the best organic PCB.
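To see what that density buys, consider how much edge width a 1,024-bit bus consumes at interposer pitch versus PCB pitch. A rough sketch, assuming (purely for illustration) one trace per signal on a single routing layer; real designs spread the bus across several layers:

```python
# Edge width consumed by a 1,024-bit bus at different trace pitches.
def bus_edge_width_mm(signals: int, pitch_um: float) -> float:
    """Edge length needed if every signal gets one trace at the given pitch."""
    return signals * pitch_um / 1_000

BUS = 1_024
print(f"Interposer, 2 um pitch:  {bus_edge_width_mm(BUS, 2):5.1f} mm")
print(f"Interposer, 5 um pitch:  {bus_edge_width_mm(BUS, 5):5.1f} mm")
print(f"Fine-pitch PCB, 50 um:   {bus_edge_width_mm(BUS, 50):5.1f} mm")
```

At PCB pitch a single stack's bus would consume more edge length than the entire GPU package; at interposer pitch it fits along one edge of the HBM die.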
The packaging architecture described above — GPU and HBM side-by-side on an interposer — is called 2.5D integration. The name reflects the geometry: dies are not fully stacked on top of each other (which would be true 3D), but they are connected at much shorter distances than a PCB allows, on a shared substrate that provides dense routing between them.
True 3D integration — stacking the GPU directly on top of HBM — is technically possible but practically difficult for a die the size of a GPU. The thermal density would be unmanageable: a GPU dissipating 700W on top of DRAM that is heat-sensitive (DRAM reliability degrades above ~85°C) would destroy the memory. The 2.5D approach keeps them thermally separate while achieving near-3D interconnect density through the interposer routing.
The bandwidth of an HBM stack is the product of bus width and per-pin transfer rate. HBM3e uses a 1,024-bit bus per stack at 6.4–9.6 GT/s per pin, which works out to 819 GB/s–1.2 TB/s per stack (1,024 bits × 9.6 GT/s ÷ 8 bits per byte ≈ 1.23 TB/s at the top speed bin).
Each HBM3e stack presents 16 independent channels, each 64 bits wide and split into two 32-bit pseudo-channels, each with its own row/column address state. This allows the GPU's memory controllers to issue up to 32 independent memory transactions per stack in parallel, which reduces effective access latency by overlapping address setup and data transfer across channels.
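The channel arithmetic is easy to verify. A short sketch using the organisation above and the 9.6 GT/s top speed bin from the table that follows:

```python
# Channel arithmetic for one HBM3e stack.
CHANNELS      = 16       # 64-bit channels per stack
CH_WIDTH_BITS = 64
PSEUDO_PER_CH = 2        # two 32-bit pseudo-channels per channel
PIN_RATE_GTPS = 9.6      # top HBM3e speed bin

bus_bits   = CHANNELS * CH_WIDTH_BITS                 # 1,024 bits
stack_gbs  = bus_bits * PIN_RATE_GTPS / 8             # ~1,229 GB/s per stack
per_pseudo = stack_gbs / (CHANNELS * PSEUDO_PER_CH)   # ~38 GB/s each

print(f"Bus width:          {bus_bits} bits")
print(f"Pseudo-channels:    {CHANNELS * PSEUDO_PER_CH}")
print(f"Stack bandwidth:    {stack_gbs:7.1f} GB/s")
print(f"Per pseudo-channel: {per_pseudo:7.1f} GB/s")
```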
| Generation | Released | Bus width | Speed (GT/s) | BW/stack | Max capacity | Notable use |
|---|---|---|---|---|---|---|
| HBM1 | 2015 | 1,024-bit | 1.0 | 128 GB/s | 4 GB (4-Hi) | AMD Fury X |
| HBM2 | 2016 | 1,024-bit | 2.0 | 256 GB/s | 8 GB (4-Hi) | NVIDIA V100 |
| HBM2e | 2019 | 1,024-bit | 3.6 | 461 GB/s | 16 GB (8-Hi) | NVIDIA A100, AMD MI250 |
| HBM3 | 2022 | 1,024-bit | 6.4 | 819 GB/s | 24 GB (8-Hi) | NVIDIA H100 SXM, AMD MI300X |
| HBM3e | 2024 | 1,024-bit | 6.4–9.6 | 819 GB/s–1.2 TB/s | 36 GB (12-Hi) | NVIDIA H200, AMD MI325X |
| HBM4 (projected) | 2026 | 2,048-bit | 8.0–12.0 | ~2 TB/s | 64+ GB | NVIDIA Vera Rubin, AMD MI400 |
The progression tells a clear story: bus width has been stable at 1,024 bits since HBM1 (limited by interposer routing density and micro-bump count), so bandwidth gains have come from speed increases and stack height increases. HBM4 breaks this pattern by doubling the bus width to 2,048 bits — which requires a new interposer generation and new micro-bump pitch — in addition to speed increases.
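Every BW/stack figure in the table is an instance of the same width-times-rate product. A quick check (the HBM4 row uses the projected baseline rate):

```python
# Reproduce the BW/stack column: bus width (bits) x per-pin rate (GT/s) / 8.
generations = {
    "HBM1":  (1024, 1.0),
    "HBM2":  (1024, 2.0),
    "HBM2e": (1024, 3.6),
    "HBM3":  (1024, 6.4),
    "HBM3e": (1024, 9.6),   # top speed bin
    "HBM4":  (2048, 8.0),   # projected baseline
}
for name, (width_bits, rate_gtps) in generations.items():
    print(f"{name:6s} {width_bits * rate_gtps / 8:7.1f} GB/s per stack")
```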
HBM3e supports 12-Hi stacking. Why not 16-Hi or 24-Hi? Four physical constraints cap stack height: package height, because each added die plus its bond layer pushes the stack toward the fixed thickness the package and heat spreader allow; heat extraction, because dies in the middle of the stack can only shed heat through their neighbours and DRAM reliability degrades above ~85°C; cumulative yield, because every additional die adds another full set of TSV micro-bump bonds that must all succeed for the stack to work; and mechanical integrity, because dies thinned to 50–75 µm warp and crack more readily as the stack grows taller.
| Memory type | BW per device | Capacity per device | Energy (pJ/bit) | Cost (US$/GB) | Ideal use case |
|---|---|---|---|---|---|
| HBM3e | 819 GB/s | 36 GB/stack | ~3.9 | $10–15 | AI accelerators needing maximum bandwidth at tight energy budgets |
| GDDR7 | ~192 GB/s/chip | 2–3 GB/chip | ~6 | $4–6 | Gaming GPUs, consumer inference at lower power |
| DDR5 | 51.2 GB/s/channel | 16–128 GB/DIMM | ~15 | $3–5 | CPU main memory — capacity-optimised, latency-tolerant workloads |
| LPDDR5X | ~68 GB/s/pkg | up to 64 GB | ~5 | $4–7 | Mobile AI, Apple Silicon unified memory, edge inference |
| CXL DRAM | ~50–80 GB/s | TBs per module | ~18 | $2–4 | Capacity expansion for large context windows, KV offload |
HBM is not universally better than DDR5 — it is specifically optimised for the scenario where bandwidth per watt and bandwidth per unit area matter more than capacity. At $10–15/GB, HBM costs 3–5× more per gigabyte than DDR5. For a CPU managing a database workload where 512 GB of memory is needed and bandwidth is not the bottleneck, DDR5 is the right choice. For an H100 running a 70B-parameter model where 3.35 TB/s of bandwidth is the binding constraint, HBM is the only viable architecture.
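The trade-off can be put in one line of arithmetic: what you pay per gigabyte versus what you get per gigabyte per second. A sketch using the mid-points of the ranges in the table above (prices are illustrative, not quotes):

```python
# Cost vs bandwidth, mid-points of the ranges in the comparison table.
parts = {
    #  name     $/GB   GB/s per device
    "HBM3e": (12.5, 819.0),
    "GDDR7": ( 5.0, 192.0),
    "DDR5":  ( 4.0,  51.2),
}
for name, (usd_per_gb, bw_gbs) in parts.items():
    print(f"{name:6s} ${usd_per_gb:5.2f}/GB   {bw_gbs:6.1f} GB/s per device")

# The capacity-optimised case: 512 GB of main memory.
print(f"512 GB in DDR5:  ~${512 * 4.0:,.0f}")
print(f"512 GB in HBM3e: ~${512 * 12.5:,.0f}")
```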
HBM4, targeting 2026–2027 deployment, doubles the bus width to 2,048 bits per stack and increases per-pin rates to 8–12 GT/s. The projected bandwidth per stack is approximately 2 TB/s, roughly 2.4× baseline HBM3e (819 GB/s) and about 1.7× the fastest 1.2 TB/s HBM3e stacks. Capacity per stack reaches 64+ GB through higher-density DRAM dies (enabled by sub-10 nm DRAM process nodes at SK Hynix and Samsung).
The interposer must also evolve. Routing a 2,048-bit bus from each HBM4 stack to the GPU die requires approximately twice the routing density of HBM3e-era interposers. TSMC's CoWoS (whose -S variant uses a full silicon interposer) and Intel's EMIB (Embedded Multi-die Interconnect Bridge, which embeds small silicon bridge dies in the package substrate instead of a full interposer) are the two primary advanced-packaging approaches competing for this next generation.
Beyond HBM4, the industry is exploring near-memory compute — placing processing elements inside the base logic die of the HBM stack, so that common operations (attention scoring, activation functions, simple reductions) can be performed at the full 2 TB/s bandwidth of the TSV bus rather than having data travel all the way to the GPU die for computation. This is a meaningful architectural shift: instead of moving all data to compute, move some compute to the data.
HBM is the best currently manufacturable answer to the memory bandwidth problem. But it is not the final answer. The 3.35 TB/s of an H100's HBM3 is still 200× slower than the on-chip SRAM bandwidth available to the same GPU's tensor cores. Every token generation still requires dragging 70 GB across that 200× gap.
HBM3e is better than HBM2e, which was better than HBM2. Each generation closes the gap somewhat — HBM4 at 2 TB/s per stack will help. But the arithmetic intensity of decode does not improve with HBM bandwidth. As long as autoregressive inference requires reading all model weights per token, and as long as those weights are measured in tens of gigabytes, the memory bandwidth problem will persist regardless of which HBM generation is installed.
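The persistence of the problem is visible in a single division: if every generated token must stream all resident weights from HBM, bandwidth sets a hard ceiling on tokens per second regardless of compute. A sketch under that assumption (single GPU, no batching, weights already reduced to the 70 GB figure used above; the 4.8 TB/s figure is the H200's published HBM3e bandwidth):

```python
# Bandwidth-bound ceiling on decode: bytes per second / bytes read per token.
WEIGHT_BYTES = 70e9        # ~70 GB of resident weights, as in the text

def max_tokens_per_s(hbm_tb_per_s: float) -> float:
    return hbm_tb_per_s * 1e12 / WEIGHT_BYTES

print(f"3.35 TB/s (H100, HBM3):  <= {max_tokens_per_s(3.35):.0f} tokens/s")
print(f"4.80 TB/s (H200, HBM3e): <= {max_tokens_per_s(4.80):.0f} tokens/s")
```

A faster HBM generation raises the ceiling proportionally, but it remains a ceiling set by the weight-streaming requirement, not by arithmetic throughput.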
Understanding HBM is not an end in itself — it is the foundation for understanding why architectures like wafer-scale SRAM, near-memory compute, and model quantisation are not optimisations but architectural necessities. The physics of silicon packaging, TSVs, and interposers determines the ceiling. Every inference architecture decision is made in the shadow of that ceiling.