The next GPU scaling problem is not only compute. It is distance.
What is hybrid bonding?
Hybrid bonding directly bonds two semiconductor surfaces using dielectric bonding plus copper-to-copper bonding — eliminating solder microbumps as the primary die-to-die connection and enabling dramatically finer interconnect pitch.
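Conventional microbump stacking achieves interconnect pitches of roughly 40–100 µm. Hybrid bonding cuts that to 1–10 µm, a reduction of roughly 4× to 100× depending on which ends of the two ranges are compared. Because pads sit on a two-dimensional grid, connection density scales with the inverse square of pitch, so a 10× finer pitch yields about 100× more connections per square millimeter of bonded area. That density translates directly into bandwidth density. The shorter electrical path also reduces signal energy per bit and lowers latency.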
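The relationship between pitch and connection density can be checked with a few lines of arithmetic. This is a minimal sketch that assumes pads sit on a uniform square grid; the function name `pads_per_mm2` is illustrative, and real pad maps reserve area for power, ground, and keep-out zones.

```python
def pads_per_mm2(pitch_um: float) -> float:
    """Connections per mm² for a uniform square grid of pads at the given pitch."""
    pads_per_mm = 1000.0 / pitch_um  # 1 mm = 1000 µm
    return pads_per_mm ** 2          # pads along x times pads along y


# Pitches taken from the ranges quoted in the text.
for pitch_um in (100, 40, 10, 1):
    print(f"{pitch_um:>4} µm pitch -> {pads_per_mm2(pitch_um):>12,.0f} pads/mm²")
```

The quadratic term is the point: halving the pitch quadruples the number of connections that fit in the same bonded area.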
For GPUs, this matters because the scaling bottleneck is increasingly data movement, not compute. More tensor cores are only useful if the package can feed them with activations, weights, KV blocks, routing metadata, and synchronization signals at low energy and low latency.
Traditional microbump stacking

Die A
solder bump (~40–100 µm pitch)
underfill
solder bump
Die B

─────────────────────────────────

Hybrid bonding

Die A: copper pad + oxide
|||||||||||| (~1–10 µm pitch)
Die B: copper pad + oxide
TSMC's SoIC platform is its primary vehicle for ultra-high-density 3D chiplet integration. Compared with conventional packaging, it improves bandwidth density, power integrity, signal integrity, and thermal management — by removing the high-impedance solder path between dies entirely.
| Dimension | Solder microbump | Hybrid bonding (Cu–Cu) | Improvement |
|---|---|---|---|
| Interconnect pitch | 40–100 µm | 1–10 µm | ~4–100× finer pitch (pad density scales with the square) |
| Bandwidth density | ~1–5 Tbps/mm² | ~10–100 Tbps/mm² | ~10–20× higher |
| Energy per bit | ~5–20 pJ/bit | <1 pJ/bit feasible | ~10–50× lower |
| Latency | Package/board path latency | Sub-ns die-to-die | Order-of-magnitude lower |
| Footprint | Requires underfill gap | Near-zero z-height overhead | Dramatically smaller vertical stack |
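
The energy-per-bit row matters because link power is simply bandwidth multiplied by energy per bit. The sketch below uses the pJ/bit figures from the table; the 1 TB/s die-to-die bandwidth is an illustrative assumption, not a measured value, and `link_power_watts` is a hypothetical helper name.

```python
def link_power_watts(bandwidth_tbytes_per_s: float, energy_pj_per_bit: float) -> float:
    """Power spent moving data across a link at a sustained bandwidth."""
    bits_per_second = bandwidth_tbytes_per_s * 1e12 * 8   # TB/s -> bit/s
    return bits_per_second * energy_pj_per_bit * 1e-12    # pJ -> J


# Assumed sustained bandwidth (illustrative only).
bandwidth_tbytes_per_s = 1.0

# Energy figures from the comparison table above.
for label, pj_per_bit in (("microbump, low end", 5.0),
                          ("microbump, high end", 20.0),
                          ("hybrid bonding, upper bound", 1.0)):
    watts = link_power_watts(bandwidth_tbytes_per_s, pj_per_bit)
    print(f"{label:<28} {watts:6.0f} W")
```

At a fixed power budget the same arithmetic runs in reverse: lower energy per bit buys proportionally more bandwidth across the interface.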
Why GPUs need this
The monolithic GPU die is running into practical limits: reticle size, yield, cost, heat density, and routing complexity. Chiplets offer a path forward — but only if the links between them are dense, efficient, and low-latency enough to be invisible to software.
Multi-GPU system = GPU + link + GPU + link + GPU
Each GPU is its own device, with its own memory, scheduler, and locality. Communication between them is expensive, and its latency is visible to software.
Multi-die GPU fabric = compute tile + cache tile + IO tile + memory tile bonded densely enough to behave like one accelerator
The package itself becomes a compute fabric. The inter-tile interface is a packaging question, not a network question.
| Hybrid bonding benefit | Why GPU architects care |
|---|---|
| Higher bandwidth density | More wires per mm² between compute tiles, cache tiles, base dies, or IO dies — without adding package area. |
| Lower energy per bit | Shorter electrical paths with no solder parasitics; moving data inside the package can approach SRAM-class energy. |
| Lower latency | Signals cross a bonded Cu–Cu interface rather than board-level or interposer paths. Sub-nanosecond die-to-die access becomes feasible. |
| Smaller form factor | Compute, SRAM, IO, and control logic can be integrated tightly in Z-height without large underfill gaps. |
| Chiplet-based yield scaling | Each functional tile can be tested and known-good before bonding, improving package-level yield for large accelerators. |
Hybrid bonding is physical packaging, not the protocol
NVLink, Infinity Fabric, UCIe-class fabrics, and custom coherent interconnects operate above the physical package layer. Hybrid bonding does not replace those protocols — it gives them a denser and more efficient physical substrate.
This distinction matters because it clarifies what hybrid bonding does and does not solve. It collapses the physical distance between die interfaces and reduces energy per bit at that boundary. It does not automatically solve cache coherence, scheduling, or programming-model complexity. Those remain protocol and software problems.
Protocol layer
NVLink, Infinity Fabric, UCIe-like protocols, cache-coherent fabrics, custom AI fabrics

Electrical / PHY layer
SerDes, parallel die-to-die PHYs, clocking, equalization, power delivery

Physical packaging layer
substrate, silicon interposer, bridge, microbumps, hybrid bonding ← lives here
From GPU cards to GPU fabrics
Traditional GPU-to-GPU communication crosses several domain boundaries: die edge, package substrate, board-level copper, a switch or retimer, then the reverse path into the second package. Each boundary adds latency, power, and potential congestion.
Hybrid bonding collapses the innermost of those boundaries, the transition from die-level signaling to package-level routing, making that crossing far cheaper in energy and latency than any off-package link.
Traditional path

GPU die
↓
package / interposer
↓
board-level link / switch / retimer
↓
package / interposer
↓
GPU die

Hybrid-bonded path

GPU tile A
↓
ultra-dense Cu–Cu bonded interface
↓
GPU tile B

(sub-ns, <1 pJ/bit)
Why AI workloads make this urgent
Training and inference both stress communication in ways that compound as models scale. As tensor cores get faster, the bottleneck shifts toward feeding them, synchronizing them, and keeping memory structures close to compute.
Training stresses
Large-model training is dominated by collective communication. Tensor parallelism requires all-reduce operations in every forward and backward pass. Pipeline parallelism requires constant activation transfers between stages. Gradient synchronization and optimizer-state sharding add sustained inter-device data movement, and activation checkpointing can add more by re-running forward-pass communication during recomputation.
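To get a feel for how much traffic one all-reduce generates, the sketch below uses the standard cost of a bandwidth-optimal ring all-reduce, in which each device sends 2(N−1)/N times the payload. The ring algorithm is an assumption for illustration; real collective libraries choose among several algorithms, and the payload size and device count here are hypothetical.

```python
def ring_allreduce_bytes_per_device(payload_bytes: float, num_devices: int) -> float:
    """Bytes each device sends in a bandwidth-optimal ring all-reduce.

    The reduce-scatter phase and the all-gather phase each move
    (N - 1) / N of the payload per device.
    """
    if num_devices < 2:
        return 0.0  # nothing to exchange with a single device
    return 2 * (num_devices - 1) / num_devices * payload_bytes


# Hypothetical example: a 1 GiB tensor reduced across 8 devices.
payload = 1 * 2**30
sent = ring_allreduce_bytes_per_device(payload, num_devices=8)
print(f"each device sends {sent / 2**30:.2f} GiB per all-reduce")
```

Because the per-device volume approaches twice the payload as the device count grows, the cost of each byte crossing a link is multiplied across every layer and every step.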
Inference stresses
Inference at scale introduces its own communication pressure. KV cache must be read and written per token, per layer, and grows with context length. Prefill and decode are often split across accelerators. Mixture-of-experts models require dynamic expert routing across devices every forward pass.
Hybrid bonding is not about making a GPU prettier inside the package. It is about reducing the cost of coordination. In the AI era, coordination cost (moving activations, weights, KV blocks, routing decisions, and synchronization signals) is becoming one of the defining bottlenecks of accelerator infrastructure. Compute throughput continues to scale; data movement has not kept pace.
AMD MI300: the package as architecture
AMD's MI300 family is the most instructive real-world example of how advanced packaging is changing accelerator architecture. It is not a GPU in the traditional sense — it is a package-level system.
The MI300X integrates eight CDNA 3 accelerator compute dies (XCDs) and eight HBM3 stacks in a single package. The XCDs are 3D-stacked on four IO dies using hybrid bonding, and the IO dies and HBM stacks sit side by side on a silicon interposer. The MI300A variant replaces two of the XCDs with three Zen 4 CPU chiplets, creating an APU where CPU and GPU tiles share the same HBM3 pool through the in-package fabric rather than a PCIe link. The result is a near-elimination of the CPU-to-GPU PCIe transfer penalty that dominates latency in classical heterogeneous systems.
The inter-die links inside MI300 run over AMD's Infinity Fabric, carried within the package at bandwidths that would be impractical over any board-level connection. The compute tiles, memory controllers, and HBM stacks all communicate as if they were one large die, but each was manufactured separately and tested before assembly.
The deeper lesson is not just "AMD used chiplets." The lesson is that the package is becoming part of the architecture. The memory topology, the coherence domain, the tile placement, and the link structure are architectural decisions — they just happen to live in the packaging domain now.
Intel Foveros Direct and logic stacking
Intel's Foveros Direct 3D platform removes traditional solder microbumps entirely for die-to-die connections, using direct copper-to-copper hybrid bonding at sub-10 µm pitch. This enables a new category of active die stacking where compute logic sits vertically above another compute or control logic die with a bonded interface dense enough to share cache or register state across die boundaries.
The architectural implication is significant: with a dense enough bonded interface, the distinction between "one die with two functional blocks" and "two dies vertically bonded" starts to blur from a microarchitecture perspective. A CPU tile on top of a large SRAM tile can behave like a monolithic design with an enormous L2 cache, while being manufactured on two different process nodes optimized for their respective functions.
The most interesting GPU designs over the next few years may combine compute dies, SRAM dies, IO dies, base routing dies, and power-management dies in layered configurations — with hybrid bonding as the vertical glue that makes them coherent.
The realistic path is not "two GPUs stacked like pancakes"
Stacking a full hot GPU die directly on another full hot GPU die is thermally brutal. The power density of a high-end GPU compute die is already at the edge of what cooling systems can manage. Doubling that density in a vertical stack creates heat extraction challenges that are not easily solvable with current cooling technology. The more realistic pairings bond a hot compute tile to a cooler die: SRAM, a base die, or an IO tile.
Compute ↔ SRAM
Hybrid-bonded cache or SRAM tiles close to compute can dramatically improve locality for working sets, metadata, and KV-cache structures. A large SRAM tile bonded below compute dies can act as an enormous on-package L3 without occupying compute die area or adding off-package bandwidth pressure.
Compute ↔ base die
A base die can carry routing fabric, control logic, scheduling assist, IO multiplexing, power management circuitry, or package-level telemetry — all manufactured on a mature, cost-optimized process while the compute tiles use the most advanced node available.
Compute ↔ IO tile
Dense bonding to IO tiles could shorten electrical paths into NVLink-class links, PCIe/CXL blocks, or future optical IO engines — enabling high-bandwidth off-package fabric connections from a specialized die rather than burning valuable compute die area on SerDes circuitry.
More plausible near-term architecture direction
GPU compute tiles (leading-edge node)
bonded to
cache / SRAM tiles (SRAM-optimized node)
bonded to
routing / fabric / IO / control die (mature node)
The software consequence: locality becomes more complicated
A multi-die GPU fabric creates a new placement problem. Tiles will not all have uniform access to memory, cache, and IO, so software must either understand package-level locality or be shielded from it by a runtime that manages placement automatically.
This is structurally similar to NUMA in multi-socket CPU systems, or to CXL memory tiering in heterogeneous memory hierarchies. The hardware presents a topology with non-uniform latency and bandwidth characteristics; the software stack must decide how much of that topology to expose versus abstract.
Hybrid bonding reduces the penalty of crossing die boundaries, but it does not eliminate topology. A dense bonded fabric is still a fabric. It has placement constraints, routing contention, thermal limits, and power delivery boundaries. The software that runs on it — CUDA, ROCm, custom ML runtimes — will eventually need to model it.
The most sophisticated AI serving systems already track tensor locality, KV block placement, and compute affinity at a granular level. Package-level locality adds one more dimension to that problem — but it is a dimension that can, with the right runtime abstractions, be hidden from the model developer while still being exploited for performance.
Questions the runtime may need to answer
These are not hypothetical questions — they are the natural extension of scheduling decisions that high-performance AI serving systems already make at the cluster level, now applied at the package level:
- Where is the tensor in the package topology?
- Which tile owns the active KV block?
- Which compute tile is closest to the relevant HBM stack?
- Which tile has cache residency for this layer's weights?
- Which tile should run the next decode step?
- Which MoE expert should be placed near which memory region?
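One of these questions, which tile should run the next decode step, can be sketched as a nearest-idle-tile lookup. Everything here is hypothetical: the `Tile` type, the grid coordinates, and the use of Manhattan hop distance as a stand-in for access cost are assumptions for illustration, not any vendor's runtime API.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class Tile:
    name: str
    x: int  # column in an assumed 2D package grid
    y: int  # row in an assumed 2D package grid


def hop_distance(a: Tile, b: Tile) -> int:
    """Manhattan distance, used here as a crude proxy for access cost."""
    return abs(a.x - b.x) + abs(a.y - b.y)


def pick_decode_tile(compute_tiles: Sequence[Tile], kv_owner: Tile, busy: Set[str]) -> Tile:
    """Choose the idle compute tile closest to the tile holding the KV block."""
    candidates = [t for t in compute_tiles if t.name not in busy]
    if not candidates:
        raise RuntimeError("no idle compute tile available")
    # Break distance ties by name so the choice is deterministic.
    return min(candidates, key=lambda t: (hop_distance(t, kv_owner), t.name))


compute = [Tile("c0", 0, 0), Tile("c1", 1, 0), Tile("c2", 0, 1), Tile("c3", 1, 1)]
kv_owner = Tile("sram1", 1, 1)  # cache tile assumed to sit under c3

print(pick_decode_tile(compute, kv_owner, busy={"c3"}).name)
```

A production runtime would weigh measured bandwidth, contention, and thermal headroom rather than grid distance, but the shape of the decision is the same: a cost function over a known topology.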
The runtime that solves package-level locality well will extract meaningfully more performance from the same hardware than one that treats the package as a black box.
KV cache and long-context inference
KV cache grows with batch size, sequence length, number of layers, and hidden dimension. As context windows expand from 8K to 128K to 1M tokens, the placement and movement of KV blocks becomes an infrastructure problem that touches everything from memory hierarchy design to network topology to scheduling policy.
KV cache pressure
batch × seq_len × layers × hidden_dim
↑ grows with each scale axis
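The scaling can be made concrete with a small calculator. This sketch assumes one key vector and one value vector are stored per token, per layer, and per KV head. For standard multi-head attention `kv_heads × head_dim` equals the hidden dimension; with grouped-query attention it is smaller. The model shape below is hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(batch: int, seq_len: int, layers: int,
                   kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size in bytes (the factor of 2 covers keys and values)."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical model: 32 layers, 32 KV heads of dimension 128, 16-bit elements.
for seq_len in (8 * 1024, 128 * 1024, 1024 * 1024):
    size = kv_cache_bytes(batch=1, seq_len=seq_len, layers=32,
                          kv_heads=32, head_dim=128)
    print(f"{seq_len:>9,} tokens -> {size / 2**30:,.0f} GiB")
```

Every term is a plain multiplier, so going from 8K to 1M tokens of context multiplies the cache by 128 before batch size is even considered.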
A future hybrid-bonded package that places dedicated SRAM or cache tiles directly below decode compute tiles could reduce attention-related memory bandwidth pressure significantly — keeping hot KV blocks at sub-ns access latency rather than requiring repeated HBM reads per decode step.
Hybrid bonding and optical IO
As rack-scale and cluster-scale AI systems grow, copper signaling becomes expensive in both power consumption and reach. Driving signals over board traces at 100+ Gbps per lane requires substantial SerDes power. The natural direction is to bring electrical-to-optical conversion as close to the compute die as possible.
Hybrid bonding could enable GPU or fabric compute dies to attach directly to specialized optical IO tiles — with a very short copper electrical path that terminates at an electro-optic engine inside the same package. The result is a package that speaks native optical to the network fabric rather than converting at the module edge.
Package stack with co-packaged optics
GPU compute tiles
SRAM / cache / fabric die
HBM stacks
Optical IO tile / E-O engines
↓
optical fiber to rack/cluster fabric
A possible future GPU package
The future AI GPU may look less like a single chip and more like a miniature data center inside one advanced package — with each functional layer manufactured on its optimal process node and integrated through hybrid bonding.
HBM HBM HBM
| | |
+--------------------------------------+
| Package-Level Fabric |
| |
| +----------+ +----------+ |
| | Compute | | Compute | |
| | Tile 0 | | Tile 1 | |
| +----------+ +----------+ |
| || hybrid bonding (Cu–Cu) || |
| +------------------------------+ |
| | SRAM / KV / Routing Layer | |
| +------------------------------+ |
| || hybrid bonding (Cu–Cu) || |
| +------------------------------+ |
| | IO / Power / Control Die | |
| +------------------------------+ |
| |
+--------------------------------------+
| | |
optical / NVLink / PCIe / CXL
The hard problems
Hybrid bonding is powerful but unforgiving. It changes the constraints of architecture, manufacturing, test, thermals, and software in ways that are more demanding than traditional packaging.
Thermals
Stacking active logic increases heat density significantly. Compute-on-compute stacking is especially difficult for high-power GPU dies already operating near the limits of available cooling. Effective heat extraction requires careful die-level thermal design and co-design with the cooling system.
Power delivery
Dense vertical integration complicates current delivery paths. IR drop, electromigration, and power noise all become harder to manage when large currents must travel through narrow vertical connections into stacked dies with limited decoupling options.
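A rough sense of scale comes from Ohm's law. The sketch below computes supply current as power divided by voltage and IR drop as current times path resistance. The 700 W, 0.8 V, and 50 µΩ values are illustrative assumptions, not figures for any specific product.

```python
def supply_current_amps(power_w: float, supply_v: float) -> float:
    """Current the power delivery network must carry: I = P / V."""
    return power_w / supply_v


def ir_drop_volts(current_a: float, path_resistance_ohm: float) -> float:
    """Voltage lost across the delivery path: V = I * R."""
    return current_a * path_resistance_ohm


# Illustrative assumptions only.
power_w = 700.0
supply_v = 0.8
path_resistance_ohm = 50e-6  # 50 micro-ohms of effective path resistance

current_a = supply_current_amps(power_w, supply_v)
drop_v = ir_drop_volts(current_a, path_resistance_ohm)
print(f"current: {current_a:.0f} A")
print(f"IR drop: {drop_v * 1000:.1f} mV ({drop_v / supply_v:.1%} of supply)")
```

At low core voltages the current is large enough that even tens of micro-ohms of extra resistance in a vertical path consume a meaningful share of the voltage margin.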
Yield and known-good-die
Bonded failures can destroy expensive package value. Known-good-die testing becomes essential — and challenging, because testing a bare die at full speed and temperature before bonding requires infrastructure that did not previously exist at scale in the packaging supply chain.
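The value of known-good-die testing shows up in simple yield arithmetic. The sketch below uses the Poisson die-yield model, Y = exp(−A·D), and treats each bond as an independent step with its own yield. The defect density, die area, tile count, and bond yield are all hypothetical, and the known-good-die case assumes a perfect pre-bond test with no escapes.

```python
import math


def poisson_die_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies with zero defects under the Poisson yield model."""
    return math.exp(-area_mm2 * defects_per_mm2)


def package_yield(num_tiles: int, tile_yield: float,
                  bond_yield: float, known_good_die: bool) -> float:
    """Probability that every tile in the package is good and bonded correctly."""
    # With known-good-die testing, defective tiles are screened out before
    # bonding, so only the bonding step can fail.
    per_tile = bond_yield if known_good_die else tile_yield * bond_yield
    return per_tile ** num_tiles


# Hypothetical numbers for illustration.
total_area_mm2 = 800.0
defects_per_mm2 = 0.001
num_tiles = 8
bond_yield = 0.99

tile_yield = poisson_die_yield(total_area_mm2 / num_tiles, defects_per_mm2)
print(f"monolithic die yield:      {poisson_die_yield(total_area_mm2, defects_per_mm2):.2f}")
print(f"single tile yield:         {tile_yield:.2f}")
print(f"package yield without KGD: {package_yield(num_tiles, tile_yield, bond_yield, False):.2f}")
print(f"package yield with KGD:    {package_yield(num_tiles, tile_yield, bond_yield, True):.2f}")
```

Without screening, the package inherits the monolithic die's yield and then loses more to bonding; with screening, package yield depends only on the bond step, which is why bond yield per interface becomes the number to watch.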
Co-design complexity
Floorplan, package, power grid, clocking, cache hierarchy, and fabric topology must all be designed together. The traditional separation between chip design and package design breaks down when the package is part of the architecture.
Software topology exposure
The runtime may need to expose package-level locality without overwhelming developers with new NUMA-like complexity. The right abstraction layer — something that hides topology where possible and exposes it where performance demands it — does not yet exist in a mature form for GPU AI workloads.
Supply chain and capacity
Advanced packaging capacity is already a bottleneck in AI accelerator production. Hybrid bonding requires specialized bonding equipment, ultra-clean surfaces, and careful process control, none of which is yet available at the scale that AI demand requires. Architecture must meet manufacturing reality, not just physical possibility.
Hybrid bonding turns packaging into architecture. In the AI era, the GPU is no longer just a die. It is becoming a bonded fabric of compute, memory, cache, IO, and control.
The thesis
GPU-to-GPU hybrid bonding marks the beginning of a transition from multi-GPU communication to multi-die GPU fabrics. The near-term use case is not two full GPU dies stacked directly on top of each other: the thermal constraints are too severe and the yield challenges too significant for that to be practical at high volume.
The more practical and powerful direction is compute tiles bonded to SRAM tiles, base fabric dies, IO tiles, memory-controller tiles, and neighboring compute tiles — each manufactured on its optimal process node, integrated into a coherent package-level system through dense hybrid bonding.
Over time, this could produce AI accelerators where the package itself becomes a miniature data center: compute, memory, routing, IO, and control logic integrated into a tightly bonded physical fabric. AMD's MI300 is an early and imperfect version of this idea. What follows will be more aggressive.
The question of who wins the next generation of AI accelerator infrastructure is not simply a question of who has the best tensor core or the most HBM bandwidth.
The winner may be the company that can move data across compute tiles at the lowest energy, lowest latency, and highest density — while exposing a clean enough programming model that the software stack can exploit it.
That is a packaging problem, a microarchitecture problem, a runtime problem, and a systems software problem at once. The companies that treat it as all four simultaneously will have a significant and durable advantage.
Selected sources
References for the technology background and industry context in this piece.
- TSMC SoIC / 3DFabric: https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/SoIC.htm
- AMD CDNA 3 / MI300 architecture white paper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
- MI300A exascale APU (ISCA 2024): https://dci.dci-gitlab.cines.fr/webextranet/_downloads/484384c0c918544deb72dfe528d6affa/2024-isca-MI300A-exascale.pdf
- Intel Foveros Direct 3D technical brief: https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2025-11/foveros-direct-3d-tech-brief.pdf
- Samsung advanced packaging overview: https://semiconductor.samsung.com/foundry/advanced-package/