1) The end of the “flat memory” era
For a long time, accelerator design benefited from a quiet simplification: all high-bandwidth memory was treated as essentially equivalent. That simplification made the memory model feel flat even when the packaging itself was very sophisticated.
The standard mental model was straightforward. Put one or more large compute dies in the center of the package. Surround them with HBM stacks. Let the memory controllers and on-package wiring do the rest. As long as every HBM stack sits very close to the compute die and the wiring remains short, dense, and highly synchronized, the compute fabric can treat the external memory system as a fast, almost uniform pool.
That assumption is now under pressure from all sides. Model sizes continue to expand. Activations, KV caches, embeddings, routing tables, and training state all compete for capacity. At the same time, it is no longer enough to improve peak FLOPs in isolation. The memory subsystem increasingly determines whether those FLOPs can be fed at all.
In other words, the future chip is not just a compute device. It is a hierarchy of storage and movement layers: on-die SRAM, near HBM, farther HBM reached through package structures, inter-chip fabrics, and eventually optical links that extend across the board and rack.
2) Why HBM hits a physical wall
High Bandwidth Memory works because it is radically different from a conventional, long-reach interface. The whole value proposition of HBM is tied to a very wide, very dense, very short connection. Samsung’s public HBM3E material, for example, advertises up to 1,180 GB/s per 12‑high stack at 9.2 Gbps, which gives a useful sense of the bandwidth class designers are operating in.[1]
That is not achieved with a narrow serialized link. It is achieved with a broad interface and a short physical reach. The memory wins by staying close.
What makes HBM powerful
- Ultra-wide interface
- Very high aggregate bandwidth
- Lower energy than long-reach serialized transport
- Excellent fit for bandwidth-hungry AI workloads
What makes HBM fragile
- Tight timing margins
- Signal integrity sensitivity
- Distance-dependent skew and loss
- Severe package layout constraints
The hidden consequence
- You cannot scale HBM count arbitrarily
- Capacity gets capped by geometry
- Compute scaling outruns memory scaling
- The package becomes the bottleneck
This is why “just add more HBM” is not a trivial suggestion. Once physical distance grows, electrical behavior changes. At that point, memory scaling stops being a pure logic problem and turns into a package physics problem.
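The geometric cap can be made concrete with a back-of-envelope sketch. Every dimension below (`die_edge_mm`, `hbm_pitch_mm`) is invented for illustration, not a vendor figure; the point is only that adjacency is a perimeter resource, not an area resource.

```python
# Back-of-envelope shoreline geometry (all dimensions invented for
# illustration): HBM stacks must sit along the compute die's edges,
# so the directly adjacent stack count is capped by perimeter, not area.

die_edge_mm = 26       # hypothetical reticle-class die edge length
hbm_pitch_mm = 12      # hypothetical stack footprint plus keep-out spacing

stacks_per_edge = die_edge_mm // hbm_pitch_mm      # integer stacks per edge
max_adjacent_stacks = 4 * stacks_per_edge          # four edges of shoreline
print(f"roughly {max_adjacent_stacks} directly adjacent stacks fit")
```

With these made-up numbers, the shoreline saturates at a handful of stacks no matter how much compute the die adds, which is exactly the capacity-capped-by-geometry problem described above.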
3) What a logic bridge actually is
The key architectural idea is to stop treating HBM adjacency as sacred. Instead of insisting that every HBM stack sit directly beside compute, a package can insert small bridge structures that restore and carry signals farther across the substrate or interposer.
The important part is not the phrase “logic bridge.” The important part is the role: keep the interface electrically viable over a longer path without collapsing into a slow off-package transport model.
Think of the bridge as
- a distributed signal extension element
- a package-native repeater / retiming structure
- something much closer to “smart wiring” than to a processor
- a way to preserve high-density electrical connectivity beyond the old reach limit
Do not think of the bridge as
- a PCIe switch
- a NIC
- a general packet router
- a central memory controller hub aggregating traffic from every stack
Intel’s public material on EMIB is useful here because it shows how mainstream advanced packaging has already moved toward high-density in-package bridges. Intel describes EMIB as an embedded multi-die interconnect bridge that enables high-bandwidth communication between dies in one package.[2], [3] TSMC’s CoWoS public pages make the complementary point from the interposer side: advanced package platforms are explicitly being built for logic chiplets plus HBM at scale.[4], [5]
So the conceptual move here is not science fiction. It is a natural extension of where advanced packaging already is: denser interconnect, larger package fabrics, more chiplets, more HBM, and more package-level structure standing between compute and memory.
4) What the bridge is not: why this is not just “a custom chip in the middle”
A very natural misunderstanding is to imagine one custom bridge die sitting in the middle like a little switch:
HBM ─┐
HBM ─┼── [Bridge Chip] ── Compute
HBM ─┘
That model is intuitive, but it is the wrong one for understanding why this can preserve near-HBM behavior. A central switching element would immediately raise questions about arbitration, buffering, serialization, and aggregated bottlenecks. It would behave less like HBM and more like a fabric endpoint.
The more accurate picture is distributed and geometric. Several bridge structures are embedded in the package, and different memory paths may traverse one or more of them. The compute chiplet still talks to HBM over dense package interconnect; the bridges simply allow some of those paths to go farther.
That distinction matters enormously. If the system collapsed onto a narrow serialized bottleneck, it would stop being HBM-like. The whole point is to preserve broad, local, electrical connectivity while changing the reach envelope.
5) The topology shift: from “compute surrounded by memory” to “compute embedded inside memory”
Once bridges exist, the package is no longer a ring of adjacent HBM around a center die. It becomes a true topology. Some stacks are one electrical neighborhood away. Others are two neighborhoods away. What matters is no longer just capacity; it is path.
This is the architectural shift that makes the entire conversation interesting: the package stops behaving like a flat pool and starts behaving like a memory fabric.
The practical consequence is simple but profound: a memory access now has a topology. There is “where” the line lives, not just “whether” the line exists.
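One hedged way to picture "a memory access now has a topology" is to tag each stack with a bridge hop count and derive a path-dependent cost. The hop counts and the `base_ns` / `per_hop_ns` values below are illustrative assumptions, not measured or vendor-published figures.

```python
from dataclasses import dataclass

# Illustrative topology model: each HBM stack records how many bridge
# traversals separate it from the compute chiplet. All costs are invented.

@dataclass(frozen=True)
class HbmStack:
    name: str
    hops: int  # 0 = directly adjacent; 1+ = reached through package bridges

def access_cost(stack: HbmStack, base_ns: float = 100.0,
                per_hop_ns: float = 15.0) -> float:
    """Relative cost: a base HBM access cost plus a per-bridge penalty."""
    return base_ns + stack.hops * per_hop_ns

package = [HbmStack("hbm0", 0), HbmStack("hbm1", 0),
           HbmStack("hbm2", 1), HbmStack("hbm3", 2)]

for s in sorted(package, key=access_cost):
    print(f"{s.name}: hops={s.hops}, relative cost {access_cost(s):.0f}")
```

The model is deliberately crude: the useful takeaway is only that identical capacity now comes with unequal paths, which is the "where, not just whether" point above.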
6) Bandwidth survives; latency becomes nuanced
The reason this architecture is powerful is that it can preserve the fundamental strength of HBM—massive local bandwidth—without immediately collapsing into the much slower economics of board-level or rack-level links.
Public product data gives the right scale for intuition. NVIDIA’s H200 public page advertises 141 GB of HBM3E at 4.8 TB/s, and NVIDIA’s later Blackwell material advertises up to 192 GB HBM3E and 8 TB/s of HBM bandwidth, with two reticle-limited dies connected by a 10 TB/s chip-to-chip link inside the GPU.[6], [7], [8]
Those numbers matter because they show the class of performance advanced packages are already trying to sustain. The whole point of an embedded bridge structure is to extend reach without falling out of this performance envelope.
| Interconnect style | Typical shape | Bandwidth model | Latency character | Best use |
|---|---|---|---|---|
| Direct HBM | Ultra-wide, very short electrical | Hundreds of GB/s to TB/s per stack class | Lowest external-memory latency in package | Near memory |
| HBM via package bridge | Still package-local, extended electrical path | Near-HBM class if designed well | Slightly higher; path-dependent | Farther on-package memory |
| Chip-to-chip die interconnect | Dedicated package/chiplet link | Very high but structured | Higher than direct local HBM path | Multi-die compute fabrics |
| Board/rack optical or network | Serialized, long reach | Scales with link count, not bus width | Orders of magnitude higher than on-package | Scale-out |
The delicate part is latency. Vendors do not usually publish neat tables that say, “one package bridge hop adds X nanoseconds.” So any exact hop cost figure should be treated as architectural reasoning rather than a vendor spec. Still, the qualitative picture is clear:
- direct-adjacent HBM is best;
- bridge-extended HBM can remain close enough to be extremely valuable;
- but it is no longer identical to direct HBM.
That is the transition that changes the programming and scheduling model.
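Since no vendor publishes a per-hop figure, the effect is best reasoned about with a simple average-cost model. The 100 ns and 130 ns values below are placeholders chosen only to show the shape of the tradeoff, not real latencies.

```python
# Weighted-average access cost when a fraction of external-memory traffic
# stays in near HBM and the rest crosses bridges to farther HBM.
# The latencies are placeholders, not vendor numbers.

def avg_access_ns(near_frac: float, near_ns: float = 100.0,
                  far_ns: float = 130.0) -> float:
    """Average of near-HBM and bridge-extended-HBM costs, by traffic mix."""
    return near_frac * near_ns + (1.0 - near_frac) * far_ns

for frac in (1.0, 0.8, 0.5):
    print(f"near-HBM fraction {frac:.0%}: average {avg_access_ns(frac):.0f} ns")
```

Whatever the true constants turn out to be, the structure is the same: average cost now depends on the placement mix, which is precisely what a flat memory model cannot express.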
7) Why this is basically NUMA inside the package
NUMA, at its core, means that not all memory is equally cheap to reach. CPU systems have lived with this for years: local DRAM is fast; remote-socket DRAM is slower. The same idea now moves inward, into the package itself.
If some HBM stacks are directly adjacent and others are reached through one or more bridge structures, then the memory system has:
- near HBM — the best external memory path;
- far HBM — still on-package and still fast, but no longer equivalent;
- path-dependent access costs — memory location now matters, even inside one accelerator package.
This is why “NUMA” is the right intuition. Not because the package suddenly behaves like a dual-socket server in absolute latency terms, but because memory locality and path length have become first-order concerns.
That matters even if the per-access delta is modest. AI workloads issue enormous numbers of memory accesses. Small differences repeated relentlessly become visible at system scale.
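The "small differences repeated relentlessly" point is just arithmetic, sketched below with invented numbers: a modest per-access penalty, multiplied by an AI-scale access rate, becomes a large amount of latency that the machine must hide with concurrency.

```python
# Illustrative arithmetic (all numbers invented): a small per-access
# bridge penalty times billions of accesses per second yields many
# seconds of accumulated latency per second of traffic to hide.

accesses_per_second = 2e9   # hypothetical far-HBM access rate
penalty_ns = 10             # hypothetical extra cost per bridge path

hidden_latency_s = accesses_per_second * penalty_ns * 1e-9
print(f"{hidden_latency_s:.0f} s of extra latency per second of traffic")
```

A result of 20 seconds per second simply means roughly 20-way overlap is needed just to absorb the delta, which is why even modest path differences are a first-order scheduling concern.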
8) The compute chiplet’s SRAM becomes the real shock absorber
Once external memory becomes non-uniform, the package needs a locality layer. That layer is not off-package and it is not distant HBM. It is the SRAM already sitting on the compute die: registers, L1/shared-memory structures, and especially the shared last-level cache.
One of the easiest mistakes to make when discussing giant memory fabrics is to focus only on HBM. But once the package stops being flat, on-die SRAM becomes even more strategically important. It is the layer that hides path variation.
This changes the role of SRAM from “helpful cache” to something more structural. SRAM becomes the place where the system tries to keep hot state, precisely because repeatedly paying far-HBM path costs is unnecessary if the workload has reuse.
What SRAM now does
- hides path length differences between near and far HBM
- absorbs temporal locality
- reduces repeated bridge traversal
- stabilizes performance when HBM topology is uneven
What this implies
- placement decisions matter more
- blocking/tiling strategies matter more
- runtime allocators need more awareness
- compilers and kernels can no longer assume “all HBM is equal”
Once you see the package this way, the chip stops looking like a monolithic GPU and starts looking like a carefully layered memory machine with compute embedded into the top of it.
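The value of blocking/tiling can be sketched with the classic tiled-matmul traffic bound: each operand tile is fetched from HBM once and then reused out of SRAM, so external traffic shrinks roughly in proportion to the tile size. The formula below is the standard 2n³/tile approximation; the matrix size and element width are illustrative choices.

```python
# Rough external-traffic model for an n x n matmul with square tiles
# (classic blocking bound, not a vendor model): operand traffic scales
# as 2 * n^3 / tile, plus n^2 for writing the output once.

def hbm_traffic_gb(n: int, tile: int, elem_bytes: int = 2) -> float:
    """Approximate GB moved between HBM and SRAM for a tiled matmul."""
    return (2 * n**3 / tile + n**2) * elem_bytes / 1e9

n = 8192                     # hypothetical matrix dimension
for tile in (1, 64, 256):    # tile=1 models no reuse at all
    print(f"tile={tile:>3}: ~{hbm_traffic_gb(n, tile):.1f} GB of HBM traffic")
```

The orders-of-magnitude drop as the tile grows is exactly the "shield" role described above: reuse captured in SRAM is traffic that never has to care whether the backing HBM is near or far.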
9) Why the logic bridge is not optical — and where optical actually belongs
Another common confusion is to associate any “extended interconnect” with optical technology. That is not the right fit inside this package layer. Optical links are excellent when distance grows and serialized transport becomes attractive. But this package problem is about preserving an extremely wide, extremely local electrical character.
UCIe is helpful context because it shows where industry standardization is heading for die-to-die interoperability. The UCIe Consortium publicly positions the standard as an open die-to-die interconnect for chiplets within advanced packages.[9], [10] That standard matters, but it is still conceptually different from saying that HBM itself should be converted into a 100G/200G/400G-style optical link.
| Question | Logic bridge answer | Optical answer |
|---|---|---|
| Distance regime | Millimeters to centimeters | Board to rack and beyond |
| Data shape | Preserve dense electrical paths | Serialize onto lanes |
| Goal | Keep HBM-like behavior alive | Scale reach and link count |
| Main tradeoff | Topology-sensitive latency inside package | Higher latency but far greater reach |
So the right hierarchy is:
On-die / on-package: dense electrical locality
Cross-die / package: chiplet links, bridges, package fabrics
Board / rack / cluster: serialized links, optics, network fabrics
A good way to say it is: electrical wins for locality; optical wins for scale. The bridge belongs to the first world, not the second.
10) Who actually manufactures the bridge structures?
There is usually no neat answer like “Company X sells a standard logic-bridge chip.” This is not that kind of market. In practice, bridge structures live inside advanced packaging ecosystems and are tightly coupled to the package stack, process, bump technology, interposer strategy, and chiplet floorplan.
Intel’s EMIB is the cleanest public example of an embedded bridge concept in mainstream packaging language. Intel explicitly describes EMIB as an embedded multi-die interconnect bridge for high-bandwidth package connectivity.[2], [3] TSMC’s CoWoS and 3DFabric material, meanwhile, makes clear that large interposer-based package integration for logic plus HBM is central to the AI/HPC roadmap.[4], [5]
That means the manufacturing reality spans several layers:
Architectural owner
The accelerator designer decides the system topology, memory map, die partitioning, and performance targets.
Examples: NVIDIA, AMD, hyperscalers, internal AI silicon teams.
Packaging / foundry owner
The foundry and advanced packaging ecosystem provide the actual interposer, bridge, bump, routing, and assembly capabilities.
Examples: Intel EMIB; TSMC CoWoS / 3DFabric; Samsung advanced packaging efforts.
Software visibility
This is the dangerous part: software rarely gets a clean, easy abstraction of all this topology unless vendors deliberately expose it.
Which is why topology-aware runtime design matters so much.
In other words, the bridge is best thought of as a packaging primitive, not a commodity device.
11) The next bottleneck is not hardware — it is orchestration
The hardware story is only half the story. Once memory becomes distributed inside the package, the system needs to decide what data belongs where. A topology-aware package without topology-aware software can still deliver disappointing behavior, simply because hot state may end up living in the wrong place.
This is where the future chip stops being just a packaging discussion and becomes a runtime, compiler, and scheduling discussion.
What naive software will do
- treat all HBM as flat
- allocate without locality goals
- cause repeated traffic to farther HBM regions
- leave bandwidth on the table due to topology-blind placement
What better software will do
- classify memory into near and farther regions
- place hot tensors and high-reuse state close to the compute domain that consumes them
- use SRAM deliberately as a shield
- reduce expensive path crossings and repeated far-memory fetches
That leads naturally to a future stack in which the compiler, runtime, allocator, kernel scheduler, and package topology all become entangled in a good way. The question is no longer just “can I fit the model?” It becomes:
- What should live in SRAM?
- What belongs in near HBM?
- What can be tolerated in farther HBM?
- Which compute domain should execute which portion of the graph?
- How much repeated traffic am I forcing across the package?
The answer will differ for training, inference, MoE routing, retrieval-heavy workloads, long-context decoding, and systems that aggressively pipeline activations or KV state. But the common idea is the same: data placement becomes a first-class optimization dimension.
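A minimal sketch of the "better software" behavior, under stated assumptions: a greedy allocator that places the highest-reuse tensors in near HBM first and spills the rest to bridge-reached HBM. All names, sizes, and reuse estimates are hypothetical, and the reuse score stands in for something a profiler or compiler would supply.

```python
from dataclasses import dataclass

# Hedged sketch of a topology-aware allocator. "reuse" is a hypothetical
# profiling estimate of accesses per byte; capacities are illustrative.

@dataclass
class Tensor:
    name: str
    size_gb: float
    reuse: float  # expected accesses per byte (e.g. from profiling)

def place(tensors: list[Tensor], near_capacity_gb: float) -> dict[str, str]:
    """Greedy placement: hottest tensors claim near HBM first."""
    placement, used = {}, 0.0
    for t in sorted(tensors, key=lambda t: t.reuse, reverse=True):
        if used + t.size_gb <= near_capacity_gb:
            placement[t.name] = "near_hbm"
            used += t.size_gb
        else:
            placement[t.name] = "far_hbm"
    return placement

tensors = [Tensor("kv_cache", 40, 9.0),
           Tensor("hot_weights", 30, 6.0),
           Tensor("cold_weights", 50, 1.0)]
print(place(tensors, near_capacity_gb=80))
```

A production allocator would also weigh bandwidth contention, migration cost, and lifetime, but even this greedy form captures the core shift: placement is now an optimization decision, not a side effect of allocation order.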
12) The next-generation accelerator, viewed as a whole
Put all the pieces together, and the outline of a next-generation AI chip becomes much clearer.
It is not just a bigger die. It is not merely “more HBM.” It is a hierarchy:
Registers / local execution state
↓
L1 / shared SRAM
↓
Large shared on-die SRAM / cache
↓
Near HBM (best package-local capacity)
↓
Farther HBM reached through package bridges
↓
Chip-to-chip package fabrics
↓
Board/rack fabrics, increasingly optical
↓
Remote memory and storage tiers
Public roadmaps already show the ingredients. HBM bandwidth and capacity continue to rise.[1], [6], [8] Package technologies keep moving toward larger, denser chiplet-scale integration.[2], [4], [5] UCIe shows that die-to-die interconnect is becoming an ecosystem-level concern, not a one-off trick.[9], [10] Blackwell shows how aggressively multi-die structures are now being pushed at the top end, including internal chip-to-chip links at enormous bandwidth.[7], [8]
The deepest implication is this: the leading-edge AI chip of the next few years is likely to be memory-centric in a much stronger sense than many software people currently appreciate. The hardware problem is not just compute density. It is how to keep useful state close enough, long enough, and cheap enough to sustain the model’s actual working set.
Once package designers break the old HBM adjacency assumption, the chip stops being a simple “compute plus memory” object. It becomes a local fabric with tiers, path lengths, and non-uniformity. That is why the right mental model is not “a giant GPU.” It is a layered memory machine with compute embedded into the top of it.
And that is why the next leap after advanced packaging will not just be better silicon. It will be better placement, better data movement, better hierarchy control, and better software that finally understands the package it is running on.
References
1. Samsung Semiconductor, HBM3E — public product page describing up to 9.2 Gbps and up to 1,180 GB/s bandwidth for 12-high HBM3E stacks.
2. Intel, EMIB Product Brief — public description of Embedded Multi-die Interconnect Bridge technology and its packaging role.
3. Intel, EMIB Technology Explained and Intel foundry packaging materials — public explanation of EMIB as an in-package high-density interconnect approach.
4. TSMC, CoWoS® — public overview of chip-on-wafer-on-substrate packaging for logic plus HBM and large AI/HPC packages.
5. TSMC, 3DFabric / advanced packaging materials — public details on the CoWoS family and heterogeneous integration for HPC systems.
6. NVIDIA, H200 — public product page describing 141 GB HBM3E and 4.8 TB/s bandwidth.
7. NVIDIA, Blackwell Architecture — public statement that Blackwell products use two reticle-limited dies connected by a 10 TB/s chip-to-chip interconnect.
8. NVIDIA Developer Blog, Inside NVIDIA Blackwell Ultra — public comparison table listing up to 192 GB HBM3E and 8 TB/s bandwidth for Blackwell-generation products.
9. UCIe Consortium, Specifications — public description of UCIe as a standardized die-to-die interconnect for chiplets.
10. UCIe Consortium, Home / resources — public material describing the open chiplet ecosystem and current specification evolution.
Note on latency: where this essay discusses the qualitative latency cost of package bridges, it is making an architectural inference rather than quoting a vendor-published “per-hop” spec. Public vendor materials clearly support the packaging direction; they do not usually publish a canonical hop-latency table for bridge-extended HBM paths.