Architecture Essay

The Next-Generation AI Chip:
From Flat HBM to an On‑Package NUMA Fabric

The biggest shift in AI silicon is no longer just more tensor cores or more FLOPs. It is the transformation of memory from a flat, uniform pool into a structured, topology-aware fabric with near HBM, far HBM, large SRAM layers, high-density package bridges, and optical fabrics above the package.

HBM3E-class bandwidth · Distributed bridges · On-package NUMA · SRAM locality layer · Optical above the package

Core thesis

The next frontier accelerator is best understood not as “a bigger GPU,” but as a memory system with compute embedded inside it. Once package designers break the old adjacency constraint of HBM using embedded bridge structures, the chip stops behaving like a flat memory machine and starts behaving like a NUMA system in miniature.

  • Bandwidth: ~1.18 TB/s per Samsung HBM3E 12-high stack at 9.2 Gbps
  • Modern scale: 8 TB/s max HBM bandwidth (NVIDIA Blackwell)
  • Topology shift: near ≠ far; HBM becomes path-dependent
  • Packaging: bridge + interposer (EMIB / CoWoS / chiplet era)

1) The end of the “flat memory” era

For a long time, accelerator design benefited from a quiet simplification: all high-bandwidth memory was treated as essentially equivalent. That simplification made the memory model feel flat even when the packaging itself was very sophisticated.

The standard mental model was straightforward. Put one or more large compute dies in the center of the package. Surround them with HBM stacks. Let the memory controllers and on-package wiring do the rest. As long as every HBM stack sits very close to the compute die and the wiring remains short, dense, and highly synchronized, the compute fabric can treat the external memory system as a fast, almost uniform pool.

That assumption is now under pressure from all sides. Model sizes continue to expand. Activations, KV caches, embeddings, routing tables, and training state all compete for capacity. At the same time, it is no longer enough to improve peak FLOPs in isolation. The memory subsystem increasingly determines whether those FLOPs can be fed at all.

The real bottleneck is shifting from “how many operations can the chip do?” to “how much useful state can the package hold, and how intelligently can the system move that state?”

In other words, the future chip is not just a compute device. It is a hierarchy of storage and movement layers: on-die SRAM, near HBM, farther HBM reached through package structures, inter-chip fabrics, and eventually optical links that extend across the board and rack.

2) Why HBM hits a physical wall

High Bandwidth Memory works because it is radically different from a conventional, long-reach interface. The whole value proposition of HBM is tied to a very wide, very dense, very short connection. Samsung’s public HBM3E material, for example, advertises up to 1,180 GB/s per 12‑high stack at 9.2 Gbps, which gives a useful sense of the bandwidth class designers are operating in.[1]

That is not achieved with a narrow serialized link. It is achieved with a broad interface and a short physical reach. The memory wins by staying close.
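As a sanity check on that bandwidth class, the headline stack figure falls straight out of width times pin rate. A minimal sketch, assuming the 1024-bit per-stack interface that HBM generations are commonly described with (the 9.2 Gbps rate is from Samsung's public HBM3E figures):

```python
# Back-of-envelope HBM stack bandwidth: interface width (bits) x pin rate / 8.
# Assumes a 1024-bit interface per stack, the widely reported HBM organization.

def stack_bandwidth_gbps(interface_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth of one HBM stack in GB/s."""
    return interface_bits * pin_rate_gbps / 8

bw = stack_bandwidth_gbps(1024, 9.2)
print(f"{bw:.1f} GB/s per stack")  # ~1177.6 GB/s, i.e. the ~1,180 GB/s class
```

The arithmetic is the point: the bandwidth comes from width, and width is exactly what long reach destroys.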

What makes HBM powerful

  • Ultra-wide interface
  • Very high aggregate bandwidth
  • Lower energy than long-reach serialized transport
  • Excellent fit for bandwidth-hungry AI workloads

What makes HBM fragile

  • Tight timing margins
  • Signal integrity sensitivity
  • Distance-dependent skew and loss
  • Severe package layout constraints

The hidden consequence

  • You cannot scale HBM count arbitrarily
  • Capacity gets capped by geometry
  • Compute scaling outruns memory scaling
  • The package becomes the bottleneck

Figure 1 — The classical HBM assumption
[Diagram: a central compute die ringed by directly adjacent HBM stacks.]
Classical HBM packaging works beautifully as long as all stacks remain physically close. The problem is that this geometry does not scale indefinitely.

This is why “just add more HBM” is not a trivial suggestion. Once physical distance grows, electrical behavior changes. At that point, memory scaling stops being a pure logic problem and turns into a package physics problem.

3) What a logic bridge actually is

The key architectural idea is to stop treating HBM adjacency as sacred. Instead of insisting that every HBM stack sit directly beside compute, a package can insert small bridge structures that restore and carry signals farther across the substrate or interposer.

The important part is not the phrase “logic bridge.” The important part is the role: keep the interface electrically viable over a longer path without collapsing into a slow off-package transport model.

Think of the bridge as

  • a distributed signal extension element
  • a package-native repeater / retiming structure
  • something much closer to “smart wiring” than to a processor
  • a way to preserve high-density electrical connectivity beyond the old reach limit

Do not think of the bridge as

  • a PCIe switch
  • a NIC
  • a general packet router
  • a central memory controller hub aggregating traffic from every stack

Intel’s public material on EMIB is useful here because it shows how mainstream advanced packaging has already moved toward high-density in-package bridges. Intel describes EMIB as an embedded multi-die interconnect bridge that enables high-bandwidth communication between dies in one package.[2], [3] TSMC’s CoWoS public pages make the complementary point from the interposer side: advanced package platforms are explicitly being built for logic chiplets plus HBM at scale.[4], [5]

So the conceptual move here is not science fiction. It is a natural extension of where advanced packaging already is: denser interconnect, larger package fabrics, more chiplets, more HBM, and more package-level structure standing between compute and memory.

4) What the bridge is not: why this is not just “a custom chip in the middle”

A very natural misunderstanding is to imagine one custom bridge die sitting in the middle like a little switch:

HBM ─┐
HBM ─┼── [Bridge Chip] ── Compute
HBM ─┘

That model is intuitive, but it is the wrong one for understanding why this can preserve near-HBM behavior. A central switching element would immediately raise questions about arbitration, buffering, serialization, and aggregated bottlenecks. It would behave less like HBM and more like a fabric endpoint.

The more accurate picture is distributed and geometric. Several bridge structures are embedded in the package, and different memory paths may traverse one or more of them. The compute chiplet still talks to HBM over dense package interconnect; the bridges simply allow some of those paths to go farther.

Figure 2 — The distributed-bridge mental model
[Diagram: a compute die plus HBM stacks, some reached directly and others through bridge elements distributed across the package.]
The useful mental model is not a central switch, but a distributed package fabric in which some HBM paths are direct and others are extended by local bridge elements.

That distinction matters enormously. If the system collapsed onto a narrow serialized bottleneck, it would stop being HBM-like. The whole point is to preserve broad, local, electrical connectivity while changing the reach envelope.

5) The topology shift: from “compute surrounded by memory” to “compute embedded inside memory”

Once bridges exist, the package is no longer a ring of adjacent HBM around a center die. It becomes a true topology. Some stacks are one electrical neighborhood away. Others are two neighborhoods away. What matters is no longer just capacity; it is path.

This is the architectural shift that makes the entire conversation interesting: the package stops behaving like a flat pool and starts behaving like a memory fabric.

Figure 3 — Before and after
[Diagram: left, the flat HBM era, with a compute die tightly ringed by equivalent HBM stacks; right, the expanded fabric era, with additional HBM stacks reached through multiple distributed bridge elements.]
Left: all HBM is treated as roughly equivalent because it is tightly packed around compute. Right: the package grows into a fabric with path-dependent reach, and therefore path-dependent behavior.

The practical consequence is simple but profound: a memory access now has a topology. There is “where” the line lives, not just “whether” the line exists.
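The "access now has a topology" idea can be made concrete with a toy model in which each stack carries its bridge-hop count. Everything here — stack names, hop counts, capacities — is illustrative, not vendor data:

```python
# Toy model of path-dependent HBM reach: each stack is tagged with how many
# bridge traversals separate it from the compute chiplet. All figures are
# illustrative assumptions, not vendor data.

from dataclasses import dataclass

@dataclass(frozen=True)
class HbmStack:
    name: str
    bridge_hops: int   # 0 = directly adjacent; 1+ = reached through bridges
    capacity_gib: int

    @property
    def tier(self) -> str:
        return "near" if self.bridge_hops == 0 else "far"

stacks = [
    HbmStack("hbm0", 0, 24), HbmStack("hbm1", 0, 24),   # adjacent ring
    HbmStack("hbm2", 1, 24), HbmStack("hbm3", 1, 24),   # one bridge away
    HbmStack("hbm4", 2, 24),                            # two bridges away
]

capacity_by_tier: dict[str, int] = {}
for s in stacks:
    capacity_by_tier[s.tier] = capacity_by_tier.get(s.tier, 0) + s.capacity_gib

print(capacity_by_tier)  # capacity now describes a topology, not a flat pool
```

The same capacity that a flat model would report as one number now splits into tiers, and that split is what schedulers and allocators will have to reason about.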

6) Bandwidth survives; latency becomes nuanced

The reason this architecture is powerful is that it can preserve the fundamental strength of HBM—massive local bandwidth—without immediately collapsing into the much slower economics of board-level or rack-level links.

Public product data gives the right scale for intuition. NVIDIA’s H200 public page advertises 141 GB of HBM3E at 4.8 TB/s, and NVIDIA’s later Blackwell material advertises up to 192 GB HBM3E and 8 TB/s of HBM bandwidth, with two reticle-limited dies connected by a 10 TB/s chip-to-chip link inside the GPU.[6], [7], [8]

Those numbers matter because they show the class of performance advanced packages are already trying to sustain. The whole point of an embedded bridge structure is to extend reach without falling out of this performance envelope.

Interconnect style | Typical shape | Bandwidth model | Latency character | Best use
Direct HBM | Ultra-wide, very short electrical | Hundreds of GB/s to TB/s per stack class | Lowest external-memory latency in package | Near memory
HBM via package bridge | Still package-local, extended electrical path | Near-HBM class if designed well | Slightly higher; path-dependent | Farther on-package memory
Chip-to-chip die interconnect | Dedicated package/chiplet link | Very high but structured | Higher than direct local HBM path | Multi-die compute fabrics
Board/rack optical or network | Serialized, long reach | Scales with link count, not bus width | Orders of magnitude higher than on-package | Scale-out

The delicate part is latency. Vendors do not usually publish neat tables that say, “one package bridge hop adds X nanoseconds.” So any exact hop cost figure should be treated as architectural reasoning rather than a vendor spec. Still, the qualitative picture is clear:

  • direct-adjacent HBM is best;
  • bridge-extended HBM can remain close enough to be extremely valuable;
  • but it is no longer identical to direct HBM.

The likely outcome is not “HBM becomes slow.” The likely outcome is “HBM stops being perfectly uniform.”

That is the transition that changes the programming and scheduling model.
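A toy transfer-time model makes the distinction concrete. The 20 ns hop penalty below is purely an assumed figure for illustration, consistent with the caveat above that vendors publish no per-hop spec:

```python
# Sketch of "bandwidth survives; latency becomes nuanced." The far-path latency
# penalty (20 ns) is an illustrative assumption, not a vendor figure.

def transfer_time_us(nbytes: int, bandwidth_gbs: float, latency_ns: float) -> float:
    """Simple latency + size/bandwidth model for one transfer, in microseconds."""
    return latency_ns / 1e3 + nbytes / (bandwidth_gbs * 1e9) * 1e6

NEAR_NS, FAR_NS = 100, 120   # assumed access latencies; only the delta matters

for label, nbytes in (("64 MiB stream", 64 * 2**20), ("64 B access", 64)):
    near = transfer_time_us(nbytes, 1000, NEAR_NS)
    far = transfer_time_us(nbytes, 1000, FAR_NS)
    print(f"{label}: near {near:.2f} us, far {far:.2f} us ({far / near - 1:.0%} slower)")
```

Under these assumptions, a bulk stream amortizes the extra hop into noise, while a small latency-bound access pays the full relative penalty — which is exactly why the fabric stays HBM-like for throughput but not for uniformity.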

7) Why this is basically NUMA inside the package

NUMA, at its core, means that not all memory is equally cheap to reach. CPU systems have lived with this for years: local DRAM is fast; remote-socket DRAM is slower. The same idea now moves inward, into the package itself.

If some HBM stacks are directly adjacent and others are reached through one or more bridge structures, then the memory system has:

  • near HBM — the best external memory path;
  • far HBM — still on-package and still fast, but no longer equivalent;
  • path-dependent access costs — memory location now matters, even inside one accelerator package.

Figure 4 — On-package NUMA
[Diagram: compute with near HBM on the shortest paths and far HBM reached over extended paths through bridge elements.]
The easiest way to understand the package is to think in terms of near HBM and far HBM. Both are valuable. They are simply not identical anymore.

This is why “NUMA” is the right intuition. Not because the package suddenly behaves like a dual-socket server in absolute latency terms, but because memory locality and path length have become first-order concerns.

That matters even if the per-access delta is modest. AI workloads issue enormous numbers of memory accesses. Small differences repeated relentlessly become visible at system scale.
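That scale effect can be put in rough numbers. Every figure below is an assumption chosen only for order-of-magnitude intuition; real hardware hides most memory latency behind concurrency, which the exposed-fraction term crudely stands in for:

```python
# Order-of-magnitude intuition for "small differences repeated relentlessly":
# all three inputs are illustrative assumptions, not measurements.

extra_ns_per_access = 5      # assumed far-path penalty per access
accesses_per_sec = 1e9       # assumed access traffic reaching far HBM
exposed_fraction = 0.02      # fraction not hidden by concurrency (assumption)

added_seconds = extra_ns_per_access * 1e-9 * accesses_per_sec * exposed_fraction
print(f"~{added_seconds * 1e3:.0f} ms of exposed latency per second of runtime")
```

Even with 98 percent of the penalty hidden, a few nanoseconds per access turns into a visible single-digit-percent tax — small per access, first-order at system scale.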

8) The compute chiplet’s SRAM becomes the real shock absorber

Once external memory becomes non-uniform, the package needs a locality layer. That layer is not off-package and it is not distant HBM. It is the SRAM already sitting on the compute die: registers, L1/shared memory structures, and especially the large shared last-level cache.

One of the easiest mistakes to make when discussing giant memory fabrics is to focus only on HBM. But once the package stops being flat, on-die SRAM becomes even more strategically important. It is the layer that hides path variation.

Figure 5 — Memory hierarchy of a future accelerator
Registers / local state
        ↓
L1 / shared SRAM
        ↓
Large shared SRAM / L2
        ↓
Near HBM + far HBM fabric
        ↓
Inter-package / board / optical fabric / remote tiers
The future chip is not “compute + memory.” It is a tiered hierarchy in which SRAM is the fast locality layer, HBM is the capacity layer, and fabrics above the package provide additional scale.

This changes the role of SRAM from “helpful cache” to something more structural. SRAM becomes the place where the system tries to keep hot state, precisely because repeatedly paying far-HBM path costs is unnecessary if the workload has reuse.

What SRAM now does

  • hides path length differences between near and far HBM
  • absorbs temporal locality
  • reduces repeated bridge traversal
  • stabilizes performance when HBM topology is uneven

What this implies

  • placement decisions matter more
  • blocking/tiling strategies matter more
  • runtime allocators need more awareness
  • compilers and kernels can no longer assume “all HBM is equal”

Once you see the package this way, the chip stops looking like a monolithic GPU and starts looking like a carefully layered memory machine with compute embedded into the top of it.
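The tiling point can be quantified with the standard blocked-matmul traffic estimate: with square tiles held in SRAM, each operand element is refetched roughly n/t times, so external traffic shrinks in proportion to the tile size. A sketch under that textbook model, with illustrative sizes:

```python
# Roofline-style estimate of external-memory traffic for an n x n matmul:
# with t x t tiles held in SRAM, traffic is roughly 2 * n**3 / t elements.
# Sizes are illustrative; real kernels add edge effects and output traffic.

def hbm_traffic_elements(n: int, tile: int) -> float:
    """Approximate operand elements read from HBM for C = A @ B with tiling."""
    return 2 * n**3 / tile

n = 8192
for tile in (1, 64, 512):   # tile=1 means no reuse at all
    print(f"tile={tile:4d}: {hbm_traffic_elements(n, tile):.2e} element reads")
```

Every element that tiling keeps resident in SRAM is an HBM fetch — and, for far-resident operands, a bridge traversal — that never happens, which is precisely the shock-absorber role described above.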

9) Why the logic bridge is not optical — and where optical actually belongs

Another common confusion is to associate any “extended interconnect” with optical technology. That is not the right fit inside this package layer. Optical links are excellent when distance grows and serialized transport becomes attractive. But this package problem is about preserving an extremely wide, extremely local electrical character.

UCIe is helpful context because it shows where industry standardization is heading for die-to-die interoperability. The UCIe Consortium publicly positions the standard as an open die-to-die interconnect for chiplets within advanced packages.[9], [10] That standard matters, but it is still conceptually different from saying that HBM itself should be converted into a 100G/200G/400G-style optical link.

Question | Logic bridge answer | Optical answer
Distance regime | Millimeters to centimeters | Board to rack and beyond
Data shape | Preserve dense electrical paths | Serialize onto lanes
Goal | Keep HBM-like behavior alive | Scale reach and link count
Main tradeoff | Topology-sensitive latency inside package | Higher latency but far greater reach

So the right hierarchy is:

On-die / on-package:     dense electrical locality
Cross-die / package:     chiplet links, bridges, package fabrics
Board / rack / cluster:  serialized links, optics, network fabrics

A good way to say it is: electrical wins for locality; optical wins for scale. The bridge belongs to the first world, not the second.

10) Who actually manufactures the bridge structures?

There is usually no neat answer like “Company X sells a standard logic-bridge chip.” This is not that kind of market. In practice, bridge structures live inside advanced packaging ecosystems and are tightly coupled to the package stack, process, bump technology, interposer strategy, and chiplet floorplan.

Intel’s EMIB is the cleanest public example of an embedded bridge concept in mainstream packaging language. Intel explicitly describes EMIB as an embedded multi-die interconnect bridge for high-bandwidth package connectivity.[2], [3] TSMC’s CoWoS and 3DFabric material, meanwhile, makes clear that large interposer-based package integration for logic plus HBM is central to the AI/HPC roadmap.[4], [5]

That means the manufacturing reality spans several layers:

Architectural owner

The accelerator designer decides the system topology, memory map, die partitioning, and performance targets.

Examples: NVIDIA, AMD, hyperscalers, internal AI silicon teams.

Packaging / foundry owner

The foundry and advanced packaging ecosystem provide the actual interposer, bridge, bump, routing, and assembly capabilities.

Examples: Intel EMIB; TSMC CoWoS / 3DFabric; Samsung advanced packaging efforts.

Software visibility

This is the dangerous part: software rarely gets a clean, easy abstraction of all this topology unless vendors deliberately expose it.

Which is why topology-aware runtime design matters so much.

In other words, the bridge is best thought of as a packaging primitive, not a commodity device.

11) The next bottleneck is not hardware — it is orchestration

The hardware story is only half the story. Once memory becomes distributed inside the package, the system needs to decide what data belongs where. A topology-aware package without topology-aware software can still deliver disappointing behavior, simply because hot state may end up living in the wrong place.

This is where the future chip stops being just a packaging discussion and becomes a runtime, compiler, and scheduling discussion.

What naive software will do

  • treat all HBM as flat
  • allocate without locality goals
  • cause repeated traffic to farther HBM regions
  • leave bandwidth on the table due to topology-blind placement

What better software will do

  • classify memory into near and farther regions
  • place hot tensors and high-reuse state close to the compute domain that consumes them
  • use SRAM deliberately as a shield
  • reduce expensive path crossings and repeated far-memory fetches

That leads naturally to a future stack in which the compiler, runtime, allocator, kernel scheduler, and package topology all become entangled in a good way. The question is no longer just “can I fit the model?” It becomes:

What should live in SRAM?
What belongs in near HBM?
What can be tolerated in farther HBM?
Which compute domain should execute which portion of the graph?
How much repeated traffic am I forcing across the package?

The answer will differ for training, inference, MoE routing, retrieval-heavy workloads, long-context decoding, and systems that aggressively pipeline activations or KV state. But the common idea is the same: data placement becomes a first-class optimization dimension.
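One way to make "placement as a first-class optimization dimension" concrete is a greedy placement policy: rank state by expected reuse and fill the fastest tier first. A minimal sketch — the tier names, capacities, and heat scores are all hypothetical, and a real runtime would also weigh bandwidth, lifetimes, and sharing:

```python
# Greedy topology-aware placement sketch: hottest tensors land in the fastest
# tier that can hold them. Tiers, capacities, and heat scores are hypothetical.

def place(tensors, tiers):
    """tensors: [(name, size_bytes, heat)]; tiers: [(tier_name, capacity_bytes)],
    ordered fastest first. Returns {tensor_name: tier_name}."""
    placement = {}
    free = {name: cap for name, cap in tiers}
    for name, size, _heat in sorted(tensors, key=lambda t: -t[2]):
        for tier_name, _cap in tiers:          # try fastest tier first
            if free[tier_name] >= size:
                free[tier_name] -= size
                placement[name] = tier_name
                break
        else:
            raise MemoryError(f"no tier can hold {name}")
    return placement

tiers = [("sram", 256 << 20), ("near_hbm", 96 << 30), ("far_hbm", 192 << 30)]
tensors = [("activations", 128 << 20, 1.0), ("kv_cache", 48 << 30, 0.9),
           ("weights", 80 << 30, 0.7), ("optimizer_state", 60 << 30, 0.2)]

print(place(tensors, tiers))
```

Even this naive policy already expresses the essay's point: hot activations stay in SRAM, the reuse-heavy KV cache claims near HBM, and cold optimizer state is banished to the far tier — decisions a topology-blind allocator simply cannot make.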

Hardware can create the memory fabric. Only orchestration can make that fabric behave intelligently.

12) The next-generation accelerator, viewed as a whole

Put all the pieces together, and the outline of a next-generation AI chip becomes much clearer.

It is not just a bigger die. It is not merely “more HBM.” It is a hierarchy:

Registers / local execution state
        ↓
L1 / shared SRAM
        ↓
Large shared on-die SRAM / cache
        ↓
Near HBM (best package-local capacity)
        ↓
Farther HBM reached through package bridges
        ↓
Chip-to-chip package fabrics
        ↓
Board/rack fabrics, increasingly optical
        ↓
Remote memory and storage tiers

Public roadmaps already show the ingredients. HBM bandwidth and capacity continue to rise.[1], [6], [8] Package technologies keep moving toward larger, denser chiplet-scale integration.[2], [4], [5] UCIe shows that die-to-die interconnect is becoming an ecosystem-level concern, not a one-off trick.[9], [10] Blackwell shows how aggressively multi-die structures are now being pushed at the top end, including internal chip-to-chip links at enormous bandwidth.[7], [8]

Figure 6 — A plausible next-generation AI chip stack
Compute cores / local execution
        ↓
SRAM hierarchy
        ↓
Near HBM + far HBM via package bridges
        ↓
Die-to-die / chiplet fabric / package interconnect
        ↓
Board / rack optical + remote tiers
The next-generation accelerator is best thought of as a stacked locality hierarchy. As you move downward, capacity expands and reach grows — but locality becomes more valuable and orchestration becomes more important.

The deepest implication is this: the leading-edge AI chip of the next few years is likely to be memory-centric in a much stronger sense than many software people currently appreciate. The hardware problem is not just compute density. It is how to keep useful state close enough, long enough, and cheap enough to sustain the model’s actual working set.

Once package designers break the old HBM adjacency assumption, the chip stops being a simple “compute plus memory” object. It becomes a local fabric with tiers, path lengths, and non-uniformity. That is why the right mental model is not “a giant GPU.” It is:

a topology-aware memory system with compute embedded inside it.

And that is why the next leap after advanced packaging will not just be better silicon. It will be better placement, better data movement, better hierarchy control, and better software that finally understands the package it is running on.

References

  1. Samsung Semiconductor, HBM3E — public product page describing up to 9.2 Gbps and up to 1,180 GB/s bandwidth for 12-high HBM3E stacks.
  2. Intel, EMIB Product Brief — public description of Embedded Multi-die Interconnect Bridge technology and its packaging role.
  3. Intel, EMIB Technology Explained and Intel foundry packaging materials — public explanation of EMIB as an in-package high-density interconnect approach.
  4. TSMC, CoWoS® — public overview of chip-on-wafer-on-substrate packaging for logic plus HBM and large AI/HPC packages.
  5. TSMC, 3DFabric / advanced packaging materials — public details on CoWoS family and heterogeneous integration for HPC systems.
  6. NVIDIA, H200 — public product page describing 141 GB HBM3E and 4.8 TB/s bandwidth.
  7. NVIDIA, Blackwell Architecture — public statement that Blackwell products use two reticle-limited dies connected by a 10 TB/s chip-to-chip interconnect.
  8. NVIDIA Developer Blog, Inside NVIDIA Blackwell Ultra — public comparison table listing up to 192 GB HBM3E and 8 TB/s bandwidth for Blackwell generation products.
  9. UCIe Consortium, Specifications — public description of UCIe as a standardized die-to-die interconnect for chiplets.
  10. UCIe Consortium, Home / resources — public material describing the open chiplet ecosystem and current specification evolution.

Note on latency: where this essay discusses the qualitative latency cost of package bridges, it is making an architectural inference rather than quoting a vendor-published “per-hop” spec. Public vendor materials clearly support the packaging direction; they do not usually publish a canonical hop-latency table for bridge-extended HBM paths.