Architecture Essay

The Next-Generation AI Chip:
From Flat HBM to an On‑Package NUMA Fabric

The biggest shift in AI silicon is no longer just more tensor cores or more FLOPs. It is the transformation of memory from a flat, uniform pool into a structured, topology-aware fabric with near HBM, far HBM, large SRAM layers, high-density package bridges, and optical fabrics above the package.

HBM3E-class bandwidth · Distributed bridges · On-package NUMA · SRAM locality layer · Optical above the package

Core thesis

The next frontier accelerator is best understood not as “a bigger GPU,” but as a memory system with compute embedded inside it. Once package designers break the old adjacency constraint of HBM using embedded bridge structures, the chip stops behaving like a flat memory machine and starts behaving like a NUMA system in miniature.

  • Bandwidth: ~1.18 TB/s per Samsung HBM3E 12-high stack at 9.2 Gbps
  • Modern scale: 8 TB/s max HBM bandwidth (NVIDIA Blackwell)
  • Topology shift: near ≠ far; HBM becomes path-dependent
  • Packaging: bridge + interposer (EMIB / CoWoS / chiplet era)

1) The end of the “flat memory” era

For a long time, accelerator design benefited from a quiet simplification: all high-bandwidth memory was treated as essentially equivalent. That simplification made the memory model feel flat even when the packaging itself was very sophisticated.

The standard mental model was straightforward. Put one or more large compute dies in the center of the package. Surround them with HBM stacks. Let the memory controllers and on-package wiring do the rest. As long as every HBM stack sits very close to the compute die and the wiring remains short, dense, and highly synchronized, the compute fabric can treat the external memory system as a fast, almost uniform pool.

That assumption is now under pressure from all sides. Model sizes continue to expand. Activations, KV caches, embeddings, routing tables, and training state all compete for capacity. At the same time, it is no longer enough to improve peak FLOPs in isolation. The memory subsystem increasingly determines whether those FLOPs can be fed at all.

The real bottleneck is shifting from “how many operations can the chip do?” to “how much useful state can the package hold, and how intelligently can the system move that state?”

In other words, the future chip is not just a compute device. It is a hierarchy of storage and movement layers: on-die SRAM, near HBM, farther HBM reached through package structures, inter-chip fabrics, and eventually optical links that extend across the board and rack.

2) Why HBM hits a physical wall

High Bandwidth Memory works because it is radically different from a conventional, long-reach interface. The whole value proposition of HBM is tied to a very wide, very dense, very short connection. Samsung’s public HBM3E material, for example, advertises up to 1,180 GB/s per 12‑high stack at 9.2 Gbps, which gives a useful sense of the bandwidth class designers are operating in.[1]

That is not achieved with a narrow serialized link. It is achieved with a broad interface and a short physical reach. The memory wins by staying close.
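As a sanity check on that bandwidth class, the headline stack figure falls straight out of width times pin rate. A minimal sketch, assuming the 1024-bit per-stack interface that HBM generations are commonly described with (the 9.2 Gbps rate is from Samsung's public HBM3E figures):

```python
# Back-of-envelope HBM stack bandwidth: interface width (bits) x pin rate / 8.
# Assumes a 1024-bit interface per stack, the widely reported HBM organization.

def stack_bandwidth_gbps(interface_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth of one HBM stack in GB/s."""
    return interface_bits * pin_rate_gbps / 8

bw = stack_bandwidth_gbps(1024, 9.2)
print(f"{bw:.1f} GB/s per stack")  # ~1177.6 GB/s, i.e. the ~1,180 GB/s class
```

The arithmetic is the point: the bandwidth comes from width, and width is exactly what long reach destroys.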

What makes HBM powerful

  • Ultra-wide interface
  • Very high aggregate bandwidth
  • Lower energy than long-reach serialized transport
  • Excellent fit for bandwidth-hungry AI workloads

What makes HBM fragile

  • Tight timing margins
  • Signal integrity sensitivity
  • Distance-dependent skew and loss
  • Severe package layout constraints

The hidden consequence

  • You cannot scale HBM count arbitrarily
  • Capacity gets capped by geometry
  • Compute scaling outruns memory scaling
  • The package becomes the bottleneck

Figure 1 — The classical HBM assumption
[Diagram: a central compute die ringed by directly adjacent HBM stacks.]
Classical HBM packaging works beautifully as long as all stacks remain physically close. The problem is that this geometry does not scale indefinitely.

This is why “just add more HBM” is not a trivial suggestion. Once physical distance grows, electrical behavior changes. At that point, memory scaling stops being a pure logic problem and turns into a package physics problem.

3) What a logic bridge actually is

The key architectural idea is to stop treating HBM adjacency as sacred. Instead of insisting that every HBM stack sit directly beside compute, a package can insert small bridge structures that restore and carry signals farther across the substrate or interposer.

The important part is not the phrase “logic bridge.” The important part is the role: keep the interface electrically viable over a longer path without collapsing into a slow off-package transport model.

Think of the bridge as

  • a distributed signal extension element
  • a package-native repeater / retiming structure
  • something much closer to “smart wiring” than to a processor
  • a way to preserve high-density electrical connectivity beyond the old reach limit

Do not think of the bridge as

  • a PCIe switch
  • a NIC
  • a general packet router
  • a central memory controller hub aggregating traffic from every stack

Intel’s public material on EMIB is useful here because it shows how mainstream advanced packaging has already moved toward high-density in-package bridges. Intel describes EMIB as an embedded multi-die interconnect bridge that enables high-bandwidth communication between dies in one package.[2], [3] TSMC’s CoWoS public pages make the complementary point from the interposer side: advanced package platforms are explicitly being built for logic chiplets plus HBM at scale.[4], [5]

So the conceptual move here is not science fiction. It is a natural extension of where advanced packaging already is: denser interconnect, larger package fabrics, more chiplets, more HBM, and more package-level structure standing between compute and memory.

4) What the bridge is not: why this is not just “a custom chip in the middle”

A very natural misunderstanding is to imagine one custom bridge die sitting in the middle like a little switch:

HBM ─┐
HBM ─┼── [Bridge Chip] ── Compute
HBM ─┘

That model is intuitive, but it is the wrong one for understanding why this can preserve near-HBM behavior. A central switching element would immediately raise questions about arbitration, buffering, serialization, and aggregated bottlenecks. It would behave less like HBM and more like a fabric endpoint.

The more accurate picture is distributed and geometric. Several bridge structures are embedded in the package, and different memory paths may traverse one or more of them. The compute chiplet still talks to HBM over dense package interconnect; the bridges simply allow some of those paths to go farther.

Figure 2 — The distributed-bridge mental model
[Diagram: a compute die plus HBM stacks, some reached directly and others through bridge elements distributed across the package.]
The useful mental model is not a central switch, but a distributed package fabric in which some HBM paths are direct and others are extended by local bridge elements.

That distinction matters enormously. If the system collapsed onto a narrow serialized bottleneck, it would stop being HBM-like. The whole point is to preserve broad, local, electrical connectivity while changing the reach envelope.

5) The topology shift: from “compute surrounded by memory” to “compute embedded inside memory”

Once bridges exist, the package is no longer a ring of adjacent HBM around a center die. It becomes a true topology. Some stacks are one electrical neighborhood away. Others are two neighborhoods away. What matters is no longer just capacity; it is path.

This is the architectural shift that makes the entire conversation interesting: the package stops behaving like a flat pool and starts behaving like a memory fabric.

Figure 3 — Before and after
[Diagram: left, the flat HBM era, with a compute die tightly ringed by equivalent HBM stacks; right, the expanded fabric era, with additional HBM stacks reached through multiple distributed bridge elements.]
Left: all HBM is treated as roughly equivalent because it is tightly packed around compute. Right: the package grows into a fabric with path-dependent reach, and therefore path-dependent behavior.

The practical consequence is simple but profound: a memory access now has a topology. There is “where” the line lives, not just “whether” the line exists.
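The "access now has a topology" idea can be made concrete with a toy model in which each stack carries its bridge-hop count. Everything here — stack names, hop counts, capacities — is illustrative, not vendor data:

```python
# Toy model of path-dependent HBM reach: each stack is tagged with how many
# bridge traversals separate it from the compute chiplet. All figures are
# illustrative assumptions, not vendor data.

from dataclasses import dataclass

@dataclass(frozen=True)
class HbmStack:
    name: str
    bridge_hops: int   # 0 = directly adjacent; 1+ = reached through bridges
    capacity_gib: int

    @property
    def tier(self) -> str:
        return "near" if self.bridge_hops == 0 else "far"

stacks = [
    HbmStack("hbm0", 0, 24), HbmStack("hbm1", 0, 24),   # adjacent ring
    HbmStack("hbm2", 1, 24), HbmStack("hbm3", 1, 24),   # one bridge away
    HbmStack("hbm4", 2, 24),                            # two bridges away
]

capacity_by_tier: dict[str, int] = {}
for s in stacks:
    capacity_by_tier[s.tier] = capacity_by_tier.get(s.tier, 0) + s.capacity_gib

print(capacity_by_tier)  # capacity now describes a topology, not a flat pool
```

The same capacity that a flat model would report as one number now splits into tiers, and that split is what schedulers and allocators will have to reason about.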

6) Bandwidth survives; latency becomes nuanced

The reason this architecture is powerful is that it can preserve the fundamental strength of HBM—massive local bandwidth—without immediately collapsing into the much slower economics of board-level or rack-level links.

Public product data gives the right scale for intuition. NVIDIA’s H200 public page advertises 141 GB of HBM3E at 4.8 TB/s, and NVIDIA’s later Blackwell material advertises up to 192 GB HBM3E and 8 TB/s of HBM bandwidth, with two reticle-limited dies connected by a 10 TB/s chip-to-chip link inside the GPU.[6], [7], [8]

Those numbers matter because they show the class of performance advanced packages are already trying to sustain. The whole point of an embedded bridge structure is to extend reach without falling out of this performance envelope.

Interconnect style | Typical shape | Bandwidth model | Latency character | Best use
Direct HBM | Ultra-wide, very short electrical | Hundreds of GB/s to TB/s per stack class | Lowest external-memory latency in package | Near memory
HBM via package bridge | Still package-local, extended electrical path | Near-HBM class if designed well | Slightly higher; path-dependent | Farther on-package memory
Chip-to-chip die interconnect | Dedicated package/chiplet link | Very high but structured | Higher than direct local HBM path | Multi-die compute fabrics
Board/rack optical or network | Serialized, long reach | Scales with link count, not bus width | Orders of magnitude higher than on-package | Scale-out

The delicate part is latency. Vendors do not usually publish neat tables that say, “one package bridge hop adds X nanoseconds.” So any exact hop cost figure should be treated as architectural reasoning rather than a vendor spec. Still, the qualitative picture is clear:

  • direct-adjacent HBM is best;
  • bridge-extended HBM can remain close enough to be extremely valuable;
  • but it is no longer identical to direct HBM.

The likely outcome is not “HBM becomes slow.” The likely outcome is “HBM stops being perfectly uniform.”

That is the transition that changes the programming and scheduling model.
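A toy transfer-time model makes the distinction concrete. The 20 ns hop penalty below is purely an assumed figure for illustration, consistent with the caveat above that vendors publish no per-hop spec:

```python
# Sketch of "bandwidth survives; latency becomes nuanced." The far-path latency
# penalty (20 ns) is an illustrative assumption, not a vendor figure.

def transfer_time_us(nbytes: int, bandwidth_gbs: float, latency_ns: float) -> float:
    """Simple latency + size/bandwidth model for one transfer, in microseconds."""
    return latency_ns / 1e3 + nbytes / (bandwidth_gbs * 1e9) * 1e6

NEAR_NS, FAR_NS = 100, 120   # assumed access latencies; only the delta matters

for label, nbytes in (("64 MiB stream", 64 * 2**20), ("64 B access", 64)):
    near = transfer_time_us(nbytes, 1000, NEAR_NS)
    far = transfer_time_us(nbytes, 1000, FAR_NS)
    print(f"{label}: near {near:.2f} us, far {far:.2f} us ({far / near - 1:.0%} slower)")
```

Under these assumptions, a bulk stream amortizes the extra hop into noise, while a small latency-bound access pays the full relative penalty — which is exactly why the fabric stays HBM-like for throughput but not for uniformity.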

7) Why this is basically NUMA inside the package

NUMA, at its core, means that not all memory is equally cheap to reach. CPU systems have lived with this for years: local DRAM is fast; remote-socket DRAM is slower. The same idea now moves inward, into the package itself.

If some HBM stacks are directly adjacent and others are reached through one or more bridge structures, then the memory system has:

  • near HBM — the best external memory path;
  • far HBM — still on-package and still fast, but no longer equivalent;
  • path-dependent access costs — memory location now matters, even inside one accelerator package.

Figure 4 — On-package NUMA
[Diagram: compute with near HBM on the shortest paths and far HBM reached over extended paths through bridge elements.]
The easiest way to understand the package is to think in terms of near HBM and far HBM. Both are valuable. They are simply not identical anymore.

This is why “NUMA” is the right intuition. Not because the package suddenly behaves like a dual-socket server in absolute latency terms, but because memory locality and path length have become first-order concerns.

That matters even if the per-access delta is modest. AI workloads issue enormous numbers of memory accesses. Small differences repeated relentlessly become visible at system scale.
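That scale effect can be put in rough numbers. Every figure below is an assumption chosen only for order-of-magnitude intuition; real hardware hides most memory latency behind concurrency, which the exposed-fraction term crudely stands in for:

```python
# Order-of-magnitude intuition for "small differences repeated relentlessly":
# all three inputs are illustrative assumptions, not measurements.

extra_ns_per_access = 5      # assumed far-path penalty per access
accesses_per_sec = 1e9       # assumed access traffic reaching far HBM
exposed_fraction = 0.02      # fraction not hidden by concurrency (assumption)

added_seconds = extra_ns_per_access * 1e-9 * accesses_per_sec * exposed_fraction
print(f"~{added_seconds * 1e3:.0f} ms of exposed latency per second of runtime")
```

Even with 98 percent of the penalty hidden, a few nanoseconds per access turns into a visible single-digit-percent tax — small per access, first-order at system scale.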

8) The compute chiplet’s SRAM becomes the real shock absorber

Once external memory becomes non-uniform, the package needs a locality layer. That layer is not off-package and it is not distant HBM. It is the SRAM already sitting on the compute die: registers, L1/shared memory structures, and especially the large shared last-level cache.

One of the easiest mistakes to make when discussing giant memory fabrics is to focus only on HBM. But once the package stops being flat, on-die SRAM becomes even more strategically important. It is the layer that hides path variation.

Figure 5 — Memory hierarchy of a future accelerator
Registers / local state
        ↓
L1 / shared SRAM
        ↓
Large shared SRAM / L2
        ↓
Near HBM + far HBM fabric
        ↓
Inter-package / board / optical fabric / remote tiers
The future chip is not “compute + memory.” It is a tiered hierarchy in which SRAM is the fast locality layer, HBM is the capacity layer, and fabrics above the package provide additional scale.

This changes the role of SRAM from “helpful cache” to something more structural. SRAM becomes the place where the system tries to keep hot state, precisely because repeatedly paying far-HBM path costs is unnecessary if the workload has reuse.

What SRAM now does

  • hides path length differences between near and far HBM
  • absorbs temporal locality
  • reduces repeated bridge traversal
  • stabilizes performance when HBM topology is uneven

What this implies

  • placement decisions matter more
  • blocking/tiling strategies matter more
  • runtime allocators need more awareness
  • compilers and kernels can no longer assume “all HBM is equal”

Once you see the package this way, the chip stops looking like a monolithic GPU and starts looking like a carefully layered memory machine with compute embedded into the top of it.
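The tiling point can be quantified with the standard blocked-matmul traffic estimate: with square tiles held in SRAM, each operand element is refetched roughly n/t times, so external traffic shrinks in proportion to the tile size. A sketch under that textbook model, with illustrative sizes:

```python
# Roofline-style estimate of external-memory traffic for an n x n matmul:
# with t x t tiles held in SRAM, traffic is roughly 2 * n**3 / t elements.
# Sizes are illustrative; real kernels add edge effects and output traffic.

def hbm_traffic_elements(n: int, tile: int) -> float:
    """Approximate operand elements read from HBM for C = A @ B with tiling."""
    return 2 * n**3 / tile

n = 8192
for tile in (1, 64, 512):   # tile=1 means no reuse at all
    print(f"tile={tile:4d}: {hbm_traffic_elements(n, tile):.2e} element reads")
```

Every element that tiling keeps resident in SRAM is an HBM fetch — and, for far-resident operands, a bridge traversal — that never happens, which is precisely the shock-absorber role described above.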

9) Why the logic bridge is not optical — and where optical actually belongs

Another common confusion is to associate any “extended interconnect” with optical technology. That is not the right fit inside this package layer. Optical links are excellent when distance grows and serialized transport becomes attractive. But this package problem is about preserving an extremely wide, extremely local electrical character.

UCIe is helpful context because it shows where industry standardization is heading for die-to-die interoperability. The UCIe Consortium publicly positions the standard as an open die-to-die interconnect for chiplets within advanced packages.[9], [10] That standard matters, but it is still conceptually different from saying that HBM itself should be converted into a 100G/200G/400G-style optical link.

Question | Logic bridge answer | Optical answer
Distance regime | Millimeters to centimeters | Board to rack and beyond
Data shape | Preserve dense electrical paths | Serialize onto lanes
Goal | Keep HBM-like behavior alive | Scale reach and link count
Main tradeoff | Topology-sensitive latency inside package | Higher latency but far greater reach

So the right hierarchy is:

On-die / on-package:     dense electrical locality
Cross-die / package:     chiplet links, bridges, package fabrics
Board / rack / cluster:  serialized links, optics, network fabrics

A good way to say it is: electrical wins for locality; optical wins for scale. The bridge belongs to the first world, not the second.

10) Who actually manufactures the bridge structures?

There is usually no neat answer like “Company X sells a standard logic-bridge chip.” This is not that kind of market. In practice, bridge structures live inside advanced packaging ecosystems and are tightly coupled to the package stack, process, bump technology, interposer strategy, and chiplet floorplan.

Intel’s EMIB is the cleanest public example of an embedded bridge concept in mainstream packaging language. Intel explicitly describes EMIB as an embedded multi-die interconnect bridge for high-bandwidth package connectivity.[2], [3] TSMC’s CoWoS and 3DFabric material, meanwhile, makes clear that large interposer-based package integration for logic plus HBM is central to the AI/HPC roadmap.[4], [5]

That means the manufacturing reality spans several layers:

Architectural owner

The accelerator designer decides the system topology, memory map, die partitioning, and performance targets.

Examples: NVIDIA, AMD, hyperscalers, internal AI silicon teams.

Packaging / foundry owner

The foundry and advanced packaging ecosystem provide the actual interposer, bridge, bump, routing, and assembly capabilities.

Examples: Intel EMIB; TSMC CoWoS / 3DFabric; Samsung advanced packaging efforts.

Software visibility

This is the dangerous part: software rarely gets a clean, easy abstraction of all this topology unless vendors deliberately expose it.

Which is why topology-aware runtime design matters so much.

In other words, the bridge is best thought of as a packaging primitive, not a commodity device.

11) The next bottleneck is not hardware — it is orchestration

The hardware story is only half the story. Once memory becomes distributed inside the package, the system needs to decide what data belongs where. A topology-aware package without topology-aware software can still deliver disappointing behavior, simply because hot state may end up living in the wrong place.

This is where the future chip stops being just a packaging discussion and becomes a runtime, compiler, and scheduling discussion.

What naive software will do

  • treat all HBM as flat
  • allocate without locality goals
  • cause repeated traffic to farther HBM regions
  • leave bandwidth on the table due to topology-blind placement

What better software will do

  • classify memory into near and farther regions
  • place hot tensors and high-reuse state close to the compute domain that consumes them
  • use SRAM deliberately as a shield
  • reduce expensive path crossings and repeated far-memory fetches

That leads naturally to a future stack in which the compiler, runtime, allocator, kernel scheduler, and package topology all become entangled in a good way. The question is no longer just “can I fit the model?” It becomes:

What should live in SRAM?
What belongs in near HBM?
What can be tolerated in farther HBM?
Which compute domain should execute which portion of the graph?
How much repeated traffic am I forcing across the package?

The answer will differ for training, inference, MoE routing, retrieval-heavy workloads, long-context decoding, and systems that aggressively pipeline activations or KV state. But the common idea is the same: data placement becomes a first-class optimization dimension.
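One way to make "placement as a first-class optimization dimension" concrete is a greedy placement policy: rank state by expected reuse and fill the fastest tier first. A minimal sketch — the tier names, capacities, and heat scores are all hypothetical, and a real runtime would also weigh bandwidth, lifetimes, and sharing:

```python
# Greedy topology-aware placement sketch: hottest tensors land in the fastest
# tier that can hold them. Tiers, capacities, and heat scores are hypothetical.

def place(tensors, tiers):
    """tensors: [(name, size_bytes, heat)]; tiers: [(tier_name, capacity_bytes)],
    ordered fastest first. Returns {tensor_name: tier_name}."""
    placement = {}
    free = {name: cap for name, cap in tiers}
    for name, size, _heat in sorted(tensors, key=lambda t: -t[2]):
        for tier_name, _cap in tiers:          # try fastest tier first
            if free[tier_name] >= size:
                free[tier_name] -= size
                placement[name] = tier_name
                break
        else:
            raise MemoryError(f"no tier can hold {name}")
    return placement

tiers = [("sram", 256 << 20), ("near_hbm", 96 << 30), ("far_hbm", 192 << 30)]
tensors = [("activations", 128 << 20, 1.0), ("kv_cache", 48 << 30, 0.9),
           ("weights", 80 << 30, 0.7), ("optimizer_state", 60 << 30, 0.2)]

print(place(tensors, tiers))
```

Even this naive policy already expresses the essay's point: hot activations stay in SRAM, the reuse-heavy KV cache claims near HBM, and cold optimizer state is banished to the far tier — decisions a topology-blind allocator simply cannot make.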

Hardware can create the memory fabric. Only orchestration can make that fabric behave intelligently.

12) The next-generation accelerator, viewed as a whole

Put all the pieces together, and the outline of a next-generation AI chip becomes much clearer.

It is not just a bigger die. It is not merely “more HBM.” It is a hierarchy:

Registers / local execution state
        ↓
L1 / shared SRAM
        ↓
Large shared on-die SRAM / cache
        ↓
Near HBM (best package-local capacity)
        ↓
Farther HBM reached through package bridges
        ↓
Chip-to-chip package fabrics
        ↓
Board/rack fabrics, increasingly optical
        ↓
Remote memory and storage tiers

Public roadmaps already show the ingredients. HBM bandwidth and capacity continue to rise.[1], [6], [8] Package technologies keep moving toward larger, denser chiplet-scale integration.[2], [4], [5] UCIe shows that die-to-die interconnect is becoming an ecosystem-level concern, not a one-off trick.[9], [10] Blackwell shows how aggressively multi-die structures are now being pushed at the top end, including internal chip-to-chip links at enormous bandwidth.[7], [8]

Figure 6 — A plausible next-generation AI chip stack
Compute cores / local execution
        ↓
SRAM hierarchy
        ↓
Near HBM + far HBM via package bridges
        ↓
Die-to-die / chiplet fabric / package interconnect
        ↓
Board / rack optical + remote tiers
The next-generation accelerator is best thought of as a stacked locality hierarchy. As you move downward, capacity expands and reach grows — but locality becomes more valuable and orchestration becomes more important.

The deepest implication is this: the leading-edge AI chip of the next few years is likely to be memory-centric in a much stronger sense than many software people currently appreciate. The hardware problem is not just compute density. It is how to keep useful state close enough, long enough, and cheap enough to sustain the model’s actual working set.

Once package designers break the old HBM adjacency assumption, the chip stops being a simple “compute plus memory” object. It becomes a local fabric with tiers, path lengths, and non-uniformity. That is why the right mental model is not “a giant GPU.” It is:

a topology-aware memory system with compute embedded inside it.

And that is why the next leap after advanced packaging will not just be better silicon. It will be better placement, better data movement, better hierarchy control, and better software that finally understands the package it is running on.

References

  1. Samsung Semiconductor, HBM3E — public product page describing up to 9.2 Gbps and up to 1,180 GB/s bandwidth for 12-high HBM3E stacks.
  2. Intel, EMIB Product Brief — public description of Embedded Multi-die Interconnect Bridge technology and its packaging role.
  3. Intel, EMIB Technology Explained and Intel foundry packaging materials — public explanation of EMIB as an in-package high-density interconnect approach.
  4. TSMC, CoWoS® — public overview of chip-on-wafer-on-substrate packaging for logic plus HBM and large AI/HPC packages.
  5. TSMC, 3DFabric / advanced packaging materials — public details on CoWoS family and heterogeneous integration for HPC systems.
  6. NVIDIA, H200 — public product page describing 141 GB HBM3E and 4.8 TB/s bandwidth.
  7. NVIDIA, Blackwell Architecture — public statement that Blackwell products use two reticle-limited dies connected by a 10 TB/s chip-to-chip interconnect.
  8. NVIDIA Developer Blog, Inside NVIDIA Blackwell Ultra — public comparison table listing up to 192 GB HBM3E and 8 TB/s bandwidth for Blackwell generation products.
  9. UCIe Consortium, Specifications — public description of UCIe as a standardized die-to-die interconnect for chiplets.
  10. UCIe Consortium, Home / resources — public material describing the open chiplet ecosystem and current specification evolution.

Note on latency: where this essay discusses the qualitative latency cost of package bridges, it is making an architectural inference rather than quoting a vendor-published “per-hop” spec. Public vendor materials clearly support the packaging direction; they do not usually publish a canonical hop-latency table for bridge-extended HBM paths.