There is a very understandable instinct in modern AI chip design: make the caches bigger. Add more L1. Add more L2. Add some kind of huge shared L3 or system cache. Keep more data on die, reduce expensive off-chip traffic, and let the machine spend more of its time doing useful work.
That instinct is not wrong. In fact, it is often directionally right. Bigger on-die memory can be one of the highest-return levers in AI systems, especially when power is scarce, external movement is expensive, and hyperscalers are willing to spend silicon if it reduces joules per useful token.
But if AI really is forcing us to rethink architecture from scratch, then the deeper opportunity is not just “more cache.” It is to stop inheriting the generic CPU memory hierarchy as a default assumption and instead design an AI-native residency hierarchy built around the actual data classes that dominate inference and training: weights, KV cache, activations, expert blocks, routing state, optimizer state, and metadata.
That is the real paradigm shift.
The future is probably not a bigger generic cache pyramid. It is a managed on-die residency fabric that knows the difference between a weight block, a KV region, an activation tile, and an expert shard.
First: Bigger Caches Do Help
Let’s start with the part that is easy to undersell: larger L1, L2, and shared on-die SRAM can be very powerful in AI systems.
Why? Because AI workloads are dominated by expensive data movement. If more hot state can stay on die, the system avoids trips outward to HBM, DRAM, CXL, or SSD-backed staging tiers. Those avoided misses are often worth far more than a few cycles of extra hit latency inside the cache itself.
The right mental model is not “bigger L2 makes everything lower latency.” Often it does not, at least not directly. Large caches can even increase hit latency somewhat. The real win is that they can dramatically reduce the frequency of catastrophically slower misses.
If a slightly larger L2 increases hit latency by a little but cuts off-chip misses by a lot, the average effective access time can still improve enormously. More importantly, the machine may stop saturating external bandwidth on repeated refill of the same logically hot data.
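The standard average-memory-access-time arithmetic makes that trade concrete. The latencies and miss rates below are illustrative round numbers, not measurements from any real part:

```python
def amat(hit_latency: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: hit cost plus the expected miss cost."""
    return hit_latency + miss_rate * miss_penalty

# Illustrative: a baseline L2 versus a larger, slightly slower L2,
# with an off-chip miss penalty of ~200 cycles.
baseline = amat(hit_latency=12, miss_rate=0.10, miss_penalty=200)  # ~32 cycles
bigger = amat(hit_latency=16, miss_rate=0.02, miss_penalty=200)    # ~20 cycles
```

Four extra cycles of hit latency are repaid many times over by the avoided misses, and the avoided refill traffic on the external links matters even more than the averaged latency.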
But Bigger Caches Are Not Free
Now the harder part.
Every extra chunk of L1, L2, or L3 costs:
- Die area, which could have gone to compute, network-on-chip, compression engines, scheduling hardware, or something more specialized.
- Power, both dynamic and leakage.
- Latency risk, especially at the smallest and most timing-sensitive levels.
- Routing complexity, because large on-die memory is really a wire problem as much as a bitcell problem.
- Architectural opportunity cost, because “more cache” is one possible use of silicon, not the only use.
So the real design question is not “would more cache help?” It is:
Is the next square millimeter and the next watt better spent on larger generic caches, or on a more AI-specific memory and control fabric that captures the same hot state with better semantics?
The Deeper Problem: AI Data Is Not Generic
Traditional cache hierarchies were designed for general-purpose software. They assume opaque working sets, irregular control flow, and hardware inference of locality. That is a sensible approach when the processor has no semantic understanding of the program’s data.
AI workloads are different. They contain large, structured objects with very different access patterns and very different costs of demotion, movement, and restoration. A weight shard is not the same thing as a KV page. A KV page is not the same thing as an activation tile. An activation tile is not the same thing as an MoE expert block. An expert block is not the same thing as routing metadata.
Yet a generic cache hierarchy tends to treat all of them as undifferentiated cache lines.
That is the core mismatch.
What AI Really Wants from Memory
If you were designing an AI processor from scratch, the memory system would not be organized first around traditional labels like L1, L2, and L3. It would be organized around data roles.
This is a more AI-native way to think about the hierarchy. Instead of asking “what should be cached?” it asks “what class of state is this, how hot is it, how much does it cost to restore, and what tier behavior should it get?”
The Processor Should Know What Kind of Data It Is Holding
That may sound radical only because current architectures mostly avoid it. But for AI, it is entirely reasonable.
A new processor could explicitly distinguish between:
- Weights — often immutable during inference, large, and repeatedly reused.
- KV state — growing, region-structured, relevance-sensitive, and costly to keep fully hot.
- Activations — short-lived but latency-sensitive, especially in training and large decode pipelines.
- Expert blocks — sparse, route-dependent, and sensitive to promotion timing.
- Routing and control metadata — small but operationally critical.
- Optimizer state / gradients — much more important on the training side.
Each class should receive a different policy. Different residency. Different compression behavior. Different prefetch treatment. Different demotion thresholds. Different protection semantics.
That is the architectural leap: not just bigger memory, but class-aware memory.
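To make "class-aware" concrete, here is a minimal sketch of what a class-to-policy mapping could look like at the runtime level. Every name and field is hypothetical; the point is only that each data class carries its own policy instead of inheriting one generic replacement scheme:

```python
from dataclasses import dataclass
from enum import Enum, auto

class DataClass(Enum):
    WEIGHTS = auto()
    KV_STATE = auto()
    ACTIVATIONS = auto()
    EXPERT_BLOCK = auto()
    ROUTING_META = auto()
    OPTIMIZER_STATE = auto()

@dataclass(frozen=True)
class ResidencyPolicy:
    pinnable: bool            # may be wired into the hot tier
    compress_on_demote: bool  # demoted into a lower-bit warm copy
    prefetch: str             # "reuse-driven", "route-driven", or "none"
    restore_cost: int         # relative cost of bringing it back after eviction

# A hypothetical policy table: each class gets its own treatment.
POLICIES = {
    DataClass.WEIGHTS:         ResidencyPolicy(True,  True,  "reuse-driven", 8),
    DataClass.KV_STATE:        ResidencyPolicy(False, True,  "reuse-driven", 5),
    DataClass.ACTIVATIONS:     ResidencyPolicy(False, False, "none",         2),
    DataClass.EXPERT_BLOCK:    ResidencyPolicy(False, True,  "route-driven", 7),
    DataClass.ROUTING_META:    ResidencyPolicy(True,  False, "none",         9),
    DataClass.OPTIMIZER_STATE: ResidencyPolicy(False, True,  "none",         4),
}
```

Whether such a table lives in a driver, a runtime, or configuration registers is an open design question; what matters is that the hardware can consult it instead of guessing.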
Generic Cache Replacement Is the Wrong Primitive
A conventional cache hierarchy is ultimately a guessing machine. It guesses what should stay. It guesses what can be evicted. It guesses what will be reused. Sometimes it guesses well. Sometimes it does not.
For many AI workloads, that is an avoidable weakness.
If the runtime or compiler knows that a weight block will be reused across many decode steps, the hierarchy should not be forced to rediscover that fact indirectly through recency signals. If a KV region is relevant but not urgent, the machine should be able to cool it gently instead of evicting it blindly. If an expert block has high routing likelihood, the processor should not wait until after the routing decision has already created a stall to begin restoring it.
In other words, AI-native memory wants contracts, not just hints.
| Model | How it works | Fit for AI workloads |
|---|---|---|
| Generic cache | Hardware infers reuse from access history | Weak: no semantic understanding of data class or restore cost |
| Software scratchpad only | Explicit programmer-managed locality | Narrow: too local in scope for whole-model state |
| AI-native residency hierarchy | Class-aware, policy-aware, runtime-visible movement and protection semantics | Strong: matches weights, KV, experts, and staged activation behavior |
What a New AI-Native Hierarchy Could Include
1. A Large Managed On-Die Hotset Fabric
This is where the “bigger cache” instinct belongs, but with a better framing. Build a larger on-die memory fabric whose job is not merely to be a bigger generic cache, but to act as a managed residency layer for the hottest semantically important state.
That layer might hold:
- hot weight shards,
- recent and high-value KV regions,
- high-probability expert blocks,
- critical routing state,
- low-latency activations and partial state.
2. Residency Contracts
Instead of only soft hints, the architecture could provide:
- wired residency for hot immutable data,
- soft-pinned regions for pressure-aware graceful degradation,
- generation-based invalidation for fast cleanup,
- class-specific protection modes.
This is much closer to how an AI system actually wants to reason about valuable state.
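A minimal sketch of what contract-style residency could look like from the runtime's point of view. The names (`wire`, `soft_pin`, `bump_generation`) are hypothetical stand-ins for whatever a real ISA or driver interface would expose:

```python
class ResidencyFabric:
    """Toy model of contract-style residency, not a real hardware interface."""

    def __init__(self):
        self.generation = 0
        self.regions = {}  # region_id -> (mode, generation)

    def wire(self, region_id):
        """Wired residency: not evictable while the contract holds.
        Intended for hot immutable data such as resident weight shards."""
        self.regions[region_id] = ("wired", self.generation)

    def soft_pin(self, region_id):
        """Soft pin: evictable under pressure, but demoted gracefully
        (e.g. into a compressed warm tier) before being dropped."""
        self.regions[region_id] = ("soft", self.generation)

    def bump_generation(self):
        """Generation-based invalidation: retire every contract from an
        older generation in one step, e.g. at a request boundary."""
        self.generation += 1
        self.regions = {r: (m, g) for r, (m, g) in self.regions.items()
                        if g >= self.generation}
```

The generation bump is the interesting part: cleanup of an entire request's worth of state becomes one counter increment instead of thousands of individual evictions.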
3. Compression-Native Intermediate Tiers
Not every tensor needs to live in full precision all the time. A more interesting hierarchy includes warm tiers that can hold:
- compressed weights,
- lower-bit versions of blocks likely to be re-promoted,
- prefetched but not yet fully expanded state,
- delta-encoded or sparse variants.
That makes the hierarchy richer than simply “hot or cold.”
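As one illustration of a warm tier, a block could be demoted to an int8 copy with a per-block scale and re-promoted approximately on demand. This is plain symmetric quantization, a generic sketch rather than a claim about any particular chip's compression engine:

```python
import numpy as np

def demote_to_int8(block: np.ndarray):
    """Demote a float32 block to a warm int8 copy with a per-block scale."""
    scale = float(np.abs(block).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale works
    warm = np.round(block / scale).astype(np.int8)
    return warm, scale

def promote(warm: np.ndarray, scale: float) -> np.ndarray:
    """Re-promote the warm copy to float32, approximately."""
    return warm.astype(np.float32) * scale

x = np.random.randn(256).astype(np.float32)
warm, scale = demote_to_int8(x)
restored = promote(warm, scale)
# restored approximates x at a quarter of the footprint;
# the quantization error is bounded by scale / 2
```

Lossy, but cheap to hold and cheap to re-promote: exactly the behavior a warm tier wants for blocks likely to be needed again.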
4. KV-Native Residency
KV cache is one of the most AI-specific opportunities. It is growing, structured, relevance-sensitive, and long-tailed. Treating it like generic memory misses too much of the opportunity.
An AI-native hierarchy could include:
- recent-window hot KV treatment,
- region-based promotion and demotion,
- relevance-aware tiering,
- recoverable warm states rather than cliffs.
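A toy model of region-based KV tiering, assuming a hot recency window and a recoverable warm tier instead of outright eviction. The window size and relevance scores are illustrative, not a real design:

```python
from collections import OrderedDict

class KVRegionTiering:
    """Toy region-based KV policy: a recency window stays hot; older
    regions demote to a recoverable warm tier instead of falling off a cliff."""

    def __init__(self, hot_window: int):
        self.hot_window = hot_window
        self.hot = OrderedDict()  # region_id -> relevance score
        self.warm = {}            # region_id -> relevance score

    def touch(self, region_id: str, relevance: float) -> None:
        """Mark a region hot, on append or attention reuse."""
        self.hot.pop(region_id, None)
        self.warm.pop(region_id, None)
        self.hot[region_id] = relevance
        if len(self.hot) > self.hot_window:
            # Demote the oldest hot region: warm and recoverable,
            # not evicted outright.
            oldest, rel = self.hot.popitem(last=False)
            self.warm[oldest] = rel

    def reuse(self, region_id: str) -> None:
        """Re-promote a warm region when it becomes relevant again."""
        if region_id in self.warm:
            self.touch(region_id, self.warm[region_id])
```

A real policy would demote by relevance as well as recency; even this minimal version shows the key property, which is that demotion is a state transition rather than a loss.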
5. Expert- and Routing-Aware Promotion
For MoE-style models, not all experts matter at all times. That means the memory system should not simply wait for demand. It should track:
- route probability,
- co-activation history,
- restore cost,
- importance under current workload state.
That is a controller problem, not a generic cache problem.
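A sketch of what such a controller could compute: rank non-resident experts by expected stall savings, approximated here as route probability times restore cost. The scoring function is a deliberate simplification, and the expert names and numbers are invented:

```python
def experts_to_promote(route_prob, restore_cost, resident, budget):
    """Rank non-resident experts by expected stall savings,
    approximated as route probability * restore cost."""
    candidates = [e for e in route_prob if e not in resident]
    ranked = sorted(candidates,
                    key=lambda e: route_prob[e] * restore_cost[e],
                    reverse=True)
    return ranked[:budget]

probs = {"e0": 0.60, "e1": 0.05, "e2": 0.30, "e3": 0.05}
costs = {"e0": 10, "e1": 10, "e2": 10, "e3": 40}  # cheap vs. costly restores
picked = experts_to_promote(probs, costs, resident={"e0"}, budget=2)
# picked -> ["e2", "e3"]: the likely expert, and the expensive-to-restore one
```

Note that `e3` beats `e1` despite an identical route probability, because its restore cost is higher: exactly the kind of judgment a recency-based cache cannot make.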
Why Hyperscalers Might Actually Want This
There is another important point here. If you are designing for hyperscalers, the optimization objective is not necessarily “minimize die area at all costs.” In many cases, the scarcer resource is power, not capex. If extra silicon can materially reduce bytes moved per token, or smooth expensive external bandwidth spikes, or improve throughput-per-watt, hyperscalers may absolutely spend the area.
That makes large on-die memory a stronger proposition than it might look in a traditional cost-sensitive client processor. The real question becomes: is larger generic cache the best use of that area, or is a more specialized AI-native residency fabric even better?
If capex can be spent but power is the real ceiling, then on-die memory that cuts repeated external movement can be an excellent trade—even if it increases die size.
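A back-of-envelope version of that trade, using rough, illustrative energy-per-byte magnitudes (not vendor figures) and a hypothetical 7B-parameter model at 8-bit weights:

```python
# pJ-per-byte figures are rough illustrative magnitudes, not vendor data.
HBM_PJ_PER_BYTE = 30.0   # off-chip HBM access, very roughly
SRAM_PJ_PER_BYTE = 1.0   # large on-die SRAM access, very roughly

def joules_per_token(bytes_moved: float, pj_per_byte: float) -> float:
    return bytes_moved * pj_per_byte * 1e-12

# Hypothetical 7B-parameter model at 8-bit weights: ~7e9 bytes per decode
# step if every weight is streamed from HBM for every token.
from_hbm = joules_per_token(7e9, HBM_PJ_PER_BYTE)    # ~0.21 J per token
from_sram = joules_per_token(7e9, SRAM_PJ_PER_BYTE)  # ~0.007 J per token
```

Even if the absolute numbers are off by a factor of a few, the ratio is the point: keeping hot state on die attacks the dominant per-token energy term directly.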
Training and Inference Are Not the Same Memory Problem
Inference is the cleaner target for this new hierarchy because weights are mostly read-only, reuse is clearer, and latency from repeated movement is easier to reason about.
Training is harder. Training introduces:
- larger activation footprints,
- gradient traffic,
- optimizer state,
- writes and updates,
- much more dynamic consistency pressure.
Still, the same broader principle applies. Activations, optimizer state, gradients, and parameter blocks should not all inherit one generic memory policy. The hierarchy should know what class of state it is handling and what restore or writeback behavior that class deserves.
So Should We Still Talk About L1, L2, and L3?
Yes and no.
Yes, because physical hierarchy still exists. There will still be different latency and bandwidth zones. There will still be something closest to compute, something somewhat farther, and something bulkier.
No, because the old labels can keep us trapped in old assumptions. They encourage us to think in terms of generic caches rather than data roles, semantics, and state classes.
The better way to talk about a new AI processor is:
Operand tier
↓
Hot tensor residency tier
↓
Warm / compressed tensor tier
↓
Bulk HBM tier
↓
Cold / staged tier
Those names are closer to what the machine is actually doing.
What This Means for Architecture
If AI is forcing a clean-sheet rethink, then the biggest architectural mistake would be to merely scale the old cache hierarchy upward without changing its semantics.
The stronger path is to build a larger on-die memory complex, but make it:
- class-aware,
- policy-aware,
- compression-aware,
- runtime-visible,
- and schedule-aware.
That is the difference between “more SRAM” and a new memory paradigm.
Why This Is the Real Opportunity
The most important point is that AI does not merely want more memory. It wants memory that understands what kind of state it is holding and why that state matters.
A generic cache hierarchy asks the hardware to infer too much. An AI-native hierarchy could be told much more directly:
- these are hot immutable weights,
- these are relevant KV regions,
- these experts are likely to be routed soon,
- these activations are latency-critical,
- these blocks can be demoted into compressed warm states,
- these regions must not be evicted lightly.
That is a qualitatively different system.
The future of AI chips probably will include more on-die memory. But the bigger opportunity is not just a larger generic cache pyramid.
It is a memory hierarchy that stops pretending all data is the same.
That is when AI architecture stops being “old cache design, but bigger” and becomes something genuinely new.