There is a certain kind of architectural debate that shows up over and over again in AI systems. One side says the answer is more compute. Another says the answer is more HBM. Another says quantization will fix it. Another says software should just schedule things better.
And then there is the quiet, stubborn view that keeps coming back: what if the simplest high-return lever is just more L2?
I think that instinct is much more serious than people often admit.
Larger L2 can absolutely be one of the best trades available in AI hardware, especially at hyperscaler scale where the binding resource is often power and bytes moved per token, not absolute willingness to spend more silicon. But the way to argue for bigger L2 properly is not to say “cache good” in a generic sense. The real case is subtler and much more interesting.
Bigger L2 usually helps not because it makes cache hits dramatically faster. It helps because it can reduce misses, reduce off-chip traffic, improve average access time, smooth bandwidth spikes, and lower the energy the system spends dragging the same state back through the hierarchy.
The best reason to build a bigger L2 is not that it makes hits magical. It is that it helps you avoid catastrophically slower misses.
Start with the Right Metric: Average Access Time, Not Raw Hit Latency
When people talk casually about cache performance, they often ask the wrong first question: “How much faster is the bigger cache?” That framing is misleading because a larger L2 does not usually buy its value by making the L2 itself much faster. In fact, a larger L2 can make hit latency a little worse because the array is larger, the wires are longer, the banking is more complex, and the tag/data path may get heavier.
The real metric is average memory access time:
Average access time = hit time + miss rate × miss penalty
That is the equation that matters.
If a bigger L2 increases hit time slightly but reduces the miss rate substantially, the average still improves. And because the miss penalty to HBM or DRAM can be so much larger than the internal cache-hit cost, even modest reductions in miss rate can deliver outsized system-level benefit.
A Simple Numerical Intuition
Suppose a baseline L2 has a 20-cycle hit latency and a 20% miss rate to a much slower tier whose effective penalty is 300 cycles. The average comes out like this:
Old average = 20 + 0.20 × 300 = 80 cycles
Now suppose a larger L2 is a little slower to hit—say 22 cycles—but drops miss rate to 8%:
New average = 22 + 0.08 × 300 = 46 cycles
In that example, the larger L2 is not better because the L2 itself became dramatically faster. It is better because the machine is much less often forced into the really expensive path.
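The arithmetic above can be sketched directly. The `amat` helper and the cycle counts are illustrative, not taken from any specific design:

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: hit cost plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# Baseline L2: 20-cycle hits, 20% of accesses miss to a 300-cycle tier.
old = amat(20, 0.20, 300)   # = 20 + 60 = 80 cycles

# Larger L2: slightly slower hits (22 cycles), but miss rate drops to 8%.
new = amat(22, 0.08, 300)   # = 22 + 24 = 46 cycles

print(old, new)
```

The larger cache wins on the average even though every individual hit got a little slower, which is exactly the point.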
That is the right way to think about bigger L2 in AI systems. Its value is often miss prevention, not hit heroics.
Why This Matters More in AI Than in Many Traditional Workloads
AI systems are unusually sensitive to the cost of moving bytes. Large inference workloads, and increasingly many training regimes, burn significant time and energy not on arithmetic itself but on getting the right data back to the right place. Repeated weight access, staged activation movement, KV reuse, and route-dependent expert loading all turn memory behavior into a first-order performance problem.
That is why larger L2 can matter so much. It can keep a larger effective hotset close enough to compute that the machine makes far fewer expensive trips outward.
This can show up as:
- fewer HBM accesses for repeatedly touched metadata or hot tensor regions,
- less pressure on shared bulk-memory bandwidth,
- fewer bursty refills,
- better smoothing of irregular or semi-structured reuse,
- more resilience to imperfect scheduling.
In other words, larger L2 can sometimes buy you a more forgiving machine.
What Bigger L2 Actually Buys You
Concretely: fewer expensive misses, less pressure on shared bandwidth, lower data-movement energy, and more stability under imperfect scheduling. Those are the real gains. Notice that none of them requires a dramatic reduction in L2 hit latency itself.
But Bigger L2 Does Not Solve Everything
Here is where the nuance matters. Bigger L2 is a strong lever. It is not a complete memory strategy.
Why not? Because many AI system failures are not caused by “L2 too small” in isolation. They are caused by a larger collection of issues:
- the working set still dwarfs any realistic on-die cache,
- replacement policy does not understand true reuse structure,
- the machine is paying for repeated repacking, refill, and staging across other boundaries,
- the data classes are too different for one generic policy,
- the real problem is tier orchestration, not just cache size.
So yes, bigger L2 can make the machine better. But no, it does not automatically solve repeated refill, poor residency semantics, or bandwidth amplification across the full hierarchy.
Larger L2 is one of the best hardware levers available. It is just not the same thing as a complete memory policy, and it cannot substitute for better semantics around weights, KV, activations, or expert state.
Where Bigger L2 Helps the Most
There are some situations where larger L2 is especially attractive.
1. Reuse That Is Real but Not Perfectly Predictable
If the workload has genuine reuse but the exact timing is not rigid enough for a fully scheduled scratchpad model, a larger L2 gives the hardware more room to catch value that would otherwise spill outward.
2. Bursty or Imperfectly Phased Access
Larger L2 can absorb bursts, phase transitions, and temporary locality mismatches better than a very tight cache. It can smooth the machine.
3. Shared Reuse Across More Compute Agents
A bigger shared L2 or system cache can help capture value that is not local to one core/SM/cluster alone. This matters for many accelerator-style workloads.
4. Systems Under Bandwidth Pressure
If HBM is already heavily contested, every miss avoided can matter twice: once in latency and once in the reduction of bandwidth congestion for everything else.
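The "matters twice" effect is easy to quantify. The access rate and line size below are assumed values for illustration; the point is that a miss-rate drop translates directly into reclaimed off-chip bandwidth for every other client of HBM:

```python
def offchip_traffic_gbps(accesses_per_sec: float, miss_rate: float,
                         line_bytes: int = 128) -> float:
    """Off-chip refill traffic generated by cache misses, in GB/s.
    Assumes every miss pulls one full line from the outer tier."""
    return accesses_per_sec * miss_rate * line_bytes / 1e9

# Assumed: 2e9 L2 accesses/s, 128-byte lines.
before = offchip_traffic_gbps(2e9, 0.20)  # 20% miss rate
after  = offchip_traffic_gbps(2e9, 0.08)  # 8% miss rate

# The difference is bandwidth handed back to everything else on the chip.
print(before, after, before - after)
```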
Where Bigger L2 Helps Less Than People Hope
Now the other side.
There are also situations where adding more L2 has diminishing or disappointing returns.
| Situation | Why bigger L2 may help less | What might matter more |
|---|---|---|
| Hotset still vastly exceeds cache | Miss-rate curve may flatten before the working set is meaningfully captured | Better tiering, compression, explicit residency |
| Access policy is wrong | A larger cache can still thrash if replacement semantics do not fit the workload | Class-aware residency or software-visible protection |
| Movement cost is elsewhere | The dominant tax may be repacking, conversion, gather, or other hierarchy boundaries | Dataflow redesign, better layout, compression engines |
| Data types are semantically different | One generic cache policy may be too weak even if the cache is large | AI-native class-aware hierarchy |
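One way to see why the miss-rate curve can flatten is the classic power-law cache model, in which miss rate falls roughly as capacity to a negative exponent. This is a rough empirical rule of thumb, not a property of any particular workload, and the anchor values below are assumptions:

```python
def powerlaw_miss_rate(capacity_mb: float, m0: float = 0.20,
                       c0: float = 8.0, alpha: float = 0.5) -> float:
    """Assumed power-law model: m(C) = m0 * (C / c0) ** -alpha,
    anchored so a hypothetical 8 MB cache misses 20% of the time."""
    return m0 * (capacity_mb / c0) ** -alpha

# Each doubling helps less in absolute terms under this model:
# 8 -> 16 MB removes roughly 5.9 points of miss rate,
# while 64 -> 128 MB removes only about 2.1 points.
for c in (8, 16, 32, 64, 128):
    print(f"{c:4d} MB: {powerlaw_miss_rate(c):.3f}")
```

When the hotset vastly exceeds the cache, you are buying capacity on the flat part of this curve, which is where the table's alternatives start to look better.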
The Die Area and Power Question Is Real
You cannot talk seriously about bigger L2 without talking about the price.
Every extra chunk of L2 burns:
- bitcell area,
- tag overhead,
- banking logic,
- routing resources,
- leakage power,
- dynamic read/write energy.
And just as importantly, that silicon could have been spent elsewhere:
- more compute,
- more near-compute SRAM,
- better on-die interconnect,
- compression and decompression hardware,
- better memory controllers,
- residency-control primitives,
- more specialized scheduling hardware.
So the right argument for bigger L2 is never “bigger L2 costs nothing.” It is “given the alternatives, bigger L2 might be the highest-return use of the next unit of silicon.”
Why Hyperscalers May Be the Natural Buyers
This is where the economics get interesting.
If you are building for hyperscalers, the constraint is often not simply die size in the abstract. It is a more complicated economic surface involving energy, throughput stability, rack-level power limits, and how many useful tokens or training steps the machine can produce per watt and per unit time.
In that world, spending more silicon to reduce external movement can be a very rational choice. If the result is a machine that delivers better sustained throughput under real load and burns less energy pulling the same state repeatedly from outer tiers, then a larger L2 can be a very high-ROI trade.
If capex can be spent but power is scarce, then more on-die memory can be one of the cleanest ways to buy efficiency.
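The energy side of that trade can be sketched the same way as latency. The per-access energy figures below are illustrative order-of-magnitude assumptions (on-die SRAM accesses cost far less than off-chip DRAM/HBM accesses), not measurements of any real part:

```python
def energy_per_access_pj(miss_rate: float,
                         sram_pj: float = 20.0,
                         dram_pj: float = 1300.0) -> float:
    """Expected energy per memory access, mixing on-die hits with
    off-chip misses. A miss pays the SRAM lookup plus the DRAM fetch.
    The pJ figures are assumed, order-of-magnitude values."""
    return (1 - miss_rate) * sram_pj + miss_rate * (sram_pj + dram_pj)

baseline = energy_per_access_pj(0.20)   # ~280 pJ per access
bigger   = energy_per_access_pj(0.08)   # ~124 pJ per access
print(baseline, bigger)
```

Under these assumptions the miss-rate drop cuts expected energy per access by more than half, which is exactly the kind of win that matters when power, not silicon, is the binding constraint.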
The Hidden Win: A Bigger L2 Can Make Imperfect Software Less Fragile
One underappreciated virtue of larger L2 is that it gives the rest of the stack more margin for error.
A perfectly staged, perfectly class-aware, perfectly scheduled memory system might not need as much generic cache safety net. But real systems are rarely perfect. Runtime heuristics are imperfect. Scheduling is imperfect. Relevance signals are noisy. Workload mixtures are messy.
A larger L2 can make the machine less brittle under those imperfections. That is not as elegant as a full AI-native memory hierarchy, but it is often extremely valuable in production.
So Is Bigger L2 a Strategy?
It is a lever. Sometimes a very strong one. But by itself, it is not the full strategy.
The balanced view is this:
Bigger L2 helps when it:
- cuts expensive misses,
- relieves bandwidth pressure,
- reduces movement energy,
- captures meaningful hotset structure,
- makes the machine more stable under imperfect scheduling.
But it stops being enough when:
- the working set still dwarfs it,
- the policy is wrong,
- the semantics of reuse are class-specific,
- the movement tax is dominated by other hierarchy boundaries.
That is a serious architectural position. It is neither anti-cache rhetoric nor naive cache maximalism.
Where This Leads
My own view is that larger L2 should be taken more seriously in AI chips than many people currently take it. The instinct that “more on-die memory helps” is often right.
But the endpoint is probably not just more L2. The endpoint is a richer hierarchy in which the system understands data classes better and can apply more meaningful residency semantics to them.
That means larger generic cache may be one important transitional step—or one important part of the final design—but the deeper future is likely to involve:
- managed hot tensor tiers,
- KV-specific residency behavior,
- expert-aware promotion,
- compressed warm tiers,
- runtime-visible residency controls,
- and better contracts than generic replacement alone can provide.
Final Thought
The best way to think about bigger L2 is not as a magic performance button. It is as a very powerful way to buy fewer misses, fewer expensive outward trips, fewer wasted joules, and a more stable machine.
That is already a lot.
And if AI is forcing us to redesign the hierarchy anyway, then bigger L2 is not the end of the story. It is the beginning of a much more interesting one.