Classical cache coherency made CPUs programmable and pleasant. AI machines live under a different regime: giant working sets, mostly read-heavy model state, throughput-first economics, and increasingly explicit data movement. In that world, strict coherency stops looking like a universal good and starts looking like an expensive default.
One of the most important shifts in AI infrastructure is easy to miss because it hides beneath the surface of memory diagrams and interconnect charts: the system is gradually moving from transparent, hardware-maintained coherence toward explicit, scheduled, software-visible movement.
That is not because coherence is bad in general. It is because the assumptions that made classical CPU-style coherency so valuable do not map cleanly onto modern AI workloads. The economics changed. The data sizes changed. The access patterns changed. The incentives changed.
This essay traces why, and what the winning architecture looks like instead.
CPU era: smallish working sets, unpredictable sharing, correctness via transparency; sharing happens through control planes, shared metadata, and limited-scope collaboration.
AI era: massive state, predictable layer flow, throughput dominates everything; sharing means large read-mostly tensors, replicated state, and bandwidth-sensitive inference loops.
Classical cache coherency exists to maintain the illusion that multiple processors, each with private caches, still share one consistent memory space. In practical terms, it answers a deceptively hard question: when one core writes to a cache line that other cores have cached, how does every core keep seeing a single, consistent view of memory?
Protocols such as MESI and its descendants solved this problem brilliantly for CPU systems. They let software pretend there is one shared memory image, while hardware manages invalidations, ownership transitions, and visibility rules underneath. The elegance was real: programmers got a simple mental model, and hardware ate the complexity.
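As a concrete illustration, here is a drastically simplified single-cache-line MESI sketch. Real protocols deal with concurrency, write-backs, and a snoop bus or directory; none of that appears here, only the state transitions that give software its "one shared memory image":

```python
# Toy MESI model: one cache line, N caches, write-invalidate semantics.
# states[i] is cache i's state for the line: "M", "E", "S", or "I".

def write(states, who):
    """A write invalidates every other copy; the writer holds Modified."""
    new = ["I"] * len(states)
    new[who] = "M"
    return new

def read(states, who):
    """A read miss downgrades any M/E holder to Shared; the reader gets
    Exclusive if it ends up as the only holder, Shared otherwise."""
    if states[who] != "I":
        return states                      # read hit: no protocol traffic
    new = ["S" if s != "I" else "I" for s in states]
    others_share = any(s == "S" for i, s in enumerate(new) if i != who)
    new[who] = "S" if others_share else "E"
    return new

s = ["I", "I", "I"]
s = read(s, 0)       # ['E', 'I', 'I']  exclusive clean copy
s = read(s, 1)       # ['S', 'S', 'I']  now shared between caches 0 and 1
s = write(s, 2)      # ['I', 'I', 'M']  both sharers invalidated
print(s)
```

The point of the sketch is the last transition: a single write forces protocol messages to every other holder, which is exactly the cost that grows with machine size.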
That tradeoff made sense because CPU workloads cared deeply about general-purpose programmability. Access patterns were irregular. Sharing was common. And the working sets, while meaningful, were not measured in the tens or hundreds of gigabytes that dominate large-model inference today.
MESI: Modified, Exclusive, Shared, Invalid. A four-state protocol that handles most CPU sharing patterns elegantly, and the birthplace of transparent shared memory.
Directory coherence: as core counts grew, bus-based snooping hit bandwidth limits. Directory protocols tracked ownership per cache line, enabling larger multi-socket systems.
NUMA: Non-Uniform Memory Access architectures exposed locality to software. Coherence was still guaranteed, but programmers were nudged toward placement-aware patterns.
When working sets hit hundreds of GB and writes are <5% of traffic, the coherence contract becomes a cost rather than a gift.
AI workloads are strange from a classical systems perspective because many of the things that make coherence attractive are either weakened or inverted.
Model weights alone can occupy tens or hundreds of gigabytes. KV caches grow with context length, batch size, and concurrency. Activations may be transient, but they can still be large. These are not tiny lines bouncing between a few L1 caches. They are industrial-scale memory objects.
A Llama-3 70B model in BF16 occupies roughly 140GB. At FP8, still ~70GB. A single inference KV cache for a 128K-token context at full batch can easily add another 40–80GB. No coherence protocol was designed with this in mind.
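These figures are easy to reproduce with back-of-envelope arithmetic. The sketch below uses Llama-3 70B's published shape (80 layers, 8 grouped-query KV heads, head dimension 128); the helper functions are illustrative, and gigabytes are decimal throughout:

```python
# Back-of-envelope memory math for large-model inference.

def weight_bytes(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # Factor of 2 for the K and V tensors at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GB = 1e9

# 70B parameters in BF16 (2 bytes/param) vs FP8 (1 byte/param).
print(f"weights BF16: {weight_bytes(70e9, 2) / GB:.0f} GB")   # 140 GB
print(f"weights FP8:  {weight_bytes(70e9, 1) / GB:.0f} GB")   # 70 GB

# Llama-3 70B: 80 layers, 8 KV heads (GQA), head_dim 128, BF16 cache.
per_seq = kv_cache_bytes(80, 8, 128, 128 * 1024, 1, 2)
print(f"KV cache, one 128K-token sequence: {per_seq / GB:.0f} GB")  # 43 GB
```

One 128K-token sequence already costs about 43 GB of KV cache; a batch of two lands in the 80 GB range quoted above.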
Inference tends to proceed layer by layer. Attention heads operate on known tensor regions. The system often knows what tensor or shard will be needed next with high confidence. That makes prefetching, staging, and partitioned ownership far more attractive than universal transparency — because you can schedule rather than react.
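A minimal sketch of what "schedule rather than react" looks like: a double-buffered loop that stages the next layer's weights while the current layer computes. The `fetch` and `compute` callables stand in for real DMA transfers and kernel launches; this is an illustration of the overlap pattern, not any particular runtime's API:

```python
# Double-buffered layer execution: prefetch layer i+1 while computing layer i.
from concurrent.futures import ThreadPoolExecutor

def run_layers(layers, fetch, compute):
    """fetch(layer) stages weights into fast memory; compute(layer, staged)
    runs the forward step. Because layer order is known in advance, the
    copy for layer i+1 overlaps the math for layer i."""
    with ThreadPoolExecutor(max_workers=1) as copier:
        staged = fetch(layers[0])                    # stage the first layer
        for i, layer in enumerate(layers):
            nxt = (copier.submit(fetch, layers[i + 1])
                   if i + 1 < len(layers) else None)
            compute(layer, staged)                   # overlaps with the copy
            staged = nxt.result() if nxt else None

# Toy usage: "fetching" builds a weight name, "compute" records the order.
order = []
run_layers([0, 1, 2], fetch=lambda l: f"w{l}",
           compute=lambda l, w: order.append((l, w)))
print(order)  # [(0, 'w0'), (1, 'w1'), (2, 'w2')]
```

Nothing here reacts to a miss; every transfer is issued because the schedule says the data is needed next.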
In AI inference, the question is not "can two arbitrary threads share a pointer-rich structure elegantly?" The question is "can I keep the machine fed, avoid stalls, and maximize tokens per second?" In that regime, hidden coherence traffic can be more dangerous than helpful — because it competes invisibly for the same interconnect bandwidth as useful tensor movement.
The CPU-era contract: make memory sharing easy and transparent; absorb complexity in hardware; optimize for irregular, unpredictable access; correctness first, then performance.
The AI-era contract: minimize unnecessary sharing in the first place; make movement explicit when it improves throughput; exploit layer-order predictability for prefetch; throughput first, then correctness by design.
Classical coherency asks: how do we keep every cached copy of memory consistent when any core might write at any moment? AI systems increasingly ask a different question: how do we keep the machine fed while touching as little shared, mutable state as possible?
That is the deeper inversion. Once weights are mostly read-only, KV state is naturally localized per request or shard, and runtime execution is structured around bounded working sets, the justification for broad, always-on coherence weakens.
This is not just a performance observation. It is a design philosophy shift. The question changes from "how do we synchronize?" to "how do we avoid needing to synchronize?" Once you orient engineering decisions around the second question, entirely different architectures become attractive.
Read-only data is the easiest kind of data to scale. You can replicate it, shard it, stream it, or stage it without paying the full machinery cost of invalidation-heavy writable sharing. If a tensor is immutable for the duration of inference, global coherence is solving a problem that barely exists. The MESI protocol's Modified and Exclusive states exist for writable data — if you never write, you never need them.
A sequence's KV state is typically associated with that sequence, that request, or that serving shard. It is not a universally writable global object that every compute element updates simultaneously. The fastest design is often one where ownership is clear and movement is deliberate. Prefix-sharing KV caches introduce some shared state, but even those can be managed with ownership boundaries rather than cache-line coherency.
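One way to make that ownership discipline concrete is a registry in which every request's KV state has exactly one owner, and moving it is an explicit, visible handoff. The `KVRegistry` name and API here are illustrative, not taken from any real serving stack:

```python
# Minimal sketch of ownership-based KV management: every request's KV
# state belongs to exactly one serving shard, and moving it is an
# explicit copy followed by an ownership handoff.

class KVRegistry:
    def __init__(self):
        self.owner = {}                      # request_id -> shard_id

    def allocate(self, request_id, shard_id):
        assert request_id not in self.owner, "single owner per request"
        self.owner[request_id] = shard_id

    def migrate(self, request_id, new_shard, copy_fn):
        # Movement is deliberate: one visible copy, then a handoff,
        # instead of invisible cache-line invalidations.
        copy_fn(self.owner[request_id], new_shard)
        self.owner[request_id] = new_shard

reg = KVRegistry()
reg.allocate("req-42", shard_id=0)
moves = []
reg.migrate("req-42", new_shard=3,
            copy_fn=lambda src, dst: moves.append((src, dst)))
print(reg.owner["req-42"], moves)  # 3 [(0, 3)]
```

The invariant doing the work is the single-owner assertion: if only one shard may ever write a request's KV blocks, there is nothing for a coherence protocol to arbitrate.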
Activations tend to be local, ephemeral, and heavily tied to the immediate execution phase. The lifetime of an activation tensor spans a forward pass, not a distributed system lifetime. Again, the most efficient pattern is often local scope plus explicit transfer, not global cache visibility.
It is tempting to think the alternative to coherence is chaos. It is not. The replacement is usually a mix of three disciplines:
Partitioned ownership. Each GPU or engine owns a shard, request batch, or tensor region. If ownership boundaries are clear, the system avoids expensive coherence chatter because cross-device writes are simply rarer. Tensor parallelism, pipeline parallelism, and expert parallelism in MoE models are all expressions of this design instinct — divide the work so that coordination is the exception, not the rule.
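Column-parallel tensor parallelism shows the pattern in miniature: each device owns a disjoint slice of the weight matrix, all compute is local, and the only cross-device step is one explicit reduction. A toy pure-Python version (no real framework API):

```python
# Column-parallel matrix-vector multiply: W is split by columns across
# "devices", each device multiplies its owned shard against its slice of
# x, and the partial results meet in a single explicit sum (an all-reduce).

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def column_parallel_matvec(W, x, n):
    cols = len(W[0]) // n
    partials = []
    for d in range(n):                                  # each "device"
        W_d = [row[d * cols:(d + 1) * cols] for row in W]   # owned shard
        x_d = x[d * cols:(d + 1) * cols]
        partials.append(matvec(W_d, x_d))               # purely local work
    # The one cross-device step: an elementwise sum of partial outputs.
    return [sum(p[i] for p in partials) for i in range(len(W))]

W = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 1, 1, 1]
assert column_parallel_matvec(W, x, n=2) == matvec(W, x)  # [10, 26]
```

Notice that no device ever writes to another device's shard, so there are no concurrent writes for a coherence protocol to referee; coordination is concentrated into one scheduled reduction.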
There is also a second-order security benefit here. If sensitive execution artifacts — such as reasoning traces or immutable audit logs — live in a non-coherent, append-only region with tightly scoped ownership, they become harder to mutate invisibly via ordinary shared-cache behavior. Reduced coherency can therefore improve not just performance, but auditability and tamper resistance.
Explicit movement. Instead of pretending every tier is one magical shared pool, the runtime issues copies, DMA transfers, prefetches, and evictions consciously. This looks less elegant from a classical CPU perspective, but it is often far more predictable at scale. When you issue an explicit DMA, you know the cost. When hardware coherence fires invisibly, you often don't — until you see the interconnect utilization chart.
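A sketch of what "you know the cost" can look like in practice: funnel every transfer through one narrow API that logs bytes moved and an estimated time on an assumed 64 GB/s link. The `StagingQueue` name, the bandwidth figure, and the tensor sizes are all illustrative:

```python
# Every transfer goes through one narrow API that records what moved and
# what it is expected to cost, so the interconnect budget shows up in a
# log instead of a mystery utilization chart.

class StagingQueue:
    LINK_GBPS = 64.0                          # assumed link bandwidth

    def __init__(self):
        self.log = []                         # (tensor, bytes, est_ms)

    def dma(self, tensor_name, nbytes):
        est_ms = nbytes / (self.LINK_GBPS * 1e9) * 1e3
        self.log.append((tensor_name, nbytes, est_ms))

q = StagingQueue()
q.dma("layer_17.weights", nbytes=1_750_000_000)   # a ~1.75 GB layer shard
name, nbytes, est_ms = q.log[0]
print(f"{name}: {nbytes / 1e9:.2f} GB, ~{est_ms:.1f} ms on the link")
```

The estimate will never be exact, but it is a number the scheduler can budget against before the transfer happens, which hidden coherence traffic never offers.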
Scoped coherence. Not everything should be non-coherent. Small control structures, scheduling metadata, queue state, and orchestration information often do benefit from coherence. The key is scope: coherent where sharing is real and state is small, non-coherent where data is huge and movement dominates.
This does not mean coherent fabrics are useless. In fact, the attraction is obvious. Technologies like CXL tempt architects with a beautiful dream: larger memory pools, simpler programming, and fewer explicit copy boundaries between devices and hosts.
But it is important to separate two different ideas that often get blurred together under the CXL umbrella.
CXL.mem is fundamentally about exposing larger or more flexible memory capacity to a requester. CXL.cache is about maintaining coherency semantics across that boundary. Those are related, but they are not identical benefits.
For AI systems, the attractive part may often be the former without the full burden of the latter: more capacity, more flexible pooling, and more staging options, without assuming that every large tensor should participate in a broadly coherent fabric all the time.
And the cost is not abstract. Coherence protocols consume real fabric bandwidth and real latency budget through snooping traffic, directory lookups, invalidation messages, acknowledgment traffic, and ownership churn. At scale, that chatter competes directly with useful tensor movement for the same interconnect budget.
To understand why coherency becomes expensive at AI scale, consider what happens to a standard MESI-like protocol as you move from 8 CPU cores to 512 GPU accelerators sharing tensor state: every write to a shared line must now invalidate, and collect acknowledgments from, two orders of magnitude more potential sharers, and the tracking state needed to find those sharers grows with the machine.
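A toy model makes the trend visible. Assume a naive write-invalidate protocol in which one write to a line cached by k agents costs an invalidation plus an acknowledgment per other sharer; real protocols are far smarter about filtering sharers, but the direction of the curve holds:

```python
# Toy scaling model for write-invalidate coherence traffic: each write to
# a widely shared line sends (k - 1) invalidations and collects (k - 1)
# acknowledgments, where k is the number of caching agents.

def invalidation_messages(sharers: int, writes: int) -> int:
    return writes * 2 * (sharers - 1)   # invalidate + ack per other sharer

for agents in (8, 64, 512):
    msgs = invalidation_messages(agents, writes=1_000)
    print(f"{agents:4d} agents: {msgs:>9,} protocol messages per 1K writes")
```

At 8 agents the chatter is 14,000 messages per thousand writes; at 512 agents it is over a million, all riding the same links that the tensors need.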
If GPU memory is a bounded execution working set rather than a permanent warehouse, then the case for broad coherence weakens even further.
Under a working-set view, only a narrow slice of the model needs to be resident in HBM at a given time. Upcoming layers can be prefetched. Old layers can be evicted. KV cache can remain local to the active serving context. The machine is not trying to keep a universal, globally writable picture of everything alive at once. It is trying to keep the right things hot at the right time.
That subtle change has enormous implications. It moves the system from coherence-first design toward schedule-first design. The runtime becomes responsible for data placement as a first-class concern, not a hardware afterthought.
I do not think future AI systems will be fully coherent, and I do not think they will be fully non-coherent either. The likely answer is a hybrid model in which coherence becomes a scoped tool rather than a universal assumption.
In that model, small control state such as scheduling metadata and queue structures stays coherent; bulk data such as weights, KV cache, and activations lives in non-coherent, explicitly owned regions; and the runtime schedules movement between tiers as a first-class operation.
That is the deeper philosophical shift. Memory consistency stops being purely a hardware guarantee and becomes partly a software-visible orchestration problem. The hardware still handles coherency where needed — but the programmer (or the compiler) now has to think about movement in a way that CPU programmers never had to.
Once you accept that full coherency is not the correct default, a lot of modern design decisions start making more sense: explicit device-to-device copies instead of transparent sharing, prefetch schedules built around known layer order, and serving stacks partitioned into clearly owned shards.
The machines are getting larger, but the winning abstraction is not necessarily "make the whole machine look like one giant coherent CPU." It may be the opposite: make ownership, placement, and movement more explicit so the system can scale without drowning in invisible coordination cost.
The CPU era taught us to love cache coherency because it made a hard problem disappear. AI machines are teaching us that sometimes making a problem disappear is too expensive.
For large-model systems, the path forward is likely not universal coherency but selective coherency plus explicit movement: coherent where sharing is small and meaningful, non-coherent where data is huge and structured, and orchestrated everywhere that bandwidth and timing determine performance.
The deeper shift is philosophical. The programming model for AI machines is not a faster CPU. It is closer to a distributed streaming system that happens to have very fast local memory. In that model, the right abstraction is not a shared address space everyone can write to freely. It is a network of ownership regions, connected by scheduled, explicit data flows.