
Why Cache Coherency Is the Wrong Default for AI Machines

Classical cache coherency made CPUs programmable and pleasant. AI machines live under a different regime: giant working sets, mostly read-heavy model state, throughput-first economics, and increasingly explicit data movement. In that world, strict coherency stops looking like a universal good and starts looking like an expensive default.


One of the most important shifts in AI infrastructure is easy to miss because it hides beneath the surface of memory diagrams and interconnect charts: the system is gradually moving from transparent, hardware-maintained coherence toward explicit, scheduled, software-visible movement.

That is not because coherence is bad in general. It is because the assumptions that made classical CPU-style coherency so valuable do not map cleanly onto modern AI workloads. The economics changed. The data sizes changed. The access patterns changed. The incentives changed.

This essay traces why, and what the winning architecture looks like instead.

The thesis in one glance

CPU world — smallish working sets, unpredictable sharing, correctness via transparency.

AI world — massive state, predictable layer flow, throughput dominates everything.

Coherence helps — control planes, shared metadata, limited-scope collaboration.

Coherence hurts — large read-mostly tensors, replicated state, bandwidth-sensitive inference loops.

Who this is for: System architects, GPU runtime engineers, ML infrastructure builders, and anyone wondering why ever-larger AI machines do not simply want "coherent shared memory everywhere."
~180GB — Llama-3 70B weight footprint; coherence must handle this or step aside.
~3.35TB/s — H100 HBM3 peak bandwidth; what snoop traffic competes against.
<5% — typical write ratio in pure inference (weights are read-only).
64–512 — the GPU scale range where coherence chatter becomes a serious bottleneck.

Part 01
What Cache Coherency Solved in the CPU Era

Classical cache coherency exists to maintain the illusion that multiple processors, each with private caches, still share one consistent memory space. In practical terms, it answers a deceptively hard question:

If one core changes a value, how do all the other cores avoid using stale copies of that value?

Protocols such as MESI and its descendants solved this problem brilliantly for CPU systems. They let software pretend there is one shared memory image, while hardware manages invalidations, ownership transitions, and visibility rules underneath. The elegance was real: programmers got a simple mental model, and hardware ate the complexity.

That tradeoff made sense because CPU workloads cared deeply about general-purpose programmability. Access patterns were irregular. Sharing was common. And the working sets, while meaningful, were not measured in the tens or hundreds of gigabytes that dominate large-model inference today.

Fig 1 — Classical CPU Coherency: Many Caches, One Illusion
[Diagram: three cores with private L1 caches sit above a shared L2/L3 cache with a snooping bus or directory, backed by shared DRAM presenting one coherent memory image. MESI state is tracked per cache line, and invalidations flow between caches. Hardware pays the complexity cost so software enjoys a simple shared-memory model; invalidation traffic stays bounded because working sets are tens of MB, not hundreds of GB.]
CPU-style coherence is fundamentally about preserving a convenient programming model in the face of private caches and shared writable state. It works because CPU working sets fit within the snooping budget.
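The protocol mechanics described above can be made concrete with a toy model. This is a sketch of MESI for a single cache line, not real hardware: it ignores races, write-backs, and bus arbitration, and the class name is illustrative.

```python
# Toy MESI state machine for one cache line, tracked per core.
# A sketch of the protocol described above; real hardware also
# handles races, write-backs, and bus arbitration.

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class MesiLine:
    def __init__(self, n_cores):
        self.state = [I] * n_cores  # per-core state for this line

    def read(self, core):
        if self.state[core] == I:
            # If any other core holds the line, everyone drops to Shared.
            others = [c for c, s in enumerate(self.state) if s != I]
            if others:
                for c in others:
                    self.state[c] = S
                self.state[core] = S
            else:
                self.state[core] = E  # sole reader gets Exclusive

    def write(self, core):
        # Writer gains Modified; every other copy is invalidated.
        invalidations = 0
        for c in range(len(self.state)):
            if c != core and self.state[c] != I:
                self.state[c] = I
                invalidations += 1
        self.state[core] = M
        return invalidations  # coherence messages generated by this write
```

Tracing a few operations shows the contract: a lone reader gets Exclusive, a second reader demotes both to Shared, and any write invalidates every other copy. That last step, the invalidation fan-out, is exactly what becomes expensive at accelerator scale.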

How coherence protocols evolved

1984 — MESI protocol. Modified / Exclusive / Shared / Invalid: a four-state protocol that handles most CPU sharing patterns elegantly. The birthplace of transparent shared memory.

~1990s — Directory-based coherency. As core counts grew, bus-based snooping hit bandwidth limits. Directory protocols tracked ownership per cache line, enabling larger multi-socket systems.

~2010s — NUMA-aware coherence. Non-Uniform Memory Access architectures exposed locality hints. Coherence was still guaranteed, but programmers were nudged toward placement-aware patterns.

Now — AI machines question the premise. When working sets hit hundreds of GB and writes are under 5% of traffic, the coherence contract becomes a cost rather than a gift.

Part 02
Why AI Machines Live in a Different Regime

AI workloads are strange from a classical systems perspective because many of the things that make coherence attractive are either weakened or inverted.

1. The data is enormous

Model weights alone can occupy tens or hundreds of gigabytes. KV caches grow with context length, batch size, and concurrency. Activations may be transient, but they can still be large. These are not tiny lines bouncing between a few L1 caches. They are industrial-scale memory objects.

A Llama-3 70B model in BF16 occupies roughly 140GB. At FP8, still ~70GB. A single inference KV cache for a 128K-token context at full batch can easily add another 40–80GB. No coherence protocol was designed with this in mind.
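The figures above can be checked with back-of-envelope arithmetic. The model shapes used here (80 layers, 8 KV heads under grouped-query attention, head dimension 128) follow the published Llama-3 70B configuration, but treat them as illustrative assumptions rather than authoritative:

```python
# Back-of-envelope memory math for the figures quoted above.
# Model shapes are assumed Llama-3 70B values: 80 layers,
# 8 KV heads (GQA), head_dim 128.

def weight_bytes(params, bytes_per_param):
    return params * bytes_per_param

def kv_cache_bytes(layers, kv_heads, head_dim, context_len,
                   bytes_per_elem, batch=1):
    # 2x for the K and V tensors, per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem * batch

GB = 1e9
print(weight_bytes(70e9, 2) / GB)   # BF16 weights: 140.0 GB
print(weight_bytes(70e9, 1) / GB)   # FP8 weights:   70.0 GB
print(kv_cache_bytes(80, 8, 128, 131_072, 2) / GB)  # ~42.9 GB, one 128K sequence
```

One 128K-token sequence already costs ~43GB of KV state in BF16; a handful of concurrent long-context requests lands squarely in the 40–80GB range quoted above.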

2. The access patterns are often predictable

Inference tends to proceed layer by layer. Attention heads operate on known tensor regions. The system often knows what tensor or shard will be needed next with high confidence. That makes prefetching, staging, and partitioned ownership far more attractive than universal transparency — because you can schedule rather than react.

3. The economics are throughput-first

In AI inference, the question is not "can two arbitrary threads share a pointer-rich structure elegantly?" The question is "can I keep the machine fed, avoid stalls, and maximize tokens per second?" In that regime, hidden coherence traffic can be more dangerous than helpful — because it competes invisibly for the same interconnect bandwidth as useful tensor movement.

CPU instinct

Make memory sharing easy and transparent.

Absorb complexity in hardware.

Optimize for irregular, unpredictable access.

Correctness-first, then performance.

AI instinct

Minimize unnecessary sharing in the first place.

Make movement explicit when it improves throughput.

Exploit layer-order predictability for prefetch.

Throughput-first, then correctness by design.

Fig 2 — AI Machine Memory Hierarchy: Explicit Staged Movement
[Diagram: a GPU die (SM clusters with L1/shared memory, a ~50MB L2, and 80GB of HBM3 at 3.35 TB/s holding the active working set of current layers plus KV cache) is fed by explicit DMA over PCIe/NVLink from 1–2TB of host DRAM staging upcoming layers and partial KV state, a CXL.mem expansion pool for capacity, and NVMe/object storage for cold weights and checkpoints. Movement discipline: HBM is the execution working set, not a warehouse; arrows are explicit, scheduled DMA rather than coherent snoop traffic; host DRAM is a staging buffer; cold storage is paged in by policy, not demand.]
Unlike the CPU hierarchy where caches are invisible to software, AI memory is increasingly a software-scheduled pipeline. The runtime decides what moves, when, and where — not hardware coherence logic.

Part 03
The Crucial Inversion

Classical coherency asks:

How do we keep multiple copies of data consistent automatically?

AI systems increasingly ask a different question:

Why do we have multiple writable copies of this data at all?

That is the deeper inversion. Once weights are mostly read-only, KV state is naturally localized per request or shard, and runtime execution is structured around bounded working sets, the justification for broad, always-on coherence weakens.

This is not just a performance observation. It is a design philosophy shift. The question changes from "how do we synchronize?" to "how do we avoid needing to synchronize?" Once you orient engineering decisions around the second question, entirely different architectures become attractive.

Key insight: Coherency protocols are fundamentally a synchronization mechanism designed to reconcile divergent state. If the design minimizes divergent state in the first place — through clear ownership and read-mostly data — coherency solves a problem that barely exists.

Part 04
Why Many AI Data Structures Don't Want Full Coherency

Weights are mostly read-only

Read-only data is the easiest kind of data to scale. You can replicate it, shard it, stream it, or stage it without paying the full machinery cost of invalidation-heavy writable sharing. If a tensor is immutable for the duration of inference, global coherence is solving a problem that barely exists. The MESI protocol's Modified and Exclusive states exist for writable data — if you never write, you never need them.

KV cache is often naturally partitioned

A sequence's KV state is typically associated with that sequence, that request, or that serving shard. It is not a universally writable global object that every compute element updates simultaneously. The fastest design is often one where ownership is clear and movement is deliberate. Prefix-sharing KV caches introduce some shared state, but even those can be managed with ownership boundaries rather than cache-line coherency.
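Ownership-based KV placement can be sketched as a routing table keyed by request rather than a coherent shared structure. Everything here is illustrative (the `KVShardRouter` name, the hash-based placement); the point is the invariant: writes are only legal on the owning shard, so divergent copies cannot arise.

```python
# Sketch: ownership-based KV cache placement. Each request's KV state
# is pinned to exactly one shard; cross-shard access is a routed,
# explicit transfer, never an implicit coherent read.

class KVShardRouter:
    def __init__(self, n_shards):
        self.n_shards = n_shards
        self.owner = {}  # request_id -> shard that owns its KV state

    def place(self, request_id):
        # Deterministic placement; a real router would balance by load.
        shard = hash(request_id) % self.n_shards
        self.owner[request_id] = shard
        return shard

    def append_kv(self, request_id, shard, token_kv):
        # Writes are only legal on the owning shard, so divergent
        # copies cannot arise and no coherence protocol is needed.
        if shard != self.owner[request_id]:
            raise PermissionError("KV writes must go through the owner shard")
        # ... append token_kv to the shard-local cache here ...

    def migrate(self, request_id, new_shard):
        # Ownership transfer is an explicit, scheduled event,
        # not a cache-line bounce.
        old = self.owner[request_id]
        self.owner[request_id] = new_shard
        return (old, new_shard)  # endpoints for an explicit DMA copy
```

Notice that migration is a named operation with visible endpoints, which is the design instinct this section describes: coordination as the exception, scheduled when it happens.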

Activations are short-lived execution artifacts

These tend to be local, ephemeral, and heavily tied to the immediate execution phase. The lifetime of an activation tensor spans a forward pass, not a distributed system lifetime. Again, the most efficient pattern is often local scope plus explicit transfer, not global cache visibility.

Big idea: if the data is immutable, append-oriented, or clearly partitioned, then strong system-wide coherency is often overkill. The runtime can win by preventing ambiguous sharing instead of paying to clean it up later.
Fig 3 — AI Data Structure Taxonomy: Coherency Need vs. Data Size
[Chart: coherency need plotted against data size, from KB to hundreds of GB. Small, frequently shared structures (control metadata, queue state) sit in the selective-coherency zone; per-request KV cache and transient activations fall into ownership-based and explicit-movement territory; audit logs and model weights (~70–180GB) occupy the non-coherent, staged zone.]
The larger and more read-heavy the data, the less coherency earns its cost. Control metadata (small, frequently read) belongs in coherent regions. Model weights (massive, read-only at inference) belong in staged, explicitly-moved pipelines.

Part 05
What Replaces "Coherency Everywhere"

It is tempting to think the alternative to coherence is chaos. It is not. The replacement is usually a mix of three disciplines:

Partitioned ownership

Each GPU or engine owns a shard, request batch, or tensor region. If ownership boundaries are clear, the system avoids expensive coherence chatter because cross-device writes are simply rarer. Tensor parallelism, pipeline parallelism, and expert parallelism in MoE models are all expressions of this design instinct — divide the work so that coordination is the exception, not the rule.

There is also a second-order security benefit here. If sensitive execution artifacts — such as reasoning traces or immutable audit logs — live in a non-coherent, append-only region with tightly scoped ownership, they become harder to mutate invisibly via ordinary shared-cache behavior. Reduced coherency can therefore improve not just performance, but auditability and tamper resistance.

Explicit movement

Instead of pretending every tier is one magical shared pool, the runtime issues copies, DMA transfers, prefetches, and evictions consciously. This looks less elegant from a classical CPU perspective, but it is often far more predictable at scale. When you issue an explicit DMA, you know the cost before the transfer starts. When hardware coherence fires invisibly, you often don't — until you see the interconnect utilization chart.
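That predictability claim can be made literal: because every transfer is software-issued, a runtime can account for it at issue time. A minimal sketch, with illustrative tier names and bandwidth numbers (roughly PCIe Gen5 x16 and NVLink-class links):

```python
# Sketch: a movement ledger for explicit transfers. Every copy is
# issued by software, so its cost is computable before it runs.
# Tier names and bandwidths are illustrative assumptions.

BANDWIDTH_GBPS = {           # rough per-link bandwidths, GB/s
    ("dram", "hbm"): 64,     # e.g. PCIe Gen5 x16 host-to-device
    ("hbm", "hbm"): 450,     # e.g. NVLink peer-to-peer copy
    ("nvme", "dram"): 14,    # e.g. NVMe staging read
}

class MovementLedger:
    def __init__(self):
        self.log = []

    def dma(self, src, dst, size_gb):
        # Estimated cost is known at issue time, not discovered later
        # on an interconnect utilization chart.
        est_ms = size_gb / BANDWIDTH_GBPS[(src, dst)] * 1000
        self.log.append((src, dst, size_gb, est_ms))
        return est_ms

    def total_bytes_moved_gb(self):
        return sum(size for _, _, size, _ in self.log)

ledger = MovementLedger()
ledger.dma("dram", "hbm", 3.5)   # stage the next layer's weights
ledger.dma("hbm", "hbm", 0.4)    # replicate a KV block to a peer
```

Nothing in this ledger exists for hidden coherence traffic, which is exactly the asymmetry the paragraph above describes.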

Selective coherency

Not everything should be non-coherent. Small control structures, scheduling metadata, queue state, and orchestration information often do benefit from coherence. The key is scope: coherent where sharing is real and state is small, non-coherent where data is huge and movement dominates.

| System element | Recommended policy | Rationale |
|---|---|---|
| Model weights | Non-coherent; staged, read-mostly | Huge footprint; invalidation-heavy sharing buys little. Replicate freely. |
| KV cache | Ownership-based; localized | Per-request or per-shard state benefits from clear placement. Prefix sharing needs ownership, not snooping. |
| Control metadata | Selective coherency | Small, frequently read, coordination-critical state fits coherence well. Keep it scoped. |
| Reasoning/audit logs | Append-only; tightly owned | Improves auditability and reduces silent-mutation risk. Non-coherent by design. |
| Transient activations | Local + explicit movement | Short-lived execution artifacts; no need for global visibility. Live and die within a pass. |
| Scheduler / dispatcher state | Coherent; scoped carefully | Low volume, high coordination value. This is what coherency was designed for. |
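The policies above can be expressed as a single decision function. The thresholds here are illustrative, not prescriptive; the point is the shape of the decision, in which coherence is granted by a data structure's properties rather than assumed globally:

```python
# Sketch: placement policy chosen per data structure.
# Thresholds are illustrative assumptions; the decision shape
# (coherence by properties, not by default) is the point.

def memory_policy(size_gb, write_ratio, shared_writers):
    """Pick a placement policy for one data structure."""
    if size_gb < 0.001 and shared_writers > 1:
        return "coherent"            # small, hot control/scheduler state
    if write_ratio < 0.05:
        return "staged-read-mostly"  # weights: replicate and stream freely
    if shared_writers <= 1:
        return "ownership-based"     # KV shards, activations, append-only logs
    return "explicit-movement"       # large multi-writer state: schedule it

assert memory_policy(180, 0.0, 0) == "staged-read-mostly"  # model weights
assert memory_policy(0.0001, 0.5, 8) == "coherent"         # queue state
assert memory_policy(40, 0.3, 1) == "ownership-based"      # per-request KV
```

A real runtime would add dimensions (latency sensitivity, sharing fan-out, lifetime), but even this toy version reproduces every row of the table.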

Part 06
CXL.mem vs CXL.cache — Why the Distinction Matters

This does not mean coherent fabrics are useless. In fact, the attraction is obvious. Technologies like CXL tempt architects with a beautiful dream: larger memory pools, simpler programming, and fewer explicit copy boundaries between devices and hosts.

But it is important to separate two different ideas that often get blurred together under the CXL umbrella.

CXL.mem is fundamentally about exposing larger or more flexible memory capacity to a requester. CXL.cache is about maintaining coherency semantics across that boundary. Those are related, but they are not identical benefits.

For AI systems, the attractive part may often be the former without the full burden of the latter: more capacity, more flexible pooling, and more staging options, without assuming that every large tensor should participate in a broadly coherent fabric all the time.

And the cost is not abstract. Coherence protocols consume real fabric bandwidth and real latency budget through snooping traffic, directory lookups, invalidation messages, acknowledgment traffic, and ownership churn. At scale, that chatter competes directly with useful tensor movement for the same interconnect budget.

Fig 4 — CXL Architecture: What AI Systems Actually Want
[Diagram: a host processor (CPU with caches, host DRAM, and a CXL root complex) connects over a CXL bus on PCIe 5.0+ to an AI accelerator (GPU/TPU/ASIC with its HBM working set and a CXL endpoint). CXL.mem (capacity, pooling, staging) is marked as what AI wants; CXL.cache (host coherency semantics) is marked "use carefully at scale." The recommended hybrid: CXL.mem for capacity, CXL.cache scoped to control and metadata only — bulk tensors non-coherent, small shared state coherent. The risk is accidentally trading a programmability win for a scalability loss.]
The winning design is likely hybrid: capacity expansion and pooling from CXL.mem, tightly scoped coherency for control state via CXL.cache, and explicit DMA pipelines for bulk tensor movement. Not universal coherence.

Part 07
MESI at Scale: Where the Protocol Breaks Down

To understand why coherency becomes expensive at AI scale, it helps to trace what happens to a standard MESI-like protocol as you move from 8 CPU cores to 512 GPU accelerators sharing tensor state.

Fig 5 — MESI State Machine and the AI Scaling Problem
[Diagram: the MESI state machine (Modified, Exclusive, Shared, Invalid, with its write, snoop-read, and invalidate transitions) next to modeled coherence overhead at scale: roughly 7 snoop messages per write at 8 cores, 31 at 32 cores, 127 at 128 GPUs, and 511 at 512 GPUs. At 512 GPUs under full broadcast coherency, the modeled fabric traffic is 40% invalidation broadcasts, 30% snoop requests, 24% ownership transfers, and only 6% useful tensor data. For read-only weights, only S-to-I transitions ever fire: coherency machinery for a problem that doesn't exist.]
In a fully coherent N-device system, each write generates O(N) messages. For CPU clusters of 32–64 cores, this is manageable. For 512-GPU inference clusters with 180GB weight matrices, the snoop fan-out becomes a dominant bandwidth consumer — leaving little headroom for actual tensor movement.
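The O(N) fan-out is simple enough to compute directly. The message size and traffic model below are illustrative assumptions (a broadcast-invalidate protocol with fixed-size control messages), matching the modeled numbers in the figure rather than any specific fabric:

```python
# Sketch: broadcast-invalidate cost as device count grows.
# Assumes fixed-size control messages and full broadcast on write;
# directory protocols reduce broadcasts but still pay tracking
# state and ownership-transfer messages.

def snoop_messages_per_write(n_devices):
    # Every write to a shared line must reach all other potential holders.
    return n_devices - 1

def useful_fraction(n_devices, writes_per_sec, payload_bytes, msg_bytes=64):
    # Fraction of fabric bytes that are payload rather than protocol chatter.
    chatter = writes_per_sec * snoop_messages_per_write(n_devices) * msg_bytes
    payload = writes_per_sec * payload_bytes
    return payload / (payload + chatter)

for n in (8, 32, 128, 512):
    print(n, snoop_messages_per_write(n))
# fan-out grows linearly: 7, 31, 127, 511 messages per write
```

The direction, not the exact percentages, is what matters: chatter grows with device count while payload does not, so the useful fraction of the fabric collapses as the machine scales.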

Part 08
The Working-Set View of AI Memory

If GPU memory is a bounded execution working set rather than a permanent warehouse, then the case for broad coherence weakens even further.

Under a working-set view, only a narrow slice of the model needs to be resident in HBM at a given time. Upcoming layers can be prefetched. Old layers can be evicted. KV cache can remain local to the active serving context. The machine is not trying to keep a universal, globally writable picture of everything alive at once. It is trying to keep the right things hot at the right time.

In a working-set architecture, the first question is not "is everything coherent?" It is "is the right data resident exactly when compute needs it?"

That subtle change has enormous implications. It moves the system from coherence-first design toward schedule-first design. The runtime becomes responsible for data placement as a first-class concern, not a hardware afterthought.

Fig 6 — Working Set View: Temporal Locality in Layer-by-Layer Inference
[Diagram: a layer-by-layer inference timeline across HBM, DRAM, and NVMe tiers. At each step only the active layer (plus KV cache) is HBM-resident; the next layer is being prefetched from DRAM via DMA; finished layers are evicted; deeper layers stay cold on NVMe. Only the active layer and its prefetch successor occupy fast memory at any moment, with no coherence contract needed.]
The working-set architecture treats HBM as a fast execution scratchpad, not a persistent coherent store. Layers are prefetched from DRAM/NVMe into HBM just-in-time, and evicted after use. This is schedule-first design — and it needs explicit movement, not coherency protocols.
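The just-in-time pipeline can be sketched as double-buffered layer streaming. This is schedule-first design in miniature: `fetch`, `compute`, and `evict` are stand-ins for real DMA issues and kernel launches, and a real runtime would overlap them on asynchronous streams rather than run them sequentially.

```python
# Sketch: double-buffered, schedule-first layer streaming. While
# layer i computes, layer i+1 is staged into the spare HBM slot.
# Explicit movement driven by known layer order; no coherence
# protocol involved. Callbacks are stand-ins for DMA and kernels.

def run_inference(n_layers, fetch, compute, evict):
    slots = [None, None]            # two HBM-resident layer slots
    slots[0] = fetch(0)             # warm up: stage the first layer
    for i in range(n_layers):
        if i + 1 < n_layers:
            slots[(i + 1) % 2] = fetch(i + 1)  # prefetch the next layer
        compute(slots[i % 2])                  # run the resident layer
        evict(i)                               # its slot is now reusable

# Trace the schedule with stub callbacks:
trace = []
run_inference(
    4,
    fetch=lambda i: trace.append(("fetch", i)) or i,
    compute=lambda layer: trace.append(("compute", layer)),
    evict=lambda i: trace.append(("evict", i)),
)
```

The trace shows the key property: the prefetch of layer i+1 is issued before layer i computes, so transfer latency hides behind compute. Nothing reacts to a miss; everything is scheduled ahead of need.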

Part 09
My View: Coherence Is Being Demoted, Not Eliminated

I do not think future AI systems will be fully coherent, and I do not think they will be fully non-coherent either. The likely answer is a hybrid model in which coherence becomes a scoped tool rather than a universal assumption.

In that model:

Bulk tensors (weights, KV shards, transient activations) live in ownership regions and move through explicit, scheduled transfers. Small, coordination-critical state (schedulers, queues, control metadata) stays coherent, with tightly bounded scope. The boundary between the two becomes a deliberate, per-data-structure design decision rather than a global hardware default.
That is the deeper philosophical shift. Memory consistency stops being purely a hardware guarantee and becomes partly a software-visible orchestration problem. The hardware still handles coherency where needed — but the programmer (or the compiler) now has to think about movement in a way that CPU programmers never had to.

Part 10
Why This Matters for Next-Generation AI Infrastructure

Once you accept that full coherency is not the correct default, a lot of modern design decisions start making more sense: why runtimes treat prefetch, placement, and eviction as first-class scheduling problems; why tensor, pipeline, and expert parallelism are really ownership-partitioning strategies; why CXL deployments lean on CXL.mem capacity long before they trust CXL.cache semantics at scale; and why interconnect budgets are guarded against protocol chatter as jealously as against wasted compute.

The machines are getting larger, but the winning abstraction is not necessarily "make the whole machine look like one giant coherent CPU." It may be the opposite: make ownership, placement, and movement more explicit so the system can scale without drowning in invisible coordination cost.

Conclusion
The Demotion of Coherency

The CPU era taught us to love cache coherency because it made a hard problem disappear. AI machines are teaching us that sometimes making a problem disappear is too expensive.

For large-model systems, the path forward is likely not universal coherency but selective coherency plus explicit movement: coherent where sharing is small and meaningful, non-coherent where data is huge and structured, and orchestrated everywhere that bandwidth and timing determine performance.

The deeper shift is philosophical. The programming model for AI machines is not a faster CPU. It is closer to a distributed streaming system that happens to have very fast local memory. In that model, the right abstraction is not a shared address space everyone can write to freely. It is a network of ownership regions, connected by scheduled, explicit data flows.

Core takeaway: Cache coherency is not being abandoned because it failed. It is being narrowed because AI machines care more about predictable data movement than universal transparency. The hardware engineers figured this out first. Now the software stack is catching up.