Memory Systems · Coherence · Agentic AI

Coherent Fabrics: The Memory Highways Behind Agentic AI

The next bottleneck in AI is not raw GPU compute. It is how CPUs, GPUs, accelerators, NICs, and memory pools coordinate around gigantic, fast-moving state. Coherent fabrics are the hardware idea that tries to make that coordination feel like one shared memory system.

128k+ tokens in modern long-context windows
~256 GB KV cache for one 128k session on a large model
~28 GB/s PCIe 4.0 x16 peak, the copy bottleneck
~900 GB/s NVLink 4.0 GPU-to-GPU bandwidth
<1 μs CXL coherent memory access target latency

Why coherent fabrics matter now

AI infrastructure is undergoing a fundamental shift — from compute-centric to memory-centric. The GPU is increasingly the thing that waits, not the thing that bottlenecks.

For years the dominant AI story was linear: more GPUs, more FLOPS, bigger models, faster training. That equation still holds for raw throughput, but it breaks down the moment you zoom out to the full inference loop. Once models grow large, context windows lengthen to hundreds of thousands of tokens, and agentic systems start calling tools, reading codebases, retrieving documents, and managing multi-step workflows — the bottleneck quietly migrates.

The central shift

The question is no longer only "how fast can the GPU multiply matrices?" It is: where is the data, who owns it, how many copies exist, and how quickly can the next processor see the latest version?

That is exactly the territory of coherent fabrics. They sit underneath CPUs, GPUs, memory expansion devices, SmartNICs, and accelerators, trying to turn fragmented memory islands into something that behaves like a single coordinated system.

[Figure: memory bottleneck shift. Timeline from 2017 (transformers) through 2020 (GPT-3, 175B params), 2022 (RLHF, ChatGPT), and 2024 (long context, agents), with the dominant bottleneck shifting from compute to memory.]
The inflection point is roughly 2024, when long-context serving and agentic workflows made memory orchestration the dominant cost.
# Old bottleneck
GPU compute → buy more H100s → solved

# New bottleneck
memory movement + synchronization + orchestration
→ buying more GPUs often makes it worse
  (more devices = more copies = more coherence traffic)

First: what is a "fabric"?

In hardware systems, a fabric is the communication network connecting compute and memory components. It is the highway system between chips, sockets, accelerators, and racks. Not a single bus — a structured mesh of links, switches, and protocols.

Compute (processors): CPUs, GPUs, DPUs, FPGAs, ASICs, custom AI accelerators. Anything that reads and writes memory to do work.

Memory (storage layers): DRAM, HBM, CXL memory expanders, persistent memory, NVMe SSDs, memory pools attached over the fabric.

I/O (movement engines): SmartNICs, DMA engines, PCIe switches, NVMe controllers. Things whose job is moving data efficiently.

Examples of fabrics include PCIe, NVLink/NVSwitch, AMD Infinity Fabric, and CXL-based fabrics. The critical distinction is what each fabric carries. Some are pure transport — they move bytes. Others carry memory semantics: ownership, visibility, validity. A coherent fabric participates in memory consistency, not just packet delivery.

[Figure: fabric topology. A coherent fabric (ownership, consistency, transport) at the center, connected by bidirectional high-bandwidth links to a CPU with system RAM, a GPU with HBM/VRAM, a SmartNIC with DMA engines, and a CXL-attached memory pool.]
A coherent fabric is not just a cable — it is a structured interconnect plus a memory-ownership protocol shared by all attached devices.

Then: what does "coherent" mean?

Coherence is about making sure that when multiple processors cache the same memory location, they do not silently disagree about what the current value is.

Every modern CPU caches memory in L1/L2/L3 to avoid slow DRAM accesses. When you have one CPU, there's no conflict. When you have two CPUs sharing RAM, both can cache the same address simultaneously — and one can modify it without the other knowing. Same problem, amplified, when GPUs and accelerators enter the picture.

[Figure: the cache coherence problem. The CPU cache holds X=5 (modified, never flushed), the GPU cache holds a stale X=3, and main memory also holds X=3. Hardware cannot tell who is correct; without coherence the programmer must manually flush, invalidate, copy, and barrier.]
The coherence problem. CPU writes X=5 into its cache. GPU still holds the old value X=3. Without protocol enforcement, both believe they are correct.
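
To make the picture concrete, here is a toy simulation of the stale-copy scenario in the figure. The DeviceCache struct is purely illustrative: real caches are hardware structures, not fields in a program.

// Toy simulation of the situation in the figure. The DeviceCache struct is
// purely illustrative; real caches are hardware structures, not program fields.
#include <cstdio>

struct DeviceCache {
    const char* name;
    int x;  // this device's locally cached copy of address X
};

int main() {
    int main_memory_x = 3;
    DeviceCache cpu{"CPU", main_memory_x};  // both devices read X = 3 initially
    DeviceCache gpu{"GPU", main_memory_x};

    cpu.x = 5;  // CPU writes X = 5 into its cache and never flushes it

    // Nothing told the GPU to invalidate, so it still trusts its stale copy.
    printf("%s sees X=%d, %s sees X=%d, memory holds X=%d\n",
           cpu.name, cpu.x, gpu.name, gpu.x, main_memory_x);  // 5, 3, 3
    return 0;
}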

Modern processors use cache coherence protocols to track the state of every cache line across all devices. The famous family of protocols is MESI and its variants. Before getting to those, note the key insight:

Coherence = hardware-assisted agreement about the latest memory state

# Without it:
programmer must:  flush → copy → barrier → invalidate → synchronize
# every handoff between devices is a bespoke software dance

# With it:
hardware tracks who has which copy
# and automatically invalidates or updates stale entries

The MESI protocol: how hardware tracks ownership

MESI is the dominant family of cache coherence protocols. The name is an acronym for the four states a cache line can be in. Understanding MESI gives you the mental model for every coherent fabric — even CXL and NVLink extensions ultimately build on these ideas.

State M (Modified): this cache has the only valid copy. It has been written. Main memory is stale. This cache owns write-back responsibility.

State E (Exclusive): this cache has the only copy. It matches main memory. No other cache holds it. Can be silently promoted to M on write.

State S (Shared): multiple caches hold clean copies. All match memory. Any write must broadcast an invalidation to all sharers first.

State I (Invalid): this cache line is stale or evicted. Any access triggers a cache miss and a fetch from another cache or memory.

[Figure: MESI state machine. Transitions between Modified, Exclusive, Shared, and Invalid with labeled conditions: silent local writes, write-backs when other caches read, invalidation broadcasts, and read misses. Solid edges are common fast paths; dashed edges are cross-cache misses requiring a bus or fabric transaction.]
The MESI state machine. Every cache line in every processor is always in exactly one of these four states. Coherence protocols are the rules governing transitions.
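
The transition rules are compact enough to sketch in code. The following is a simplified model of MESI from a single cache's point of view; a real protocol adds write-back messages, acknowledgements, and handling for racing requests.

// A simplified model of MESI from one cache's point of view.
#include <cstdio>

enum class MesiState { Modified, Exclusive, Shared, Invalid };

enum class Event {
    LocalRead, LocalWrite,    // this cache's own core issues the access
    RemoteRead, RemoteWrite   // another cache's request arrives via snoop or directory
};

MesiState next_state(MesiState s, Event e) {
    switch (s) {
        case MesiState::Invalid:
            if (e == Event::LocalRead)  return MesiState::Shared;    // Exclusive if no other sharer exists
            if (e == Event::LocalWrite) return MesiState::Modified;  // after invalidating all sharers
            return s;
        case MesiState::Shared:
            if (e == Event::LocalWrite)  return MesiState::Modified; // invalidation broadcast first
            if (e == Event::RemoteWrite) return MesiState::Invalid;  // someone else took ownership
            return s;
        case MesiState::Exclusive:
            if (e == Event::LocalWrite)  return MesiState::Modified; // silent upgrade, no bus traffic
            if (e == Event::RemoteRead)  return MesiState::Shared;
            if (e == Event::RemoteWrite) return MesiState::Invalid;
            return s;
        case MesiState::Modified:
            if (e == Event::RemoteRead)  return MesiState::Shared;   // after write-back or forward
            if (e == Event::RemoteWrite) return MesiState::Invalid;  // after forwarding the dirty line
            return s;
    }
    return s;
}

int main() {
    MesiState s = MesiState::Invalid;
    s = next_state(s, Event::LocalRead);   // I -> S
    s = next_state(s, Event::LocalWrite);  // S -> M
    printf("state index: %d\n", static_cast<int>(s));  // 0, i.e. Modified
    return 0;
}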

MOESI adds a fifth state — Owned — allowing a cache to satisfy another cache's read request directly (cache-to-cache transfer) without going to main memory first. This matters enormously for multi-GPU systems where you want GPUs to share data without bouncing through CPU memory.

Why this matters for AI

When two GPUs share a KV cache prefix, the coherence protocol determines whether they can read it simultaneously (Shared state), who must invalidate when one writes (I transition), and whether data can move GPU-to-GPU (MOESI Owned) or must round-trip through CPU RAM. The fabric and protocol together determine the actual bandwidth and latency the AI system sees.

Non-coherent vs coherent systems

The clearest way to understand coherent fabrics is to see what changes when you remove the coherence guarantee.

Non-coherent: separate memory islands

# programmer manages everything
CPU RAM → copy → GPU VRAM
GPU writes result
GPU VRAM → copy → CPU RAM
CPU uses result

# plus manually:
flush · barrier · invalidate

Works, but every handoff is explicit programmer work. Two copies always exist. Latency is additive. Errors are silent.

Coherent: shared address space

# hardware manages consistency
CPU and GPU access same address
Fabric tracks ownership
Stale copies auto-invalidated

# programmer writes:
ptr[i] = value   // just works

Reduces copies and synchronization overhead. Makes heterogeneous programming feel more natural. But not free — coherence traffic has cost.

Dimension          | Non-coherent                            | Coherent fabric
Data sharing       | Explicit copies required                | Shared memory semantics possible
Synchronization    | Software: flush / invalidate / barrier  | Hardware-assisted visibility
Copy count         | O(n) copies for n devices               | Hardware can share a single copy
Programming model  | Manual memory management (cudaMemcpy)   | Unified address space possible
Latency profile    | Predictable but high (copy time)        | Low average, occasional coherence stalls
Scaling challenge  | Copy overhead grows with device count   | Coherence traffic grows with sharer count
Best for           | Bulk, one-directional data movement     | Fine-grained shared mutable state
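
As a concrete illustration of the two programming models in the table, here is a hedged sketch using the CUDA runtime API from host C++. Kernel launches and error handling are elided, and note that managed memory gives a unified address space whose sharing may be hardware-coherent or emulated by page migration, depending on the platform.

// Sketch of both programming models. The functions only illustrate where
// copies do or do not appear; kernel launches and error handling are elided.
#include <cuda_runtime.h>
#include <vector>

void non_coherent_path(std::vector<float>& host_data) {
    float* device_buf = nullptr;
    size_t bytes = host_data.size() * sizeof(float);
    cudaMalloc(&device_buf, bytes);                       // separate GPU allocation
    cudaMemcpy(device_buf, host_data.data(), bytes,
               cudaMemcpyHostToDevice);                   // explicit copy in
    // ... launch kernel on device_buf ...
    cudaMemcpy(host_data.data(), device_buf, bytes,
               cudaMemcpyDeviceToHost);                   // explicit copy back
    cudaFree(device_buf);
}

void unified_path(size_t n) {
    float* shared = nullptr;
    cudaMallocManaged(&shared, n * sizeof(float));        // one pointer, visible to CPU and GPU
    shared[0] = 1.0f;                                     // CPU writes directly
    // ... launch kernel that reads and writes `shared` ...
    cudaDeviceSynchronize();                              // then the CPU reads the same pointer
    cudaFree(shared);
}

int main() {
    std::vector<float> host(1 << 20, 0.0f);
    non_coherent_path(host);
    unified_path(1 << 20);
    return 0;
}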

Why AI systems care so much

AI used to be mostly about weights. Agentic AI is increasingly about state.

Model weights are large but relatively static during inference — they're loaded once and read repeatedly. The dynamic picture is far messier. Every step of an agentic loop creates and consumes state: prompts, retrieved documents, embeddings, tool outputs, intermediate plans, logs, and most importantly — KV cache.

In transformer inference, the KV cache stores key/value tensors for every token in the current context. Subsequent tokens attend over these without recomputing. With a 128k-token context on a large model, the KV cache can be 200–400 GB per session — easily larger than the weights themselves.
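
A quick back-of-envelope check of that range, assuming a 70B-class model (80 layers, 8192-wide hidden state), FP16 KV entries, and full multi-head attention; grouped-query attention or KV quantization would shrink these numbers substantially.

// Back-of-envelope KV cache sizing under the assumptions stated above.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t layers     = 80;
    const uint64_t hidden_dim = 8192;     // width of K (or V) per token per layer
    const uint64_t bytes_elem = 2;        // FP16
    const uint64_t ctx_tokens = 128'000;  // 128k-token context

    // 2x for the K and V tensors
    const uint64_t bytes = 2 * layers * ctx_tokens * hidden_dim * bytes_elem;
    printf("KV cache: %.1f GB for one session\n", bytes / 1e9);  // ~335 GB
    return 0;
}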

[Figure: KV cache size vs context length. KV cache size (70B model, FP16) grows from near zero at 4k tokens to roughly 400 GB at 128k tokens, crossing the 80 GB H100 VRAM line at roughly 48k tokens.]
KV cache grows roughly linearly with context length (per layer × per token). At 128k tokens on a 70B model, you're looking at 200–400 GB — far beyond a single GPU's VRAM.

Imagine two coding agents both working with the same 100k-token repository context. A naive system holds two full copies of the KV cache. A coherent-fabric-aware system routes both agents to the same memory, shares the read-only prefix, and only diverges state at the point where the agents differ. The difference is holding that context twice versus once.

Once you have prefix sharing, you also need prefix routing — a scheduler that knows where each KV prefix lives and steers new requests toward it rather than migrating gigabytes of state. That is why coherent fabrics and KV routing are deeply coupled problems.

# Scale the problem
1 agent, 128k ctx  →  ~256 GB KV
100 agents, shared 100k prefix  →  ~25.6 TB naive  vs  ~256 GB + 100 × delta

# The fabric makes sharing possible
# The scheduler makes sharing practical

Real-world fabrics and their tradeoffs

No single fabric solves every layer of the problem. Different vendors attack different segments of the memory hierarchy.

CXL — Compute Express Link

CXL is an open cache-coherent interconnect built on the PCIe physical layer, backed by Intel, AMD, NVIDIA, Samsung, and much of the rest of the industry. It defines three sub-protocols that can be mixed:

CXL.io (device I/O): non-coherent, similar to standard PCIe. Used for legacy-compatible device access.

CXL.cache (device caching): lets an accelerator coherently cache host memory. The device participates in the CPU's coherence domain.

CXL.mem (memory expansion): host-coherent access to device-attached memory. Enables memory pooling and tiering at rack scale.

[Figure: CXL memory pooling. CPU sockets (512 GB DRAM each) and GPU accelerators (80 GB HBM each) attach through a CXL switch to a shared 2–8 TB CXL memory pool; effective total visible memory is 1 TB DRAM + 160 GB HBM + the 2–8 TB pool.]
CXL's killer feature is memory pooling — multiple hosts and accelerators sharing one coherent memory tier, addressable by all without software copies.
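
On Linux, a CXL.mem expander typically shows up as a CPU-less NUMA node, so ordinary NUMA placement APIs can target it. A minimal sketch of placing a shared prefix buffer on that tier with libnuma follows; the node id is an assumption and is system dependent.

// Hedged sketch: the node id used here (2) is an assumption; discover the real
// topology with numactl -H. Build with: g++ cxl_prefix.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy not supported on this system\n");
        return 1;
    }

    const int    cxl_node     = 2;           // assumed id of the CXL memory tier
    const size_t prefix_bytes = 1ull << 30;  // a 1 GiB slice of a shared KV prefix

    // Place the buffer's pages on the CXL-backed node; any CPU, and any device
    // with coherent access to host memory, can then read it in place.
    void* prefix = numa_alloc_onnode(prefix_bytes, cxl_node);
    if (prefix == nullptr) {
        perror("numa_alloc_onnode");
        return 1;
    }

    // ... fill with the shared prefix and hand the pointer to the serving stack ...

    numa_free(prefix, prefix_bytes);
    return 0;
}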

NVLink / NVSwitch — NVIDIA's GPU fabric

NVLink is NVIDIA's proprietary high-bandwidth interconnect between GPUs (and between GPU and CPU on Grace Hopper). NVSwitch is the fabric chip that connects GPUs in a full all-to-all topology; NVLink Switch domains scale to 256 GPUs in the Hopper generation and to 576 in the Blackwell generation. The bandwidth numbers are staggering:

900 GB/s NVLink 4.0 bidirectional bandwidth per GPU
28 GB/s PCIe 4.0 x16 peak, roughly 32× slower
3.6 TB/s bisection bandwidth within an 8-GPU NVSwitch node
256 max GPUs in a Hopper-generation NVLink Switch domain (576 with Blackwell)
[Figure: NVLink vs PCIe bandwidth. Bar chart comparing NVLink 4.0 at 900 GB/s against PCIe 4.0 x16 at 28 GB/s, a roughly 32× gap; intra-server GPU communication is a different class of problem.]
The NVLink/PCIe bandwidth gap is why KV cache transfers that are "free" within an NVLink domain become painful the moment they cross a PCIe or network boundary.
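
The arithmetic behind that statement, for the ~256 GB KV cache from earlier; peak bandwidth figures are the ones quoted above, and real transfers add protocol and software overheads.

// Time to move a 256 GB KV cache at each link's quoted peak bandwidth.
#include <cstdio>

int main() {
    const double kv_gb        = 256.0;
    const double pcie4_gb_s   = 28.0;   // PCIe 4.0 x16, effective
    const double nvlink4_gb_s = 900.0;  // NVLink 4.0, per GPU

    printf("over PCIe 4.0:   %.1f s\n", kv_gb / pcie4_gb_s);    // ~9.1 s
    printf("over NVLink 4.0: %.2f s\n", kv_gb / nvlink4_gb_s);  // ~0.28 s
    return 0;
}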

AMD Infinity Fabric

AMD's Infinity Fabric began as the die-to-die interconnect inside Zen CPUs and has evolved into a full system fabric connecting chiplets, compute dies, memory controllers, and I/O dies within a package and across sockets. In the CDNA3 GPU architecture (Instinct MI300), Infinity Fabric tightly integrates CPU and GPU dies with shared HBM, creating a genuinely unified memory space — the CPU and GPU share the same physical memory pool with hardware-managed coherence.

AMD MI300A

The MI300A APU puts CPU chiplets and GPU compute dies on the same package sharing 128 GB of HBM3 (its GPU-only sibling, the MI300X, carries 192 GB). CPU and GPU see the same memory — no copy, true hardware coherence. This is the closest production system today to what CXL 3.x promises at rack scale.

CCIX, OpenCAPI, Gen-Z, and the standards landscape

Several industry standards have attempted to generalize cache-coherent interconnects beyond any single vendor: CCIX (Cache Coherent Interconnect for Accelerators), OpenCAPI, and Gen-Z. Many of their ideas have been absorbed into CXL, which has emerged as the primary industry standard. The direction of the industry is clear: accelerators should participate as first-class citizens in the coherence domain, not merely sit behind slow copy boundaries.

Fabric          | Scope                  | Coherence | Peak BW          | AI relevance
PCIe 5.0        | Device ↔ CPU           | None      | ~128 GB/s (x16)  | Baseline; bottleneck for KV migration
CXL 3.x         | CPU ↔ memory · accel   | Full      | ~256 GB/s        | Memory pooling, shared KV tiers
NVLink 4.0      | GPU ↔ GPU · GPU ↔ CPU  | Partial   | 900 GB/s         | Fast KV transfer within a GPU domain
Infinity Fabric | Die ↔ die (AMD)        | Full      | ~900 GB/s        | Unified CPU+GPU memory (MI300)
InfiniBand HDR  | Node ↔ Node            | None      | 200 Gb/s         | Cross-node KV movement, RDMA

How a coherent fabric actually behaves

At a high level, the fabric must answer a set of questions for every shared cache line in real time:

Who has a copy?

The system must track which caches or devices currently hold each cache line. This is the sharer list — kept either by snooping (broadcast and listen) or by a centralized directory.

Who modified it?

If one device has a dirty (Modified) copy, other devices cannot safely read an old version from memory. The dirty owner must write back or forward the line before others can read.

Who gets ownership next?

Writes often require exclusive ownership — the writer must first invalidate all other sharers. The protocol coordinates this transfer atomically.

Where is the latest copy?

The freshest copy may be in memory, in a CPU cache, in a GPU cache, or forwarded directly from another device. The protocol decides the shortest path.

[Figure: directory-based coherence flow. A cache line's ownership journey between a CPU, the directory, and a GPU: the CPU's read miss returns data in Exclusive; the GPU's read miss downgrades both caches to Shared; the GPU's write request triggers an invalidation, leaving the CPU Invalid and the GPU Modified. Sequence based on directory-based MESI; actual protocols add acknowledgements and error handling.]
A single cache line's life: CPU reads it (Exclusive), GPU reads it too (both Shared), GPU writes it (CPU Invalidated, GPU Modified). Every step is hardware-coordinated — zero software involvement.

In small systems, coherence can be handled with snooping protocols — every device listens to every transaction on the bus and reacts. This works at low device counts but becomes untenable at scale because broadcast traffic grows with every device added. At large scale, directory-based coherence is standard: a central directory (or distributed directories) maintains metadata about who shares or owns each cache line, and only notifies the relevant parties.

# Snooping protocol (small systems)
device broadcasts:  "I'm writing to address 0x4A2F"
everyone listens and reacts
→ O(n) broadcast traffic per write  # doesn't scale

# Directory protocol (large systems)
device tells directory: "I want to write 0x4A2F"
directory checks sharer list: [CPU0, GPU3]
directory sends invalidation only to CPU0 and GPU3
→ targeted messages, scales to thousands of nodes
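
A directory in this scheme is conceptually just a sharer list per cache line plus targeted invalidations. A minimal sketch follows; the Directory class and its method names are illustrative, not a real protocol implementation.

// Sketch of directory-based invalidation: the directory tracks who holds each
// line and notifies only the caches that actually have a copy.
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <cstdint>
#include <cstdio>

using LineId = uint64_t;
using NodeId = int;

class Directory {
    std::unordered_map<LineId, std::unordered_set<NodeId>> sharers_;
public:
    void record_read(LineId line, NodeId reader) { sharers_[line].insert(reader); }

    // Returns the nodes that must invalidate before `writer` gets exclusive ownership.
    std::vector<NodeId> acquire_exclusive(LineId line, NodeId writer) {
        std::vector<NodeId> to_invalidate;
        for (NodeId n : sharers_[line])
            if (n != writer) to_invalidate.push_back(n);  // targeted, not broadcast
        sharers_[line] = {writer};                         // writer becomes sole owner
        return to_invalidate;
    }
};

int main() {
    Directory dir;
    dir.record_read(0x4A2F, 0);                       // CPU0 reads the line
    dir.record_read(0x4A2F, 3);                       // GPU3 reads the line
    auto victims = dir.acquire_exclusive(0x4A2F, 7);  // GPU7 wants to write
    printf("invalidations sent: %zu\n", victims.size());  // 2: only CPU0 and GPU3
    return 0;
}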

Power & danger: what coherence giveth, and what it taketh away

Why it is powerful

Reduces copies. Simplifies heterogeneous programming. Enables memory pooling. Makes accelerators first-class peers in the memory system. Unlocks prefix sharing for KV caches. Eliminates entire classes of synchronization bugs.

Why it is hard

Coherence traffic, invalidation storms, ownership ping-pong, NUMA latency cliffs, protocol deadlocks, security boundary complexity, and the fact that "shared" becomes a performance anti-pattern under write-heavy workloads.

The false-sharing trap

Cache lines are typically 64 bytes. If two devices each write to different bytes within the same cache line, the hardware still treats this as a conflict and forces serialization — even though the writes don't logically overlap. This is false sharing, and it can silently destroy performance in ways that are nearly invisible without hardware counter instrumentation.

# False sharing example
struct AgentState {
  uint64_t counter_a;  // bytes 0–7  — Agent A writes this
  uint64_t counter_b;  // bytes 8–15 — Agent B writes this
  // Both on same 64-byte cache line!
};

# Hardware sees:
Agent A writes → invalidates Agent B's copy
Agent B writes → invalidates Agent A's copy
→ cache line bounces between devices constantly
# Fix: pad to separate cache lines
alignas(64) uint64_t counter_a;
alignas(64) uint64_t counter_b;

Design principle

Not every tensor should be coherent. Not every memory object should be shared. AI systems must distinguish between hot mutable shared state (needs coherence) and cold read-mostly local state (coherence adds overhead with no benefit). The best-designed systems use coherence selectively, not universally.

Coherent fabrics and KV-cache routing

KV-cache routing is the clearest AI-native application of everything we've discussed. It is where the hardware capabilities of coherent fabrics become a real scheduling and orchestration problem.

Suppose a long-running agent has built a 100k-token context. The KV cache for that context is resident on GPU 3. The next step is scheduled on GPU 7. Without fabric intelligence, the system faces a brutal choice:

Option A — Move KV

Migrate state to compute

Transfer hundreds of GB of KV cache from GPU 3 to GPU 7 before execution can begin. Latency is proportional to KV size. At 256 GB over PCIe: ~9 seconds. Over NVLink: ~0.3 seconds.

Option B — Move compute

Route request to the KV

The scheduler notices GPU 3 holds the KV, routes the next decode step to GPU 3 instead. Zero migration cost. Only works if GPU 3 has available compute capacity.

[Figure: KV cache routing over a memory-aware fabric. Three GPUs attach to a shared CXL memory pool holding a shared 100k-token KV prefix (256 GB, one copy, readable by all GPUs via CXL.cache). GPU 3 holds the resident prefix and receives routed requests, GPU 1 keeps only delta KV, and GPU 7 sits idle. The scheduler routes Agent A and Agent B to GPU 3, which reads the shared prefix coherently from the pool with zero copy overhead.]
The optimal routing decision: steer requests toward the GPU with existing KV affinity, use CXL pool for the shared prefix, keep per-agent deltas local. The fabric makes sharing possible; the scheduler makes it happen.

A coherent fabric doesn't automatically solve KV routing — but it gives the scheduler better primitives: shared memory windows, peer access, coherent reads without explicit copies, and potentially a pooled memory tier where the shared prefix lives once and is readable by any GPU on the fabric.

# Without coherent fabric
request arrives for session_47 (KV on GPU 3)
scheduled on GPU 7
→ cudaMemcpy 256 GB  # PCIe: ~9s. NVLink: ~0.3s
→ execute decode
→ copy result back

# With coherent fabric + smart scheduler
request arrives for session_47
scheduler: KV affinity = GPU 3, pool_offset = 0x40000000
route to GPU 3  # or map CXL window to any GPU
GPU reads prefix coherently from pool  # no copy
→ execute decode  # latency: fabric access time, not copy time

# The policy question that remains:
move compute to KV   # works when KV is large, GPU has capacity
move KV to compute   # works when KV is small or NVLink is fast
share via pool       # works when many agents share prefix
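
One way to read that policy question is as a small cost model. The sketch below is illustrative only; the struct fields, thresholds, and bandwidth figures are assumptions, not a real scheduler API.

// Illustrative cost model for the three routing policies above.
#include <cstdint>
#include <cstdio>
#include <string>

struct Request {
    uint64_t kv_bytes;  // size of the session's KV cache
    int      kv_gpu;    // GPU where the KV currently lives (-1 = only in the CXL pool)
};

struct ClusterState {
    double gpu_free_compute[8] = {};  // fraction of free compute per GPU
    double nvlink_gb_s = 900.0;       // intra-domain migration bandwidth
};

std::string route(const Request& r, const ClusterState& c) {
    // Prefer sending compute to the data when the owning GPU has headroom.
    if (r.kv_gpu >= 0 && c.gpu_free_compute[r.kv_gpu] > 0.2)
        return "move compute to KV";
    // Otherwise migrate only if the KV is small enough to move quickly.
    const double migrate_s = (r.kv_bytes / 1e9) / c.nvlink_gb_s;
    if (migrate_s < 0.05)
        return "move KV to compute";
    // Large KV, busy owner: read the shared prefix coherently from the pool.
    return "share via pool";
}

int main() {
    ClusterState cluster;
    cluster.gpu_free_compute[3] = 0.5;                   // GPU 3 has headroom
    Request session47{256ull << 30, /*kv_gpu=*/3};       // ~256 GiB KV resident on GPU 3
    printf("%s\n", route(session47, cluster).c_str());   // "move compute to KV"
    return 0;
}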

The system architecture this points toward

The most interesting future AI servers may not look like "a CPU with attached memory and some GPUs." They may look like a memory fabric with compute engines attached — where the fabric is the primary coordination mechanism, and compute is a fungible resource dispatched to wherever the data already lives.

[Figure: memory-centric AI server. A coherent memory fabric (consistency, transport, ownership, routing) sits at the center; a CPU (orchestration), a pool of GPUs (tensor compute), a SmartNIC (data movement), and accelerators (encode/decode) attach as clients, alongside a tiered memory hierarchy of HBM (hot), CXL DRAM (warm), and NVMe SSD (cold).]
The memory-centric AI server. Compute is fungible and dispatched to data. The fabric coordinates consistency across a tiered memory hierarchy. The CPU orchestrates policy, not data movement.
CPU       →  orchestration, policy, scheduling
GPU       →  tensor compute (dispatched to where KV lives)
SmartNIC  →  data movement, routing, compression offload
CXL pool  →  expanded shared coherent memory tier
Fabric    →  consistency + transport + ownership + routing

A coherent fabric is best understood as a memory highway with traffic laws. The highway moves data at high bandwidth. The traffic laws determine who owns the latest copy, who must yield when there's a conflict, and how devices safely share state without programmer intervention. For agentic AI — where the workload is a branching, tool-using, memory-hungry loop — the winning systems will not merely have the fastest GPUs. They will have the best memory orchestration.

The thesis

As AI becomes agentic and long-context, coherent fabrics become one of the foundational infrastructure requirements for scalable AI memory systems. The companies and research teams that internalize this — and design their infrastructure around memory orchestration rather than raw compute — will have a durable advantage as context windows grow and agent systems multiply.