Memory Systems · Coherence · Agentic AI

Coherent Fabrics: The Memory Highways Behind Agentic AI

The next bottleneck in AI is not raw GPU compute. It is how CPUs, GPUs, accelerators, NICs, and memory pools coordinate around gigantic, fast-moving state. Coherent fabrics are the hardware idea that tries to make that coordination feel like one shared memory system.

128k+ tokens in modern long-context windows
~256 GB KV cache for one 128k session on a large model
~28 GB/s PCIe 4.0 x16 peak, the copy bottleneck
~900 GB/s NVLink 4.0 GPU-to-GPU bandwidth
<1 μs CXL coherent memory access target latency

Why coherent fabrics matter now

AI infrastructure is undergoing a fundamental shift — from compute-centric to memory-centric. The GPU is increasingly the thing that waits, not the thing that bottlenecks.

For years the dominant AI story was linear: more GPUs, more FLOPS, bigger models, faster training. That equation still holds for raw throughput, but it breaks down the moment you zoom out to the full inference loop. Once models grow large, context windows lengthen to hundreds of thousands of tokens, and agentic systems start calling tools, reading codebases, retrieving documents, and managing multi-step workflows — the bottleneck quietly migrates.

The central shift

The question is no longer only "how fast can the GPU multiply matrices?" It is: where is the data, who owns it, how many copies exist, and how quickly can the next processor see the latest version?

That is exactly the territory of coherent fabrics. They sit underneath CPUs, GPUs, memory expansion devices, SmartNICs, and accelerators, trying to turn fragmented memory islands into something that behaves like a single coordinated system.

[Figure: memory bottleneck shift. Timeline from 2017 (transformers) through 2020 (GPT-3, 175B params), 2022 (RLHF, ChatGPT), and 2024 (long context, agents), with the dominant bottleneck shifting from compute to memory.]
The inflection point is roughly 2024, when long-context serving and agentic workflows made memory orchestration the dominant cost.
# Old bottleneck
GPU compute → buy more H100s → solved

# New bottleneck
memory movement + synchronization + orchestration
→ buying more GPUs often makes it worse
  (more devices = more copies = more coherence traffic)

First: what is a "fabric"?

In hardware systems, a fabric is the communication network connecting compute and memory components. It is the highway system between chips, sockets, accelerators, and racks. Not a single bus — a structured mesh of links, switches, and protocols.

Compute (processors): CPUs, GPUs, DPUs, FPGAs, ASICs, custom AI accelerators. Anything that reads and writes memory to do work.

Memory (storage layers): DRAM, HBM, CXL memory expanders, persistent memory, NVMe SSDs, memory pools attached over the fabric.

I/O (movement engines): SmartNICs, DMA engines, PCIe switches, NVMe controllers. Things whose job is moving data efficiently.

Examples of fabrics include PCIe, NVLink/NVSwitch, AMD Infinity Fabric, and CXL-based fabrics. The critical distinction is what each fabric carries. Some are pure transport — they move bytes. Others carry memory semantics: ownership, visibility, validity. A coherent fabric participates in memory consistency, not just packet delivery.

[Figure: fabric topology. A coherent fabric (ownership, consistency, transport) at the center, connected by bidirectional high-bandwidth links to a CPU with system RAM, a GPU with HBM/VRAM, a SmartNIC with DMA engines, and a CXL-attached memory pool.]
A coherent fabric is not just a cable — it is a structured interconnect plus a memory-ownership protocol shared by all attached devices.

Then: what does "coherent" mean?

Coherence is about making sure that when multiple processors cache the same memory location, they do not silently disagree about what the current value is.

Every modern CPU caches memory in L1/L2/L3 to avoid slow DRAM accesses. When you have one CPU, there's no conflict. When you have two CPUs sharing RAM, both can cache the same address simultaneously — and one can modify it without the other knowing. Same problem, amplified, when GPUs and accelerators enter the picture.

[Figure: the cache coherence problem. The CPU cache holds X=5 (modified, never flushed), the GPU cache holds a stale X=3, and main memory also holds X=3. Hardware cannot tell who is correct; without coherence the programmer must manually flush, invalidate, copy, and barrier.]
The coherence problem. CPU writes X=5 into its cache. GPU still holds the old value X=3. Without protocol enforcement, both believe they are correct.
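
To make the picture concrete, here is a toy simulation of the stale-copy scenario in the figure. The DeviceCache struct is purely illustrative: real caches are hardware structures, not fields in a program.

// Toy simulation of the situation in the figure. The DeviceCache struct is
// purely illustrative; real caches are hardware structures, not program fields.
#include <cstdio>

struct DeviceCache {
    const char* name;
    int x;  // this device's locally cached copy of address X
};

int main() {
    int main_memory_x = 3;
    DeviceCache cpu{"CPU", main_memory_x};  // both devices read X = 3 initially
    DeviceCache gpu{"GPU", main_memory_x};

    cpu.x = 5;  // CPU writes X = 5 into its cache and never flushes it

    // Nothing told the GPU to invalidate, so it still trusts its stale copy.
    printf("%s sees X=%d, %s sees X=%d, memory holds X=%d\n",
           cpu.name, cpu.x, gpu.name, gpu.x, main_memory_x);  // 5, 3, 3
    return 0;
}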

Modern processors use cache coherence protocols to track the state of every cache line across all devices. The famous family of protocols is MESI and its variants. Before getting to those, note the key insight:

Coherence = hardware-assisted agreement about the latest memory state

# Without it:
programmer must:  flush → copy → barrier → invalidate → synchronize
# every handoff between devices is a bespoke software dance

# With it:
hardware tracks who has which copy
# and automatically invalidates or updates stale entries

The MESI protocol: how hardware tracks ownership

MESI is the dominant family of cache coherence protocols. The name is an acronym for the four states a cache line can be in. Understanding MESI gives you the mental model for every coherent fabric — even CXL and NVLink extensions ultimately build on these ideas.

State M (Modified): this cache has the only valid copy. It has been written. Main memory is stale. This cache owns write-back responsibility.

State E (Exclusive): this cache has the only copy. It matches main memory. No other cache holds it. Can be silently promoted to M on write.

State S (Shared): multiple caches hold clean copies. All match memory. Any write must broadcast an invalidation to all sharers first.

State I (Invalid): this cache line is stale or evicted. Any access triggers a cache miss and a fetch from another cache or memory.

[Figure: MESI state machine. Transitions between Modified, Exclusive, Shared, and Invalid with labeled conditions: silent local writes, write-backs when other caches read, invalidation broadcasts, and read misses. Solid edges are common fast paths; dashed edges are cross-cache misses requiring a bus or fabric transaction.]
The MESI state machine. Every cache line in every processor is always in exactly one of these four states. Coherence protocols are the rules governing transitions.
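
The transition rules are compact enough to sketch in code. The following is a simplified model of MESI from a single cache's point of view; a real protocol adds write-back messages, acknowledgements, and handling for racing requests.

// A simplified model of MESI from one cache's point of view.
#include <cstdio>

enum class MesiState { Modified, Exclusive, Shared, Invalid };

enum class Event {
    LocalRead, LocalWrite,    // this cache's own core issues the access
    RemoteRead, RemoteWrite   // another cache's request arrives via snoop or directory
};

MesiState next_state(MesiState s, Event e) {
    switch (s) {
        case MesiState::Invalid:
            if (e == Event::LocalRead)  return MesiState::Shared;    // Exclusive if no other sharer exists
            if (e == Event::LocalWrite) return MesiState::Modified;  // after invalidating all sharers
            return s;
        case MesiState::Shared:
            if (e == Event::LocalWrite)  return MesiState::Modified; // invalidation broadcast first
            if (e == Event::RemoteWrite) return MesiState::Invalid;  // someone else took ownership
            return s;
        case MesiState::Exclusive:
            if (e == Event::LocalWrite)  return MesiState::Modified; // silent upgrade, no bus traffic
            if (e == Event::RemoteRead)  return MesiState::Shared;
            if (e == Event::RemoteWrite) return MesiState::Invalid;
            return s;
        case MesiState::Modified:
            if (e == Event::RemoteRead)  return MesiState::Shared;   // after write-back or forward
            if (e == Event::RemoteWrite) return MesiState::Invalid;  // after forwarding the dirty line
            return s;
    }
    return s;
}

int main() {
    MesiState s = MesiState::Invalid;
    s = next_state(s, Event::LocalRead);   // I -> S
    s = next_state(s, Event::LocalWrite);  // S -> M
    printf("state index: %d\n", static_cast<int>(s));  // 0, i.e. Modified
    return 0;
}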

MOESI adds a fifth state — Owned — allowing a cache to satisfy another cache's read request directly (cache-to-cache transfer) without going to main memory first. This matters enormously for multi-GPU systems where you want GPUs to share data without bouncing through CPU memory.

Why this matters for AI

When two GPUs share a KV cache prefix, the coherence protocol determines whether they can read it simultaneously (Shared state), who must invalidate when one writes (I transition), and whether data can move GPU-to-GPU (MOESI Owned) or must round-trip through CPU RAM. The fabric and protocol together determine the actual bandwidth and latency the AI system sees.

Non-coherent vs coherent systems

The clearest way to understand coherent fabrics is to see what changes when you remove the coherence guarantee.

Non-coherent: separate memory islands

# programmer manages everything
CPU RAM → copy → GPU VRAM
GPU writes result
GPU VRAM → copy → CPU RAM
CPU uses result

# plus manually:
flush · barrier · invalidate

Works, but every handoff is explicit programmer work. Two copies always exist. Latency is additive. Errors are silent.

Coherent: shared address space

# hardware manages consistency
CPU and GPU access same address
Fabric tracks ownership
Stale copies auto-invalidated

# programmer writes:
ptr[i] = value   // just works

Reduces copies and synchronization overhead. Makes heterogeneous programming feel more natural. But not free — coherence traffic has cost.

Dimension          | Non-coherent                            | Coherent fabric
Data sharing       | Explicit copies required                | Shared memory semantics possible
Synchronization    | Software: flush / invalidate / barrier  | Hardware-assisted visibility
Copy count         | O(n) copies for n devices               | Hardware can share a single copy
Programming model  | Manual memory management (cudaMemcpy)   | Unified address space possible
Latency profile    | Predictable but high (copy time)        | Low average, occasional coherence stalls
Scaling challenge  | Copy overhead grows with device count   | Coherence traffic grows with sharer count
Best for           | Bulk, one-directional data movement     | Fine-grained shared mutable state
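
As a concrete illustration of the two programming models in the table, here is a hedged sketch using the CUDA runtime API from host C++. Kernel launches and error handling are elided, and note that managed memory gives a unified address space whose sharing may be hardware-coherent or emulated by page migration, depending on the platform.

// Sketch of both programming models. The functions only illustrate where
// copies do or do not appear; kernel launches and error handling are elided.
#include <cuda_runtime.h>
#include <vector>

void non_coherent_path(std::vector<float>& host_data) {
    float* device_buf = nullptr;
    size_t bytes = host_data.size() * sizeof(float);
    cudaMalloc(&device_buf, bytes);                       // separate GPU allocation
    cudaMemcpy(device_buf, host_data.data(), bytes,
               cudaMemcpyHostToDevice);                   // explicit copy in
    // ... launch kernel on device_buf ...
    cudaMemcpy(host_data.data(), device_buf, bytes,
               cudaMemcpyDeviceToHost);                   // explicit copy back
    cudaFree(device_buf);
}

void unified_path(size_t n) {
    float* shared = nullptr;
    cudaMallocManaged(&shared, n * sizeof(float));        // one pointer, visible to CPU and GPU
    shared[0] = 1.0f;                                     // CPU writes directly
    // ... launch kernel that reads and writes `shared` ...
    cudaDeviceSynchronize();                              // then the CPU reads the same pointer
    cudaFree(shared);
}

int main() {
    std::vector<float> host(1 << 20, 0.0f);
    non_coherent_path(host);
    unified_path(1 << 20);
    return 0;
}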

Why AI systems care so much

AI used to be mostly about weights. Agentic AI is increasingly about state.

Model weights are large but relatively static during inference — they're loaded once and read repeatedly. The dynamic picture is far messier. Every step of an agentic loop creates and consumes state: prompts, retrieved documents, embeddings, tool outputs, intermediate plans, logs, and most importantly — KV cache.

In transformer inference, the KV cache stores key/value tensors for every token in the current context. Subsequent tokens attend over these without recomputing. With a 128k-token context on a large model, the KV cache can be 200–400 GB per session — easily larger than the weights themselves.
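
A quick back-of-envelope check of that range, assuming a 70B-class model (80 layers, 8192-wide hidden state), FP16 KV entries, and full multi-head attention; grouped-query attention or KV quantization would shrink these numbers substantially.

// Back-of-envelope KV cache sizing under the assumptions stated above.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t layers     = 80;
    const uint64_t hidden_dim = 8192;     // width of K (or V) per token per layer
    const uint64_t bytes_elem = 2;        // FP16
    const uint64_t ctx_tokens = 128'000;  // 128k-token context

    // 2x for the K and V tensors
    const uint64_t bytes = 2 * layers * ctx_tokens * hidden_dim * bytes_elem;
    printf("KV cache: %.1f GB for one session\n", bytes / 1e9);  // ~335 GB
    return 0;
}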

[Figure: KV cache size vs context length. KV cache size (70B model, FP16) grows from near zero at 4k tokens to roughly 400 GB at 128k tokens, crossing the 80 GB H100 VRAM line at roughly 48k tokens.]
KV cache grows roughly linearly with context length (per layer × per token). At 128k tokens on a 70B model, you're looking at 200–400 GB — far beyond a single GPU's VRAM.

Imagine two coding agents both working with the same 100k-token repository context. A naive system holds two full copies of the KV cache. A coherent-fabric-aware system routes both agents to the same memory, shares the read-only prefix, and only diverges state at the point where the agents differ. The difference is holding that context twice versus once.

Once you have prefix sharing, you also need prefix routing — a scheduler that knows where each KV prefix lives and steers new requests toward it rather than migrating gigabytes of state. That is why coherent fabrics and KV routing are deeply coupled problems.

# Scale the problem
1 agent, 128k ctx  →  ~256 GB KV
100 agents, shared 100k prefix  →  ~25.6 TB naive  vs  ~256 GB + 100 × delta

# The fabric makes sharing possible
# The scheduler makes sharing practical

Real-world fabrics and their tradeoffs

No single fabric solves every layer of the problem. Different vendors attack different segments of the memory hierarchy.

CXL — Compute Express Link

CXL is an open cache-coherent interconnect built on the PCIe physical layer, backed by Intel, AMD, NVIDIA, Samsung, and much of the rest of the industry. It defines three sub-protocols that can be mixed:

CXL.io (device I/O): non-coherent, similar to standard PCIe. Used for legacy-compatible device access.

CXL.cache (device caching): lets an accelerator coherently cache host memory. The device participates in the CPU's coherence domain.

CXL.mem (memory expansion): host-coherent access to device-attached memory. Enables memory pooling and tiering at rack scale.

[Figure: CXL memory pooling. CPU sockets (512 GB DRAM each) and GPU accelerators (80 GB HBM each) attach through a CXL switch to a shared 2–8 TB CXL memory pool; effective total visible memory is 1 TB DRAM + 160 GB HBM + the 2–8 TB pool.]
CXL's killer feature is memory pooling — multiple hosts and accelerators sharing one coherent memory tier, addressable by all without software copies.
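
On Linux, a CXL.mem expander typically shows up as a CPU-less NUMA node, so ordinary NUMA placement APIs can target it. A minimal sketch of placing a shared prefix buffer on that tier with libnuma follows; the node id is an assumption and is system dependent.

// Hedged sketch: the node id used here (2) is an assumption; discover the real
// topology with numactl -H. Build with: g++ cxl_prefix.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy not supported on this system\n");
        return 1;
    }

    const int    cxl_node     = 2;           // assumed id of the CXL memory tier
    const size_t prefix_bytes = 1ull << 30;  // a 1 GiB slice of a shared KV prefix

    // Place the buffer's pages on the CXL-backed node; any CPU, and any device
    // with coherent access to host memory, can then read it in place.
    void* prefix = numa_alloc_onnode(prefix_bytes, cxl_node);
    if (prefix == nullptr) {
        perror("numa_alloc_onnode");
        return 1;
    }

    // ... fill with the shared prefix and hand the pointer to the serving stack ...

    numa_free(prefix, prefix_bytes);
    return 0;
}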

NVLink / NVSwitch — NVIDIA's GPU fabric

NVLink is NVIDIA's proprietary high-bandwidth interconnect between GPUs (and between GPU and CPU on Grace Hopper). NVSwitch is the fabric chip that connects GPUs in a full all-to-all topology; NVLink Switch domains scale to 256 GPUs in the Hopper generation and to 576 in the Blackwell generation. The bandwidth numbers are staggering:

900 GB/s NVLink 4.0 bidirectional bandwidth per GPU
28 GB/s PCIe 4.0 x16 peak, roughly 32× slower
3.6 TB/s bisection bandwidth within an 8-GPU NVSwitch node
256 max GPUs in a Hopper-generation NVLink Switch domain (576 with Blackwell)
[Figure: NVLink vs PCIe bandwidth. Bar chart comparing NVLink 4.0 at 900 GB/s against PCIe 4.0 x16 at 28 GB/s, a roughly 32× gap; intra-server GPU communication is a different class of problem.]
The NVLink/PCIe bandwidth gap is why KV cache transfers that are "free" within an NVLink domain become painful the moment they cross a PCIe or network boundary.
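
The arithmetic behind that statement, for the ~256 GB KV cache from earlier; peak bandwidth figures are the ones quoted above, and real transfers add protocol and software overheads.

// Time to move a 256 GB KV cache at each link's quoted peak bandwidth.
#include <cstdio>

int main() {
    const double kv_gb        = 256.0;
    const double pcie4_gb_s   = 28.0;   // PCIe 4.0 x16, effective
    const double nvlink4_gb_s = 900.0;  // NVLink 4.0, per GPU

    printf("over PCIe 4.0:   %.1f s\n", kv_gb / pcie4_gb_s);    // ~9.1 s
    printf("over NVLink 4.0: %.2f s\n", kv_gb / nvlink4_gb_s);  // ~0.28 s
    return 0;
}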

AMD Infinity Fabric

AMD's Infinity Fabric began as the die-to-die interconnect inside Zen CPUs and has evolved into a full system fabric connecting chiplets, compute dies, memory controllers, and I/O dies within a package and across sockets. In the CDNA3 GPU architecture (Instinct MI300), Infinity Fabric tightly integrates CPU and GPU dies with shared HBM, creating a genuinely unified memory space — the CPU and GPU share the same physical memory pool with hardware-managed coherence.

AMD MI300A

The MI300A APU puts CPU chiplets and GPU compute dies on the same package sharing 128 GB of HBM3 (its GPU-only sibling, the MI300X, carries 192 GB). CPU and GPU see the same memory — no copy, true hardware coherence. This is the closest production system today to what CXL 3.x promises at rack scale.

CCIX, OpenCAPI, Gen-Z, and the standards landscape

Several industry standards have attempted to generalize cache-coherent interconnects beyond any single vendor: CCIX (Cache Coherent Interconnect for Accelerators), OpenCAPI, and Gen-Z. Many of their ideas have been absorbed into CXL, which has emerged as the primary industry standard. The direction of the industry is clear: accelerators should participate as first-class citizens in the coherence domain, not merely sit behind slow copy boundaries.

Fabric          | Scope                  | Coherence | Peak BW          | AI relevance
PCIe 5.0        | Device ↔ CPU           | None      | ~128 GB/s (x16)  | Baseline; bottleneck for KV migration
CXL 3.x         | CPU ↔ memory · accel   | Full      | ~256 GB/s        | Memory pooling, shared KV tiers
NVLink 4.0      | GPU ↔ GPU · GPU ↔ CPU  | Partial   | 900 GB/s         | Fast KV transfer within a GPU domain
Infinity Fabric | Die ↔ die (AMD)        | Full      | ~900 GB/s        | Unified CPU+GPU memory (MI300)
InfiniBand HDR  | Node ↔ Node            | None      | 200 Gb/s         | Cross-node KV movement, RDMA

How a coherent fabric actually behaves

At a high level, the fabric must answer a set of questions for every shared cache line in real time:

Who has a copy?

The system must track which caches or devices currently hold each cache line. This is the sharer list — kept either by snooping (broadcast and listen) or by a centralized directory.

Who modified it?

If one device has a dirty (Modified) copy, other devices cannot safely read an old version from memory. The dirty owner must write back or forward the line before others can read.

Who gets ownership next?

Writes often require exclusive ownership — the writer must first invalidate all other sharers. The protocol coordinates this transfer atomically.

Where is the latest copy?

The freshest copy may be in memory, in a CPU cache, in a GPU cache, or forwarded directly from another device. The protocol decides the shortest path.

[Figure: directory-based coherence flow. A cache line's ownership journey between a CPU, the directory, and a GPU: the CPU's read miss returns data in Exclusive; the GPU's read miss downgrades both caches to Shared; the GPU's write request triggers an invalidation, leaving the CPU Invalid and the GPU Modified. Sequence based on directory-based MESI; actual protocols add acknowledgements and error handling.]
A single cache line's life: CPU reads it (Exclusive), GPU reads it too (both Shared), GPU writes it (CPU Invalidated, GPU Modified). Every step is hardware-coordinated — zero software involvement.

In small systems, coherence can be handled with snooping protocols — every device listens to every transaction on the bus and reacts. This works at low device counts but becomes untenable at scale because broadcast traffic grows with every device added. At large scale, directory-based coherence is standard: a central directory (or distributed directories) maintains metadata about who shares or owns each cache line, and only notifies the relevant parties.

# Snooping protocol (small systems)
device broadcasts:  "I'm writing to address 0x4A2F"
everyone listens and reacts
→ O(n) broadcast traffic per write  # doesn't scale

# Directory protocol (large systems)
device tells directory: "I want to write 0x4A2F"
directory checks sharer list: [CPU0, GPU3]
directory sends invalidation only to CPU0 and GPU3
→ targeted messages, scales to thousands of nodes
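
A directory in this scheme is conceptually just a sharer list per cache line plus targeted invalidations. A minimal sketch follows; the Directory class and its method names are illustrative, not a real protocol implementation.

// Sketch of directory-based invalidation: the directory tracks who holds each
// line and notifies only the caches that actually have a copy.
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <cstdint>
#include <cstdio>

using LineId = uint64_t;
using NodeId = int;

class Directory {
    std::unordered_map<LineId, std::unordered_set<NodeId>> sharers_;
public:
    void record_read(LineId line, NodeId reader) { sharers_[line].insert(reader); }

    // Returns the nodes that must invalidate before `writer` gets exclusive ownership.
    std::vector<NodeId> acquire_exclusive(LineId line, NodeId writer) {
        std::vector<NodeId> to_invalidate;
        for (NodeId n : sharers_[line])
            if (n != writer) to_invalidate.push_back(n);  // targeted, not broadcast
        sharers_[line] = {writer};                         // writer becomes sole owner
        return to_invalidate;
    }
};

int main() {
    Directory dir;
    dir.record_read(0x4A2F, 0);                       // CPU0 reads the line
    dir.record_read(0x4A2F, 3);                       // GPU3 reads the line
    auto victims = dir.acquire_exclusive(0x4A2F, 7);  // GPU7 wants to write
    printf("invalidations sent: %zu\n", victims.size());  // 2: only CPU0 and GPU3
    return 0;
}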

Power & danger: what coherence giveth, and what it taketh away

Why it is powerful

Reduces copies. Simplifies heterogeneous programming. Enables memory pooling. Makes accelerators first-class peers in the memory system. Unlocks prefix sharing for KV caches. Eliminates entire classes of synchronization bugs.

Why it is hard

Coherence traffic, invalidation storms, ownership ping-pong, NUMA latency cliffs, protocol deadlocks, security boundary complexity, and the fact that "shared" becomes a performance anti-pattern under write-heavy workloads.

The false-sharing trap

Cache lines are typically 64 bytes. If two devices each write to different bytes within the same cache line, the hardware still treats this as a conflict and forces serialization — even though the writes don't logically overlap. This is false sharing, and it can silently destroy performance in ways that are nearly invisible without hardware counter instrumentation.

# False sharing example
struct AgentState {
  uint64_t counter_a;  // bytes 0–7  — Agent A writes this
  uint64_t counter_b;  // bytes 8–15 — Agent B writes this
  // Both on same 64-byte cache line!
};

# Hardware sees:
Agent A writes → invalidates Agent B's copy
Agent B writes → invalidates Agent A's copy
→ cache line bounces between devices constantly
# Fix: pad to separate cache lines
alignas(64) uint64_t counter_a;
alignas(64) uint64_t counter_b;

Design principle

Not every tensor should be coherent. Not every memory object should be shared. AI systems must distinguish between hot mutable shared state (needs coherence) and cold read-mostly local state (coherence adds overhead with no benefit). The best-designed systems use coherence selectively, not universally.

Coherent fabrics and KV-cache routing

KV-cache routing is the clearest AI-native application of everything we've discussed. It is where the hardware capabilities of coherent fabrics become a real scheduling and orchestration problem.

Suppose a long-running agent has built a 100k-token context. The KV cache for that context is resident on GPU 3. The next step is scheduled on GPU 7. Without fabric intelligence, the system faces a brutal choice:

Option A — Move KV

Migrate state to compute

Transfer hundreds of GB of KV cache from GPU 3 to GPU 7 before execution can begin. Latency is proportional to KV size. At 256 GB over PCIe: ~9 seconds. Over NVLink: ~0.3 seconds.

Option B — Move compute

Route request to the KV

The scheduler notices GPU 3 holds the KV, routes the next decode step to GPU 3 instead. Zero migration cost. Only works if GPU 3 has available compute capacity.

[Figure: KV cache routing over a memory-aware fabric. Three GPUs attach to a shared CXL memory pool holding a shared 100k-token KV prefix (256 GB, one copy, readable by all GPUs via CXL.cache). GPU 3 holds the resident prefix and receives routed requests, GPU 1 keeps only delta KV, and GPU 7 sits idle. The scheduler routes Agent A and Agent B to GPU 3, which reads the shared prefix coherently from the pool with zero copy overhead.]
The optimal routing decision: steer requests toward the GPU with existing KV affinity, use CXL pool for the shared prefix, keep per-agent deltas local. The fabric makes sharing possible; the scheduler makes it happen.

A coherent fabric doesn't automatically solve KV routing — but it gives the scheduler better primitives: shared memory windows, peer access, coherent reads without explicit copies, and potentially a pooled memory tier where the shared prefix lives once and is readable by any GPU on the fabric.

# Without coherent fabric
request arrives for session_47 (KV on GPU 3)
scheduled on GPU 7
→ cudaMemcpy 256 GB  # PCIe: ~9s. NVLink: ~0.3s
→ execute decode
→ copy result back

# With coherent fabric + smart scheduler
request arrives for session_47
scheduler: KV affinity = GPU 3, pool_offset = 0x40000000
route to GPU 3  # or map CXL window to any GPU
GPU reads prefix coherently from pool  # no copy
→ execute decode  # latency: fabric access time, not copy time

# The policy question that remains:
move compute to KV   # works when KV is large, GPU has capacity
move KV to compute   # works when KV is small or NVLink is fast
share via pool       # works when many agents share prefix
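
One way to read that policy question is as a small cost model. The sketch below is illustrative only; the struct fields, thresholds, and bandwidth figures are assumptions, not a real scheduler API.

// Illustrative cost model for the three routing policies above.
#include <cstdint>
#include <cstdio>
#include <string>

struct Request {
    uint64_t kv_bytes;  // size of the session's KV cache
    int      kv_gpu;    // GPU where the KV currently lives (-1 = only in the CXL pool)
};

struct ClusterState {
    double gpu_free_compute[8] = {};  // fraction of free compute per GPU
    double nvlink_gb_s = 900.0;       // intra-domain migration bandwidth
};

std::string route(const Request& r, const ClusterState& c) {
    // Prefer sending compute to the data when the owning GPU has headroom.
    if (r.kv_gpu >= 0 && c.gpu_free_compute[r.kv_gpu] > 0.2)
        return "move compute to KV";
    // Otherwise migrate only if the KV is small enough to move quickly.
    const double migrate_s = (r.kv_bytes / 1e9) / c.nvlink_gb_s;
    if (migrate_s < 0.05)
        return "move KV to compute";
    // Large KV, busy owner: read the shared prefix coherently from the pool.
    return "share via pool";
}

int main() {
    ClusterState cluster;
    cluster.gpu_free_compute[3] = 0.5;                   // GPU 3 has headroom
    Request session47{256ull << 30, /*kv_gpu=*/3};       // ~256 GiB KV resident on GPU 3
    printf("%s\n", route(session47, cluster).c_str());   // "move compute to KV"
    return 0;
}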

The system architecture this points toward

The most interesting future AI servers may not look like "a CPU with attached memory and some GPUs." They may look like a memory fabric with compute engines attached — where the fabric is the primary coordination mechanism, and compute is a fungible resource dispatched to wherever the data already lives.

[Figure: memory-centric AI server. A coherent memory fabric (consistency, transport, ownership, routing) sits at the center; a CPU (orchestration), a pool of GPUs (tensor compute), a SmartNIC (data movement), and accelerators (encode/decode) attach as clients, alongside a tiered memory hierarchy of HBM (hot), CXL DRAM (warm), and NVMe SSD (cold).]
The memory-centric AI server. Compute is fungible and dispatched to data. The fabric coordinates consistency across a tiered memory hierarchy. The CPU orchestrates policy, not data movement.
CPU       →  orchestration, policy, scheduling
GPU       →  tensor compute (dispatched to where KV lives)
SmartNIC  →  data movement, routing, compression offload
CXL pool  →  expanded shared coherent memory tier
Fabric    →  consistency + transport + ownership + routing

A coherent fabric is best understood as a memory highway with traffic laws. The highway moves data at high bandwidth. The traffic laws determine who owns the latest copy, who must yield when there's a conflict, and how devices safely share state without programmer intervention. For agentic AI — where the workload is a branching, tool-using, memory-hungry loop — the winning systems will not merely have the fastest GPUs. They will have the best memory orchestration.

The thesis

As AI becomes agentic and long-context, coherent fabrics become one of the foundational infrastructure requirements for scalable AI memory systems. The companies and research teams that internalize this — and design their infrastructure around memory orchestration rather than raw compute — will have a durable advantage as context windows grow and agent systems multiply.