Coherent Fabrics: The Memory Highways Behind Agentic AI
The next bottleneck in AI is not raw GPU compute. It is how CPUs, GPUs, accelerators, NICs, and memory pools coordinate around gigantic, fast-moving state. Coherent fabrics are the hardware idea that tries to make that coordination feel like one shared memory system.
Why coherent fabrics matter now
AI infrastructure is undergoing a fundamental shift from compute-centric to memory-centric. Increasingly, the GPU is the component that waits for data, not the one that limits throughput.
For years the dominant AI story was linear: more GPUs, more FLOPS, bigger models, faster training. That equation still holds for raw throughput, but it breaks down the moment you zoom out to the full inference loop. Once models grow large, context windows lengthen to hundreds of thousands of tokens, and agentic systems start calling tools, reading codebases, retrieving documents, and managing multi-step workflows — the bottleneck quietly migrates.
The question is no longer only "how fast can the GPU multiply matrices?" It is: where is the data, who owns it, how many copies exist, and how quickly can the next processor see the latest version?
That is exactly the territory of coherent fabrics. They sit underneath CPUs, GPUs, memory expansion devices, SmartNICs, and accelerators, trying to turn fragmented memory islands into something that behaves like a single coordinated system.
```
# Old bottleneck
GPU compute → buy more H100s → solved

# New bottleneck
memory movement + synchronization + orchestration
→ buying more GPUs often makes it worse
  (more devices = more copies = more coherence traffic)
```
First: what is a "fabric"?
In hardware systems, a fabric is the communication network connecting compute and memory components. It is the highway system between chips, sockets, accelerators, and racks. Not a single bus — a structured mesh of links, switches, and protocols.
Processors
CPUs, GPUs, DPUs, FPGAs, ASICs, custom AI accelerators. Anything that reads and writes memory to do work.
Storage layers
DRAM, HBM, CXL memory expanders, persistent memory, NVMe SSDs, memory pools attached over the fabric.
Movement engines
SmartNICs, DMA engines, PCIe switches, NVMe controllers. Things whose job is moving data efficiently.
Examples of fabrics include PCIe, NVLink/NVSwitch, AMD Infinity Fabric, and CXL-based fabrics. The critical distinction is what each fabric carries. Some are pure transport — they move bytes. Others carry memory semantics: ownership, visibility, validity. A coherent fabric participates in memory consistency, not just packet delivery.
Then: what does "coherent" mean?
Coherence is about making sure that when multiple processors cache the same memory location, they do not silently disagree about what the current value is.
Every modern CPU caches memory in L1/L2/L3 to avoid slow DRAM accesses. When you have one CPU, there's no conflict. When you have two CPUs sharing RAM, both can cache the same address simultaneously — and one can modify it without the other knowing. Same problem, amplified, when GPUs and accelerators enter the picture.
Modern processors use cache coherence protocols to track the state of every cache line across all devices. The famous family of protocols is MESI and its variants. Before getting to those, note the key insight:
```
Coherence = hardware-assisted agreement about the latest memory state

# Without it:
programmer must: flush → copy → barrier → invalidate → synchronize
# every handoff between devices is a bespoke software dance

# With it:
hardware tracks who has which copy
# and automatically invalidates or updates stale entries
```
The MESI protocol: how hardware tracks ownership
MESI is the dominant family of cache coherence protocols. The name is an acronym for the four states a cache line can be in. Understanding MESI gives you the mental model for every coherent fabric — even CXL and NVLink extensions ultimately build on these ideas.
Modified
This cache has the only valid copy. It has been written. Main memory is stale. This cache owns write-back responsibility.
Exclusive
This cache has the only copy. It matches main memory. No other cache holds it. Can be silently promoted to M on write.
Shared
Multiple caches hold clean copies. All match memory. Any write must broadcast an invalidation to all sharers first.
Invalid
This cache line is stale or evicted. Any access triggers a cache miss and a fetch from another cache or memory.
MOESI adds a fifth state — Owned — allowing a cache to satisfy another cache's read request directly (cache-to-cache transfer) without going to main memory first. This matters enormously for multi-GPU systems where you want GPUs to share data without bouncing through CPU memory.
When two GPUs share a KV cache prefix, the coherence protocol determines whether they can read it simultaneously (Shared state), who must invalidate when one writes (I transition), and whether data can move GPU-to-GPU (MOESI Owned) or must round-trip through CPU RAM. The fabric and protocol together determine the actual bandwidth and latency the AI system sees.
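The state transitions above can be captured in a toy state machine. The following is a minimal sketch, not any real protocol engine: one cache line observed by n caches, with the textbook MESI transitions for reads and writes (`MesiLine` and its methods are illustrative names; real implementations also handle write-backs, races, and transient states).

```cpp
#include <cassert>
#include <vector>

// MESI states for a single cache line.
enum class State { Modified, Exclusive, Shared, Invalid };

// Toy model: one cache line, n caches, ideal atomic transitions.
struct MesiLine {
    std::vector<State> caches;
    explicit MesiLine(int n) : caches(n, State::Invalid) {}

    // Read miss: any remote M/E copy is demoted to Shared; the reader ends
    // Exclusive if it is now the only holder, Shared otherwise.
    void read(int who) {
        if (caches[who] != State::Invalid) return;  // read hit, no change
        bool others = false;
        for (int i = 0; i < (int)caches.size(); ++i) {
            if (i == who) continue;
            if (caches[i] == State::Modified || caches[i] == State::Exclusive)
                caches[i] = State::Shared;  // remote read demotes the owner
            if (caches[i] != State::Invalid) others = true;
        }
        caches[who] = others ? State::Shared : State::Exclusive;
    }

    // Write: invalidate every other copy; the writer ends Modified.
    void write(int who) {
        for (int i = 0; i < (int)caches.size(); ++i)
            if (i != who) caches[i] = State::Invalid;
        caches[who] = State::Modified;
    }
};
```

Tracing the KV-cache scenario through this model: two GPUs reading the same prefix land both lines in Shared; the first write by either one invalidates the other's copy, which is exactly the cost a scheduler tries to avoid.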
Non-coherent vs coherent systems
The clearest way to understand coherent fabrics is to see what changes when you remove the coherence guarantee.
Separate memory islands
```
# programmer manages everything
CPU RAM → copy → GPU VRAM
GPU writes result
GPU VRAM → copy → CPU RAM
CPU uses result
# plus manually: flush · barrier · invalidate
```
Works, but every handoff is explicit programmer work. Two copies always exist. Latency is additive. Errors are silent.
Shared address space
```
# hardware manages consistency
CPU and GPU access same address
Fabric tracks ownership
Stale copies auto-invalidated
# programmer writes:
ptr[i] = value   // just works
```
Reduces copies and synchronization overhead. Makes heterogeneous programming feel more natural. But not free — coherence traffic has cost.
| Dimension | Non-coherent | Coherent fabric |
|---|---|---|
| Data sharing | Explicit copies required | Shared memory semantics possible |
| Synchronization | Software: flush / invalidate / barrier | Hardware-assisted visibility |
| Copy count | O(n) copies for n devices | Hardware can share single copy |
| Programming model | Manual memory management (cudaMemcpy) | Unified address space possible |
| Latency profile | Predictable but high (copy time) | Low average, occasional coherence stalls |
| Scaling challenge | Copy overhead grows with device count | Coherence traffic grows with sharer count |
| Best for | Bulk, one-directional data movement | Fine-grained shared mutable state |
Why AI systems care so much
AI used to be mostly about weights. Agentic AI is increasingly about state.
Model weights are large but relatively static during inference — they're loaded once and read repeatedly. The dynamic picture is far messier. Every step of an agentic loop creates and consumes state: prompts, retrieved documents, embeddings, tool outputs, intermediate plans, logs, and most importantly — KV cache.
In transformer inference, the KV cache stores key/value tensors for every token in the current context. Subsequent tokens attend over these without recomputing. With a 128k-token context on a large model, the KV cache can be 200–400 GB per session — easily larger than the weights themselves.
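The arithmetic behind numbers like these is straightforward: two tensors (K and V) per layer, one head_dim-wide vector per token per KV head. A sketch, with an illustrative model configuration (the 80-layer / 64-head / fp16 figures below are assumptions for a large dense model without grouped-query attention, not any specific product):

```cpp
#include <cassert>
#include <cstdint>

// KV cache footprint in bytes: 2 tensors (K and V) per layer, one
// head_dim-wide vector per token per KV head.
uint64_t kv_cache_bytes(uint64_t layers, uint64_t kv_heads, uint64_t head_dim,
                        uint64_t tokens, uint64_t bytes_per_elem) {
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem;
}
```

For 80 layers, 64 KV heads, head dimension 128, and fp16 values, a 128k-token context comes to about 335 GB, squarely in the range quoted above. Grouped-query attention shrinks `kv_heads` and can cut this by an order of magnitude, which is one reason real deployments vary so widely.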
Imagine two coding agents both working with the same 100k-token repository context. A naive system holds two full copies of the KV cache. A coherent-fabric-aware system routes both agents to the same memory, shares the read-only prefix, and only diverges state at the point where the agents differ. The difference is holding that prefix twice versus once.
Once you have prefix sharing, you also need prefix routing — a scheduler that knows where each KV prefix lives and steers new requests toward it rather than migrating gigabytes of state. That is why coherent fabrics and KV routing are deeply coupled problems.
```
# Scale the problem
1 agent, 128k ctx              → ~256 GB KV
100 agents, shared 100k prefix → 25.6 TB naive
                                 vs 256 GB + 100 × delta

# The fabric makes sharing possible
# The scheduler makes sharing practical
```
Real-world fabrics and their tradeoffs
No single fabric solves every layer of the problem. Different vendors attack different segments of the memory hierarchy.
CXL — Compute Express Link
CXL is an open cache-coherent interconnect standard built on the PCIe physical layer, backed by Intel, AMD, NVIDIA, Samsung, and others. It defines three sub-protocols that can be mixed:
Device I/O (CXL.io)
Non-coherent, similar to standard PCIe. Used for legacy-compatible device access and discovery.
Device caching (CXL.cache)
Lets an accelerator coherently cache host memory. The device participates in the CPU's coherence domain.
Memory expansion (CXL.mem)
Gives the host coherent access to device-attached memory. Enables memory pooling and tiering at rack scale.
NVLink / NVSwitch — NVIDIA's GPU fabric
NVLink is NVIDIA's proprietary high-bandwidth interconnect between GPUs (and between GPU and CPU on Grace Hopper). NVSwitch is the fabric chip that connects GPUs in a full all-to-all topology; the NVLink Switch System extends this to as many as 576 GPUs. The bandwidth numbers are staggering: NVLink 4.0 delivers 900 GB/s of aggregate bandwidth per H100 GPU, roughly seven times a PCIe 5.0 x16 link.
AMD Infinity Fabric
AMD's Infinity Fabric began as the die-to-die interconnect inside Zen CPUs and has evolved into a full system fabric connecting chiplets, compute dies, memory controllers, and I/O dies within a package and across sockets. In the CDNA3 GPU architecture (Instinct MI300), Infinity Fabric tightly integrates CPU and GPU dies with shared HBM, creating a genuinely unified memory space — the CPU and GPU share the same physical memory pool with hardware-managed coherence.
In the MI300A, CPU chiplets and GPU compute dies sit on the same package sharing 128 GB of HBM3 (the GPU-only MI300X carries 192 GB). CPU and GPU see the same memory: no copy, true hardware coherence. This is the closest production system today to what CXL 3.x promises at rack scale.
CCIX, OpenCAPI, Gen-Z, and the standards landscape
Several industry standards have attempted to generalize cache-coherent interconnects beyond any single vendor: CCIX (Cache Coherent Interconnect for Accelerators), OpenCAPI, and Gen-Z. Many of their ideas have been absorbed into CXL, which has emerged as the primary industry standard. The direction of the industry is clear: accelerators should participate as first-class citizens in the coherence domain, not merely sit behind slow copy boundaries.
| Fabric | Scope | Coherence | Peak BW | AI relevance |
|---|---|---|---|---|
| PCIe 5.0 | Device ↔ CPU | None | ~128 GB/s (x16) | Baseline; bottleneck for KV migration |
| CXL 3.x | CPU ↔ mem · accel | Full | ~256 GB/s | Memory pooling, shared KV tiers |
| NVLink 4.0 | GPU ↔ GPU · GPU ↔ CPU | Partial | 900 GB/s | Fast KV transfer within GPU domain |
| Infinity Fabric | Die ↔ die (AMD) | Full | ~900 GB/s | Unified CPU+GPU memory (MI300) |
| InfiniBand HDR | Node ↔ Node | None | 200 Gb/s | Cross-node KV movement, RDMA |
How a coherent fabric actually behaves
At a high level, the fabric must answer a set of questions for every shared cache line in real time:
Who has a copy?
The system must track which caches or devices currently hold each cache line. This is the sharer list — kept either by snooping (broadcast and listen) or by a centralized directory.
Who modified it?
If one device has a dirty (Modified) copy, other devices cannot safely read an old version from memory. The dirty owner must write back or forward the line before others can read.
Who gets ownership next?
Writes often require exclusive ownership — the writer must first invalidate all other sharers. The protocol coordinates this transfer atomically.
Where is the latest copy?
The freshest copy may be in memory, in a CPU cache, in a GPU cache, or forwarded directly from another device. The protocol decides the shortest path.
In small systems, coherence can be handled with snooping protocols — every device listens to every transaction on the bus and reacts. This works at low device counts but becomes untenable at scale because broadcast traffic grows with every device added. At large scale, directory-based coherence is standard: a central directory (or distributed directories) maintains metadata about who shares or owns each cache line, and only notifies the relevant parties.
```
# Snooping protocol (small systems)
device broadcasts: "I'm writing to address 0x4A2F"
everyone listens and reacts
→ O(n) broadcast traffic per write   # doesn't scale

# Directory protocol (large systems)
device tells directory: "I want to write 0x4A2F"
directory checks sharer list: [CPU0, GPU3]
directory sends invalidation only to CPU0 and GPU3
→ targeted messages, scales to thousands of nodes
```
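The directory idea can be sketched in a few lines. This toy `Directory` (an illustrative name; it models a single cache line and ignores transient states) tracks the sharer list and, on a write, emits invalidations only to the devices that actually hold a copy:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy directory entry for one cache line: who shares it, and who (if
// anyone) holds it dirty. A real directory keeps one entry per tracked line.
struct Directory {
    std::set<int> sharers;
    int owner = -1;  // device holding the line Modified, or -1 for none

    // Read: a dirty owner must surrender the line (modeled here as simply
    // becoming an ordinary sharer); the reader joins the sharer list.
    void read(int dev) {
        if (owner != -1) { sharers.insert(owner); owner = -1; }
        sharers.insert(dev);
    }

    // Write: return the targeted invalidation list. Only current sharers
    // are notified, never the whole fabric.
    std::vector<int> write(int dev) {
        std::vector<int> invalidate;
        for (int s : sharers)
            if (s != dev) invalidate.push_back(s);
        sharers.clear();
        owner = dev;
        return invalidate;
    }
};
```

The key property is visible in the return value of `write`: the message count scales with the number of sharers of that particular line, not with the number of devices on the fabric.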
Power and danger: what coherence gives and what it costs
What it gives: fewer copies, simpler heterogeneous programming, memory pooling, accelerators as first-class peers in the memory system, prefix sharing for KV caches, and the elimination of entire classes of synchronization bugs.
What it costs: coherence traffic, invalidation storms, ownership ping-pong, NUMA latency cliffs, protocol deadlocks, security-boundary complexity, and the fact that "shared" becomes a performance anti-pattern under write-heavy workloads.
The false-sharing trap
Cache lines are typically 64 bytes. If two devices each write to different bytes within the same cache line, the hardware still treats this as a conflict and forces serialization — even though the writes don't logically overlap. This is false sharing, and it can silently destroy performance in ways that are nearly invisible without hardware counter instrumentation.
```cpp
// False sharing example
struct AgentState {
    uint64_t counter_a;  // bytes 0–7  — Agent A writes this
    uint64_t counter_b;  // bytes 8–15 — Agent B writes this
    // Both on the same 64-byte cache line!
};

// Hardware sees:
//   Agent A writes → invalidates Agent B's copy
//   Agent B writes → invalidates Agent A's copy
//   → cache line bounces between devices constantly

// Fix: pad to separate cache lines
alignas(64) uint64_t counter_a;
alignas(64) uint64_t counter_b;
```
Not every tensor should be coherent. Not every memory object should be shared. AI systems must distinguish between hot mutable shared state (needs coherence) and cold read-mostly local state (coherence adds overhead with no benefit). The best-designed systems use coherence selectively, not universally.
Coherent fabrics and KV-cache routing
KV-cache routing is the clearest AI-native application of everything we've discussed. It is where the hardware capabilities of coherent fabrics become a real scheduling and orchestration problem.
Suppose a long-running agent has built a 100k-token context. The KV cache for that context is resident on GPU 3. The next step is scheduled on GPU 7. Without fabric intelligence, the system faces a brutal choice:
Migrate state to compute
Transfer hundreds of GB of KV cache from GPU 3 to GPU 7 before execution can begin. Latency is proportional to KV size. At 256 GB over PCIe: ~9 seconds. Over NVLink: ~0.3 seconds.
Route request to the KV
The scheduler notices GPU 3 holds the KV, routes the next decode step to GPU 3 instead. Zero migration cost. Only works if GPU 3 has available compute capacity.
A coherent fabric doesn't automatically solve KV routing — but it gives the scheduler better primitives: shared memory windows, peer access, coherent reads without explicit copies, and potentially a pooled memory tier where the shared prefix lives once and is readable by any GPU on the fabric.
```
# Without coherent fabric
request arrives for session_47 (KV on GPU 3)
scheduled on GPU 7
→ cudaMemcpy 256 GB      # PCIe: ~9s. NVLink: ~0.3s
→ execute decode
→ copy result back

# With coherent fabric + smart scheduler
request arrives for session_47
scheduler: KV affinity = GPU 3, pool_offset = 0x40000000
route to GPU 3           # or map CXL window to any GPU
GPU reads prefix coherently from pool   # no copy
→ execute decode         # latency: fabric access time, not copy time

# The policy question that remains:
move compute to KV   # works when KV is large, GPU has capacity
move KV to compute   # works when KV is small or NVLink is fast
share via pool       # works when many agents share prefix
```
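The remaining policy question reduces to comparing a copy cost against a wait cost. A deliberately simplified sketch (`schedule`, its parameters, and the scalar cost model are assumptions for illustration; a production scheduler would also weigh queue depth, memory pressure, and prefix hit rates):

```cpp
#include <cassert>
#include <cstdint>

enum class Decision { RouteToKv, MigrateKv };

// Compare the cost of copying the KV to the scheduled GPU against the cost
// of queueing on the GPU that already holds it.
Decision schedule(uint64_t kv_bytes, double link_bw_bytes_per_s,
                  bool kv_gpu_busy, double queue_penalty_s) {
    double migrate_s = (double)kv_bytes / link_bw_bytes_per_s;  // copy cost
    double route_s   = kv_gpu_busy ? queue_penalty_s : 0.0;     // wait cost
    return route_s <= migrate_s ? Decision::RouteToKv : Decision::MigrateKv;
}
```

With 256 GB of KV over a ~32 GB/s PCIe-class link, migration costs roughly 8–9 seconds, so routing wins even behind a multi-second queue; at NVLink-class 900 GB/s the copy drops to ~0.3 seconds and the tradeoff can flip.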
The system architecture this points toward
The most interesting future AI servers may not look like "a CPU with attached memory and some GPUs." They may look like a memory fabric with compute engines attached — where the fabric is the primary coordination mechanism, and compute is a fungible resource dispatched to wherever the data already lives.
```
CPU      → orchestration, policy, scheduling
GPU      → tensor compute (dispatched to where KV lives)
SmartNIC → data movement, routing, compression offload
CXL pool → expanded shared coherent memory tier
Fabric   → consistency + transport + ownership + routing
```
A coherent fabric is best understood as a memory highway with traffic laws. The highway moves data at high bandwidth. The traffic laws determine who owns the latest copy, who must yield when there's a conflict, and how devices safely share state without programmer intervention. For agentic AI — where the workload is a branching, tool-using, memory-hungry loop — the winning systems will not merely have the fastest GPUs. They will have the best memory orchestration.
As AI becomes agentic and long-context, coherent fabrics become one of the foundational infrastructure requirements for scalable AI memory systems. The companies and research teams that internalize this — and design their infrastructure around memory orchestration rather than raw compute — will have a durable advantage as context windows grow and agent systems multiply.