CXL Is Three Protocols in a Trenchcoat: What .io, .mem, and .cache Actually Do
CXL gets discussed as a single technology that "extends memory." It is not one thing. It is three distinct protocols — CXL.io, CXL.mem, and CXL.cache — layered over PCIe Gen5, each solving a different connectivity problem. Understanding which sub-protocol does what is required to understand why CXL memory disaggregation works, what its limits are, and why coherence is both CXL's most powerful feature and its most dangerous assumption.
- CXL 2.0 bandwidth: ~64 GB/s per direction over PCIe Gen5 x16 (same physical link as PCIe)
- CXL.mem access latency: ~170-250 ns vs. ~80 ns for local DDR5 — ~2-3× penalty
- CXL 3.0 adds fabric switching: up to 4,096 devices in a shared memory pool
- CXL.cache allows accelerator caches to participate in the host's MESI coherence domain
- A single PCIe Gen5 x16 CXL link (~64 GB/s per direction) delivers bandwidth comparable to one DDR5 channel (~51 GB/s): CXL.mem's advantage is capacity, not bandwidth
- What CXL actually is — and why "memory extension" undersells it
- CXL.io: the PCIe compatibility layer
- CXL.mem: memory expansion without coherence
- CXL.cache: bringing accelerator caches into the coherence domain
- Device types: Type 1, 2, 3 and what each sub-protocol combination means
- The latency reality: what 170-250 ns means for AI workloads
- CXL 3.0 and fabric switching: from point-to-point to memory mesh
- The coherence assumption: why .cache is powerful and dangerous
- Which CXL sub-protocol matters for which AI use case
- CXL in the AI memory hierarchy: where it actually fits
1. What CXL actually is — and why "memory extension" undersells it
Compute Express Link (CXL) is an open interconnect standard maintained by the CXL Consortium, built physically on top of the PCIe Gen5 physical layer. This is its first important property: CXL uses PCIe's physical signaling, connectors, and electrical specification. It is not a new physical layer — it is a new set of protocols layered over an existing physical standard. This means CXL devices can use PCIe's ecosystem of silicon, cabling, and connectors while implementing semantically richer protocols above the physical layer.
CXL defines three distinct protocols, each operating at a different layer of the memory hierarchy and providing different semantics:
CXL.io is a PCIe-compatible protocol that provides device discovery, configuration, and I/O access. It is essentially PCIe Gen5 with minor modifications. Every CXL device supports CXL.io — it is the baseline that makes CXL devices recognizable to PCIe host software.
CXL.mem is a protocol for host-initiated access to device-managed memory. The host CPU issues memory read and write operations to a CXL device's DRAM, and the device processes those operations and returns data. This is what enables "CXL memory expansion" — attaching additional DRAM capacity that the CPU can address directly with load/store instructions, as if it were regular system memory.
CXL.cache is a protocol for device-initiated access to host memory, with coherence. A CXL-attached accelerator can issue loads and stores to host DRAM, and those operations participate in the host CPU's cache coherence protocol — the same MESI protocol that governs how CPU cores share data. This is what enables GPU or AI accelerator caches to be coherent with host CPU caches.
The key insight: CXL.io is about device management. CXL.mem is about capacity expansion. CXL.cache is about coherence. They solve different problems and have different performance profiles. A "CXL device" can support any combination of the three. Understanding which combination you need is the prerequisite for CXL architecture decisions.
2. CXL.io: the PCIe compatibility layer
CXL.io is functionally equivalent to PCIe Gen5 with minor protocol modifications. It supports the same TLP (Transaction Layer Packet) structure, the same DLLP (Data Link Layer Packet) flow control, and the same configuration space layout. Every CXL device implements CXL.io — it is the mandatory baseline.
CXL.io matters primarily for device initialization, configuration, and management: reading device capabilities, configuring BARs (Base Address Registers), enabling interrupts, and performing DMA operations. For data plane operations — actually moving tensor data between CPU memory and a CXL-attached memory device — CXL.io is not used. CXL.mem handles those operations.
The CXL.io layer also provides the path for legacy software compatibility. A CXL device that supports CXL.io can be discovered and configured by any standard PCIe driver, even if the driver does not understand CXL.mem or CXL.cache. This backward compatibility is why CXL adoption can be incremental — existing software stacks work with CXL.io even before they are updated to exploit CXL.mem's expanded memory semantics.
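The discovery path CXL.io inherits can be made concrete. The sketch below walks a PCIe extended-capability list looking for a DVSEC (capability ID 0x0023) whose DVSEC vendor ID is the CXL Consortium's 0x1E98, which is how host software recognizes a CXL-capable function. The config-space image is synthetic and the helper name is mine, not from any real driver:

```python
import struct

PCI_EXT_CAP_START = 0x100
EXT_CAP_ID_DVSEC = 0x0023
CXL_VENDOR_ID = 0x1E98  # CXL Consortium vendor ID used in DVSEC headers

def find_cxl_dvsec(cfg: bytes):
    """Walk the PCIe extended capability list in a 4 KiB config-space
    image and return the offset of the first CXL DVSEC, or None."""
    off = PCI_EXT_CAP_START
    while off and off + 8 <= len(cfg):
        (hdr,) = struct.unpack_from("<I", cfg, off)
        cap_id = hdr & 0xFFFF
        nxt = (hdr >> 20) & 0xFFF          # next capability offset
        if cap_id == EXT_CAP_ID_DVSEC:
            (dvsec1,) = struct.unpack_from("<I", cfg, off + 4)
            if (dvsec1 & 0xFFFF) == CXL_VENDOR_ID:
                return off
        if nxt <= off:                     # malformed list: stop
            break
        off = nxt
    return None

# Synthetic config space: a non-CXL vendor DVSEC chained to a CXL DVSEC.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, EXT_CAP_ID_DVSEC | (1 << 16) | (0x200 << 20))
struct.pack_into("<I", cfg, 0x104, 0x8086)         # some other vendor
struct.pack_into("<I", cfg, 0x200, EXT_CAP_ID_DVSEC | (1 << 16))  # end of list
struct.pack_into("<I", cfg, 0x204, CXL_VENDOR_ID)  # CXL Consortium
print(find_cxl_dvsec(bytes(cfg)))  # → 512 (0x200)
```

Because this walk is plain PCIe, any existing enumeration stack can perform it; only the interpretation of the DVSEC contents is CXL-specific.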
3. CXL.mem: memory expansion without coherence
CXL.mem enables the host CPU to issue memory accesses to a CXL-attached device's memory. From the CPU's perspective, the CXL device's DRAM appears as regular system memory — it has physical addresses in the host's address space, load and store instructions can target it, and the OS memory allocator can place data in it. From the device's perspective, it receives memory requests over the CXL link, accesses its local DRAM, and returns responses.
CXL.mem is explicitly not coherent between devices. Multiple CXL.mem devices attached to the same host cannot see each other's writes. Each device sees only its own memory, and the host CPU is the single point through which coherence is maintained. This is sufficient for many use cases — KV cache expansion, weight staging, context storage — where the access pattern is the host CPU or GPU reading and writing a private memory region.
CXL.mem transaction flow — host read from CXL memory device:

```
CPU issues load to physical address 0x2_0000_0000 (mapped to CXL device)

Host bridge:
  1. Detects address is in CXL.mem range
  2. Generates CXL.mem Read request: {tag, address, size}
  3. Sends over PCIe Gen5 x16 physical link

CXL device (Type 3 memory expander):
  4. Receives CXL.mem request at its port
  5. Issues DRAM read from local DDR5/LPDDR5 bank
     DRAM latency: ~80 ns local
     CXL link round-trip latency: ~80-120 ns additional
     Total: ~160-200 ns before host receives data

Host bridge:
  6. Receives CXL.mem Read Response
  7. Writes data to requesting CPU's cache
  8. CPU unblocks

Compare: local DDR5 DIMM access = ~70-85 ns total
         CXL.mem access         = ~160-250 ns (device + link dependent)
         Penalty: ~2-3× — acceptable for cold capacity tier, not hot working set
```
The 2-3× latency penalty is the central constraint of CXL.mem. For memory that contains the hot working set of a computation — the data accessed repeatedly in a tight loop — this penalty accumulates. For memory that contains cold or infrequently accessed data — overflow KV pages, weight buffers for models that don't fit in HBM, context stores for long-running sessions — the penalty is acceptable because the alternative is a software copy from NVMe, which is far more expensive.
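As a rough feel for what the penalty costs in practice, this toy model (latencies from this section; the function name is mine) blends local DDR5 and CXL.mem latency by placement quality:

```python
def effective_latency_ns(hot_fraction: float,
                         local_ns: float = 80.0,
                         cxl_ns: float = 200.0) -> float:
    """Average access latency when `hot_fraction` of accesses hit
    local DDR5 and the remainder go to CXL.mem (numbers from the text)."""
    return hot_fraction * local_ns + (1.0 - hot_fraction) * cxl_ns

# Placement matters: if 95% of accesses stay in local DDR5, the blended
# penalty over all-local DRAM is ~7.5%, not the raw 2.5x of a CXL miss.
for hot in (0.50, 0.90, 0.95):
    print(f"hot={hot:.2f}  avg={effective_latency_ns(hot):.0f} ns")
```

The point is that the 2-3× figure is a per-miss cost, and a tiering policy that keeps the hot working set local pays it rarely.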
3.1 Bandwidth: CXL.mem vs. DDR5 DIMM
CXL.mem bandwidth is bounded by the PCIe Gen5 x16 physical link: approximately 64 GB/s in each direction, or 128 GB/s bidirectional. A DDR5-6400 channel provides approximately 51 GB/s of peak bandwidth, shared between reads and writes. A quad-channel DDR5 system provides ~200 GB/s total. On bandwidth alone, CXL.mem over a single link is competitive with 1-2 DDR5 channels but does not match a full multi-channel DDR5 configuration.
For AI inference use cases — specifically KV cache storage for long-context requests where the access pattern is large sequential reads rather than random accesses — CXL.mem bandwidth is often sufficient. The sequential read bandwidth of a CXL Type 3 device using LPDDR5 or DDR5 behind a well-designed controller can approach 50-60 GB/s, which is adequate for KV page prefetch if the prefetch scheduler has sufficient lead time.
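A back-of-envelope check makes "often sufficient" concrete. The sketch below assumes a hypothetical 80-layer GQA model (8 KV heads, head dim 128, fp16) and asks what read bandwidth the CXL tier must sustain if a fraction of each context's KV pages lives there:

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V per layer: 2 * heads * head_dim elements each token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def cxl_read_demand_gbs(context_tokens, decode_tok_per_s, cxl_fraction,
                        per_token=kv_bytes_per_token()):
    """GB/s the CXL tier must sustain if `cxl_fraction` of each context's
    KV pages live on CXL.mem and every decode step touches all of them."""
    return context_tokens * per_token * cxl_fraction * decode_tok_per_s / 1e9

demand = cxl_read_demand_gbs(131_072, decode_tok_per_s=5, cxl_fraction=0.25)
print(f"{demand:.1f} GB/s needed vs ~55 GB/s device limit")
```

At a 128k context with a quarter of the KV pages on CXL and five decode steps per second, demand lands right at the 50-60 GB/s ceiling, which is why the fraction of KV placed on CXL has to be tuned per deployment rather than assumed.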
4. CXL.cache: bringing accelerator caches into the coherence domain
CXL.cache is architecturally the most interesting and least discussed of the three sub-protocols. It allows a CXL-attached device (an AI accelerator, GPU, or smart NIC) to issue cache-coherent memory accesses to the host CPU's memory — and crucially, those accesses participate in the host's cache coherence protocol.
What this means concretely: if an accelerator has a cache (most modern AI accelerators have substantial on-chip SRAM), and that accelerator uses CXL.cache, the accelerator's cache lines can participate in MESI state transitions alongside CPU cache lines. The host CPU's cache coherence protocol (maintained by the home agent in the CPU uncore) knows about the accelerator's cached copies and will send coherence probes to the accelerator when another agent (another CPU core, or another CXL.cache device) writes to the same data.
CXL.cache coherence transaction — accelerator reads shared data:

```
Scenario: CPU core 0 holds cache line X in Modified state
          AI accelerator issues read to address of X via CXL.cache

Read flow:
  1. Accelerator sends CXL.cache Rd (read) request to host home agent
  2. Home agent snoops CPU core 0 (holds Modified copy)
  3. CPU core 0 writes back cache line X to memory, transitions to Invalid
  4. Home agent sends data to accelerator
  5. Accelerator caches line X in Shared state
  → host memory has clean copy, accelerator has shared copy

Now: CPU core 0 issues write to address of X

Snoop flow:
  1. CPU write requires invalidating all Shared copies
  2. Host home agent sends CXL.cache SnpInv (snoop invalidate) to accelerator
  3. Accelerator must respond: if line is dirty, write it back; then invalidate
  4. Only after accelerator's invalidation response does CPU core 0 proceed

Total coherence overhead: ~200-400 ns for cross-device coherence transaction
```
The coherence overhead — 200-400 ns per cross-device coherence transaction — is significant. It is 3-5× the latency of a local CPU cache snoop (~60-80 ns within a socket). This overhead is acceptable when coherence events are rare (the accelerator and CPU access different data most of the time) and becomes prohibitive when coherence events are frequent (the accelerator and CPU share a hot data structure).
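When "rare" becomes "frequent" can be quantified with a one-line model (assumed numbers: ~100 ns for a coherent host-memory access, 300 ns as the midpoint of the 200-400 ns range above):

```python
def snoop_overhead_factor(snoop_rate, base_ns=100.0, snoop_ns=300.0):
    """Multiplier on average access latency when a fraction `snoop_rate`
    of device accesses incurs a cross-device coherence transaction
    on top of an ordinary coherent host-memory access."""
    return (base_ns + snoop_rate * snoop_ns) / base_ns

# Rare sharing is nearly free; a hot shared structure is not:
for rate in (0.001, 0.01, 0.1, 0.5):
    print(f"snoop rate {rate}: {snoop_overhead_factor(rate):.2f}x")
```

At one coherence event per thousand accesses the tax is a rounding error; at one in two, average latency has 2.5×'d, which is the quantitative version of "don't share a hot data structure across the link."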
4.1 Why this matters for AI accelerator design
CXL.cache is the mechanism that could eliminate the "bounce buffer" problem described in this series' prior essay. Currently, when a CPU needs to hand data to a GPU for processing, it must copy the data to a pinned buffer accessible to the GPU DMA engine. With CXL.cache, the GPU (if it implements CXL.cache) could simply read from the CPU's memory with coherent load operations — no explicit copy needed. The coherence protocol handles the transfer.
The practical barrier today is that most shipping GPU generations implement CXL.io (for device discovery) but not CXL.cache (for coherent access). Integrating CXL.cache into a GPU requires significant die area for the coherence agent logic and changes the GPU's memory model in ways that require software stack modifications. Future accelerator generations — particularly those targeting CPU-GPU shared memory programming models — are more likely to implement CXL.cache.
5. Device types: Type 1, 2, 3 and what each sub-protocol combination means
| CXL Device Type | Protocols Supported | Primary Use Case | Example Devices |
|---|---|---|---|
| Type 1 | CXL.io + CXL.cache | Accelerators that need coherent access to host memory but have no local DRAM | Smart NICs, FPGAs, security accelerators |
| Type 2 | CXL.io + CXL.cache + CXL.mem | Accelerators with local memory that need bidirectional coherent access | Future GPUs, AI training ASICs |
| Type 3 | CXL.io + CXL.mem | Memory expansion devices that add DRAM capacity to a host; no compute function | Samsung CMM-D, SK Hynix AiMM, Micron CZ120 |
The Type 3 memory expander is the device that has reached production first and is driving CXL adoption in AI clusters. Type 3 devices provide CXL.mem access to large DRAM capacities (current devices: 128 GB to 512 GB per module) at the 2-3× latency penalty. They are being deployed as KV cache expansion for long-context LLM inference, where the alternative is NVMe (far slower) or buying more GPUs (far more expensive).
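To see why Type 3 capacity maps well onto KV cache expansion, a quick sizing sketch, assuming a hypothetical 80-layer GQA model at roughly 320 KB of KV per token (8 KV heads, head dim 128, fp16):

```python
GiB = 1 << 30

def sessions_per_module(module_bytes, context_tokens, kv_bytes_per_token=327_680):
    """How many full long-context KV caches fit on one Type 3 module.
    327,680 B/token is an assumed figure for an illustrative 80-layer
    GQA model (2 * 80 layers * 8 KV heads * 128 dims * 2 bytes)."""
    per_session = context_tokens * kv_bytes_per_token
    return module_bytes // per_session, per_session / GiB

n, gib = sessions_per_module(512 * GiB, 32_768)
print(f"{n} sessions of 32k context ({gib:.1f} GiB each)")  # → 51 sessions (10.0 GiB each)
```

Fifty-odd resident long-context sessions per 512 GB module is capacity that neither HBM nor a reasonable DDR5 configuration can spare, which is exactly the gap Type 3 devices fill.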
Type 2 devices — GPUs or AI accelerators with full CXL.cache + CXL.mem — represent the future of CPU-GPU unified memory. They would allow a GPU to load tensors directly from CPU DRAM with coherent access, eliminating the explicit copy operations that currently waste PCIe bandwidth. No major GPU shipping today is a full CXL Type 2 device, though both NVIDIA and AMD have signaled interest in CXL attach for future accelerator generations.
6. The latency reality: what 170-250 ns means for AI workloads
CXL.mem latency of 170-250 ns needs to be contextualized against the workloads that would use it:
| Memory Tier | Access Latency | Bandwidth | AI Use Case |
|---|---|---|---|
| GPU HBM | ~70-100 ns | 3.35 TB/s (H200) | Active weights, hot KV cache, activations |
| Local DDR5 (host) | ~80-100 ns | ~400 GB/s (8-ch) | CPU-side buffers, overflow KV staging |
| CXL.mem (Type 3) | ~170-250 ns | ~50-64 GB/s per device | Cold KV cache, weight staging, context store |
| CXL fabric (3.0) | ~300-500 ns | ~30-50 GB/s effective | Shared KV pool across multiple hosts |
| NVMe SSD | ~50,000-100,000 ns | ~7-12 GB/s | Checkpoint, cold weight storage |
Viewed against the alternatives, CXL.mem's latency is very good. It is 2-3× local DRAM — that sounds bad in isolation, but it is 200-400× better than NVMe. For KV cache overflow that would otherwise spill to NVMe, CXL.mem is a massive improvement. For data that should be in HBM but cannot fit, CXL.mem is a viable second tier.
The workloads where CXL.mem latency is problematic are those with random access patterns to hot data that cannot be predicted and prefetched. A transformer decode step that must access KV cache entries in an unpredictable order will stall on every CXL.mem access. A prefetch engine that can predict KV access patterns and issue prefetch requests 200+ ns in advance can hide the CXL latency entirely. This is why prefetch quality is the critical variable in CXL-augmented KV serving — and why the Memory Intent IR essay's concepts apply directly here.
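The prefetch argument reduces to a simple relation: the stall a decode step sees is whatever part of the CXL latency the prefetcher failed to issue ahead of time. A minimal sketch, assuming a single outstanding access at ~220 ns (mid-range for this section's figures):

```python
def decode_step_stall_ns(prefetch_lead_ns, cxl_latency_ns=220.0):
    """Stall added to a decode step when the KV prefetcher issues its
    request `prefetch_lead_ns` before the data is needed (toy model:
    one outstanding CXL.mem access, no queuing effects)."""
    return max(0.0, cxl_latency_ns - prefetch_lead_ns)

# A predicted sequential scan hides the latency entirely; a
# data-dependent pointer chase exposes all of it.
print(decode_step_stall_ns(prefetch_lead_ns=500.0))  # → 0.0
print(decode_step_stall_ns(prefetch_lead_ns=0.0))    # → 220.0
```

Real prefetchers overlap many outstanding requests, but the break-even condition is the same: predicted access order buys lead time, and lead time converts CXL latency from a stall into background traffic.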
7. CXL 3.0 and fabric switching: from point-to-point to memory mesh
CXL 1.0 and 2.0 are point-to-point protocols: one host connects to one device over a single PCIe link. CXL 2.0 added switching, allowing a single host to connect to multiple CXL devices through a CXL switch — but still with a single root host.
CXL 3.0, released in 2022 and with devices arriving in 2025-2026, changes the architecture fundamentally by adding fabric-level multi-host sharing. A CXL 3.0 memory device can be mapped into the address space of multiple hosts simultaneously. The coherence domain extends across all hosts sharing the device. This enables a rack-scale shared memory pool: a large DRAM device accessible by all servers in the rack, with CXL.cache coherence ensuring consistency.
CXL 3.0 fabric topology — shared memory pool:

```
CXL 3.0 Fabric (example: 4 hosts, 2 TB shared memory pool)

Host 0 ─┐                     ┌─ CXL Memory Device A (512 GB)
Host 1 ─┤   CXL 3.0 Switch    ├─ CXL Memory Device B (512 GB)
Host 2 ─┤   (Fabric Manager)  ├─ CXL Memory Device C (512 GB)
Host 3 ─┘                     └─ CXL Memory Device D (512 GB)

Each host sees all 4 devices in its physical address space
Coherence: multi-host coherence via Back-Invalidation snoops

If Host 0 caches a line from Device A, and Host 2 writes to the same line:
  - Back-Invalidation: Device A sends invalidation to Host 0's cache
  - Host 0 must evict the stale line before Host 2's write completes

Scale: CXL 3.0 spec supports up to 4,096 devices in a fabric pool
Practical 2025-2026: ~8-16 devices per switch due to silicon limits
```
CXL 3.0's shared memory capability is particularly relevant for disaggregated KV serving. Instead of each inference server maintaining its own KV cache pool, multiple servers can access a shared CXL memory fabric. A KV page generated by server A serving a prefill request can be accessed by server B serving the subsequent decode request without any explicit network transfer — the KV page is in shared CXL memory and server B reads it directly with load instructions.
The coherence overhead of multi-host CXL 3.0 — the Back-Invalidation mechanism — adds latency to write operations compared to single-host CXL 2.0. For KV cache workloads that are predominantly read-heavy (KV pages are written once during prefill, then read many times during decode), this overhead is acceptable. The write path pays the coherence cost once; the read path is unaffected.
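The write-once/read-many amortization can be stated in one line (300 ns as an assumed mid-range Back-Invalidation cost; the function is illustrative):

```python
def amortized_coherence_ns(write_coh_ns, reads_per_page):
    """Coherence cost per access for a write-once/read-many KV page:
    the Back-Invalidation cost is paid on the single prefill write and
    amortized across all subsequent decode reads (toy model)."""
    return write_coh_ns / (1 + reads_per_page)

# A KV page written once during prefill and read 1,000 times during decode:
print(f"{amortized_coherence_ns(300.0, 1000):.2f} ns per access")  # → 0.30 ns
```

A sub-nanosecond amortized cost is why read-heavy KV workloads tolerate multi-host coherence that would be ruinous for write-heavy sharing.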
8. The coherence assumption: why .cache is powerful and dangerous
CXL.cache's coherence guarantee comes with an assumption: all participants in the coherence domain behave correctly. The CPU's coherence protocol is implemented in validated silicon with decades of testing. A CXL.cache device from a new vendor may have subtle bugs in its coherence agent implementation that cause incorrect behavior — data corruption, stale reads, or deadlock — in ways that are extraordinarily difficult to debug.
CXL.cache coherence bugs produce silent data corruption — incorrect results that don't trigger exceptions, don't cause crashes, and don't generate error logs. A bug in a CXL.cache accelerator's coherence agent might cause it to serve stale cache lines in a narrow race condition that occurs once every million transactions. Finding this in production inference is nearly impossible without dedicated coherence testing infrastructure. This is the reason CXL.cache deployment should be treated with the same rigor as adding a new socket to a multi-processor system — the coherence agent must be validated before production deployment.
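Why this corruption is silent is easy to demonstrate with a toy litmus harness: a simulated cache that serves a stale line with tiny probability violates a message-passing invariant (flag observed set, data observed stale) without raising any error, so only a checker that knows the invariant ever notices. Everything here is a simulation, not a real coherence test tool:

```python
import random

def run_litmus(trials, stale_prob, seed=0):
    """Message-passing litmus test against a toy cache that serves a
    stale copy of `data` with probability `stale_prob`. A violation is
    observing flag == 1 while data == 0: silent corruption, no exception."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        data_mem, flag_mem = 1, 1   # writer: data = 1, then flag = 1
        data_seen = 0 if rng.random() < stale_prob else data_mem
        if flag_mem == 1 and data_seen == 0:
            violations += 1
    return violations

# A one-in-a-million bug yields roughly one bad read per million
# transactions, and the program otherwise runs to completion normally.
print(run_litmus(1_000_000, stale_prob=1e-6))
```

Real coherence validation (directed litmus suites, randomized stress with invariant checkers) exists precisely because nothing in the normal execution path flags the bad read.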
9. Which CXL sub-protocol matters for which AI use case
| AI Use Case | Relevant Sub-Protocol | Why | Maturity |
|---|---|---|---|
| KV cache overflow (long context) | CXL.mem (Type 3) | Add DRAM capacity CPU-addressable; 2-3× latency acceptable vs NVMe | Production ready (2025) |
| Weight staging (large models) | CXL.mem (Type 3) | Store overflow weights; prefetch to GPU HBM ahead of decode step | Production ready (2025) |
| GPU-CPU unified memory | CXL.cache (Type 2) | GPU cache coherent with CPU; eliminates copy buffers | Future — no shipping Type 2 GPU today |
| Shared KV pool across servers | CXL.mem + CXL 3.0 fabric | Multi-host memory sharing enables KV reuse across server boundaries | Early production (2025-2026) |
| Smart NIC tensor offload | CXL.cache (Type 1) | NIC reads tensors directly from CPU memory without DMA | Emerging — some shipping NICs |
10. CXL in the AI memory hierarchy: where it actually fits
CXL is not a replacement for HBM, DDR5, or NVMe. It is a new tier between local DDR5 and NVMe — faster than NVMe by 200-400×, slower than local DDR5 by 2-3×, and available in much larger capacities than HBM. For AI inference workloads that are fundamentally memory-capacity-bound — long-context KV serving, large model weight storage, session context persistence — CXL Type 3 memory fills a gap that has been genuinely painful with the existing two-tier hierarchy of HBM + NVMe.
The more significant long-term implications come from CXL 3.0 and Type 2 devices. CXL 3.0's shared fabric is a step toward the memory-disaggregated AI cluster — where memory is not physically attached to specific compute nodes but is a shared pool accessible across the rack. Type 2 devices are a step toward unified memory programming models where GPUs and CPUs share a coherent address space without explicit copy operations.
Both of these are three to five years from becoming infrastructure defaults. But the trajectory is clear: CXL is the protocol that will make memory disaggregation operational rather than theoretical. The current hardware — Type 3 devices from Samsung, SK Hynix, and Micron — is the first step. The protocol was designed for a much more ambitious destination.