CXL Is Three Protocols in a Trenchcoat: What .io, .mem, and .cache Actually Do
CXL gets discussed as a single technology that "extends memory." It is not one thing. It is three distinct protocols — CXL.io, CXL.mem, and CXL.cache — layered over PCIe Gen5, each solving a different connectivity problem. Understanding which sub-protocol does what is required to understand why CXL memory disaggregation works, what its limits are, and why coherence is both CXL's most powerful feature and its most dangerous assumption.
- CXL 2.0 bandwidth: ~64 GB/s per direction over PCIe Gen5 x16 (same physical link as PCIe)
- CXL.mem access latency: ~170-250 ns vs. ~80 ns for local DDR5 — ~2-3× penalty
- CXL 3.0 adds fabric switching: up to 4,096 devices in a shared memory pool
- CXL.cache allows accelerator caches to participate in the host's MESI coherence domain
- A single PCIe Gen5 x16 CXL link (~64 GB/s per direction) delivers bandwidth comparable to one DDR5 channel (~51 GB/s): CXL.mem's advantage is capacity, not bandwidth
- What CXL actually is — and why "memory extension" undersells it
- CXL.io: the PCIe compatibility layer
- CXL.mem: memory expansion without coherence
- CXL.cache: bringing accelerator caches into the coherence domain
- Device types: Type 1, 2, 3 and what each sub-protocol combination means
- The latency reality: what 170-250 ns means for AI workloads
- CXL 3.0 and fabric switching: from point-to-point to memory mesh
- The coherence assumption: why .cache is powerful and dangerous
- Which CXL sub-protocol matters for which AI use case
- CXL in the AI memory hierarchy: where it actually fits
1. What CXL actually is — and why "memory extension" undersells it
Compute Express Link (CXL) is an open interconnect standard maintained by the CXL Consortium, built physically on top of the PCIe Gen5 physical layer. This is its first important property: CXL uses PCIe's physical signaling, connectors, and electrical specification. It is not a new physical layer — it is a new set of protocols layered over an existing physical standard. This means CXL devices can use PCIe's ecosystem of silicon, cabling, and connectors while implementing semantically richer protocols above the physical layer.
CXL defines three distinct protocols, each operating at a different layer of the memory hierarchy and providing different semantics:
CXL.io is a PCIe-compatible protocol that provides device discovery, configuration, and I/O access. It is essentially PCIe Gen5 with minor modifications. Every CXL device supports CXL.io — it is the baseline that makes CXL devices recognizable to PCIe host software.
CXL.mem is a protocol for host-initiated access to device-managed memory. The host CPU issues memory read and write operations to a CXL device's DRAM, and the device processes those operations and returns data. This is what enables "CXL memory expansion" — attaching additional DRAM capacity that the CPU can address directly with load/store instructions, as if it were regular system memory.
CXL.cache is a protocol for device-initiated access to host memory, with coherence. A CXL-attached accelerator can issue loads and stores to host DRAM, and those operations participate in the host CPU's cache coherence protocol — the same MESI protocol that governs how CPU cores share data. This is what enables GPU or AI accelerator caches to be coherent with host CPU caches.
The key insight: CXL.io is about device management. CXL.mem is about capacity expansion. CXL.cache is about coherence. They solve different problems and have different performance profiles. A "CXL device" can support any combination of the three. Understanding which combination you need is the prerequisite for CXL architecture decisions.
2. CXL.io: the PCIe compatibility layer
CXL.io is functionally equivalent to PCIe Gen5 with minor protocol modifications. It supports the same TLP (Transaction Layer Packet) structure, the same DLLP (Data Link Layer Packet) flow control, and the same configuration space layout. Every CXL device implements CXL.io — it is the mandatory baseline.
CXL.io matters primarily for device initialization, configuration, and management: reading device capabilities, configuring BARs (Base Address Registers), enabling interrupts, and performing DMA operations. For data plane operations — actually moving tensor data between CPU memory and a CXL-attached memory device — CXL.io is not used. CXL.mem handles those operations.
The CXL.io layer also provides the path for legacy software compatibility. A CXL device that supports CXL.io can be discovered and configured by any standard PCIe driver, even if the driver does not understand CXL.mem or CXL.cache. This backward compatibility is why CXL adoption can be incremental — existing software stacks work with CXL.io even before they are updated to exploit CXL.mem's expanded memory semantics.
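The discovery path CXL.io inherits can be made concrete. The sketch below walks a PCIe extended-capability list looking for a DVSEC (capability ID 0x0023) whose DVSEC vendor ID is the CXL Consortium's 0x1E98, which is how host software recognizes a CXL-capable function. The config-space image is synthetic and the helper name is mine, not from any real driver:

```python
import struct

PCI_EXT_CAP_START = 0x100
EXT_CAP_ID_DVSEC = 0x0023
CXL_VENDOR_ID = 0x1E98  # CXL Consortium vendor ID used in DVSEC headers

def find_cxl_dvsec(cfg: bytes):
    """Walk the PCIe extended capability list in a 4 KiB config-space
    image and return the offset of the first CXL DVSEC, or None."""
    off = PCI_EXT_CAP_START
    while off and off + 8 <= len(cfg):
        (hdr,) = struct.unpack_from("<I", cfg, off)
        cap_id = hdr & 0xFFFF
        nxt = (hdr >> 20) & 0xFFF          # next capability offset
        if cap_id == EXT_CAP_ID_DVSEC:
            (dvsec1,) = struct.unpack_from("<I", cfg, off + 4)
            if (dvsec1 & 0xFFFF) == CXL_VENDOR_ID:
                return off
        if nxt <= off:                     # malformed list: stop
            break
        off = nxt
    return None

# Synthetic config space: a non-CXL vendor DVSEC chained to a CXL DVSEC.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, EXT_CAP_ID_DVSEC | (1 << 16) | (0x200 << 20))
struct.pack_into("<I", cfg, 0x104, 0x8086)         # some other vendor
struct.pack_into("<I", cfg, 0x200, EXT_CAP_ID_DVSEC | (1 << 16))  # end of list
struct.pack_into("<I", cfg, 0x204, CXL_VENDOR_ID)  # CXL Consortium
print(find_cxl_dvsec(bytes(cfg)))  # → 512 (0x200)
```

Because this walk is plain PCIe, any existing enumeration stack can perform it; only the interpretation of the DVSEC contents is CXL-specific.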
3. CXL.mem: memory expansion without coherence
CXL.mem enables the host CPU to issue memory accesses to a CXL-attached device's memory. From the CPU's perspective, the CXL device's DRAM appears as regular system memory — it has physical addresses in the host's address space, load and store instructions can target it, and the OS memory allocator can place data in it. From the device's perspective, it receives memory requests over the CXL link, accesses its local DRAM, and returns responses.
CXL.mem is explicitly not coherent between devices. Multiple CXL.mem devices attached to the same host cannot see each other's writes. Each device sees only its own memory, and the host CPU is the single point through which coherence is maintained. This is sufficient for many use cases — KV cache expansion, weight staging, context storage — where the access pattern is the host CPU or GPU reading and writing a private memory region.
CXL.mem transaction flow — host read from CXL memory device:

```
CPU issues load to physical address 0x2_0000_0000 (mapped to CXL device)

Host bridge:
  1. Detects address is in CXL.mem range
  2. Generates CXL.mem Read request: {tag, address, size}
  3. Sends over PCIe Gen5 x16 physical link

CXL device (Type 3 memory expander):
  4. Receives CXL.mem request at its port
  5. Issues DRAM read from local DDR5/LPDDR5 bank
     DRAM latency: ~80 ns local
     CXL link round-trip latency: ~80-120 ns additional
     Total: ~160-200 ns before host receives data

Host bridge:
  6. Receives CXL.mem Read Response
  7. Writes data to requesting CPU's cache
  8. CPU unblocks

Compare: local DDR5 DIMM access = ~70-85 ns total
         CXL.mem access         = ~160-250 ns (device + link dependent)
         Penalty: ~2-3× — acceptable for cold capacity tier, not hot working set
```
The 2-3× latency penalty is the central constraint of CXL.mem. For memory that contains the hot working set of a computation — the data accessed repeatedly in a tight loop — this penalty accumulates. For memory that contains cold or infrequently accessed data — overflow KV pages, weight buffers for models that don't fit in HBM, context stores for long-running sessions — the penalty is acceptable because the alternative is a software copy from NVMe, which is far more expensive.
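As a rough feel for what the penalty costs in practice, this toy model (latencies from this section; the function name is mine) blends local DDR5 and CXL.mem latency by placement quality:

```python
def effective_latency_ns(hot_fraction: float,
                         local_ns: float = 80.0,
                         cxl_ns: float = 200.0) -> float:
    """Average access latency when `hot_fraction` of accesses hit
    local DDR5 and the remainder go to CXL.mem (numbers from the text)."""
    return hot_fraction * local_ns + (1.0 - hot_fraction) * cxl_ns

# Placement matters: if 95% of accesses stay in local DDR5, the blended
# penalty over all-local DRAM is ~7.5%, not the raw 2.5x of a CXL miss.
for hot in (0.50, 0.90, 0.95):
    print(f"hot={hot:.2f}  avg={effective_latency_ns(hot):.0f} ns")
```

The point is that the 2-3× figure is a per-miss cost, and a tiering policy that keeps the hot working set local pays it rarely.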
3.1 Bandwidth: CXL.mem vs. DDR5 DIMM
CXL.mem bandwidth is bounded by the PCIe Gen5 x16 physical link: approximately 64 GB/s in each direction, or 128 GB/s bidirectional. A DDR5-6400 channel provides approximately 51 GB/s of peak bandwidth, shared between reads and writes. A quad-channel DDR5 system provides ~200 GB/s total. On bandwidth alone, CXL.mem over a single link is competitive with 1-2 DDR5 channels but does not match a full multi-channel DDR5 configuration.
For AI inference use cases — specifically KV cache storage for long-context requests where the access pattern is large sequential reads rather than random accesses — CXL.mem bandwidth is often sufficient. The sequential read bandwidth of a CXL Type 3 device using LPDDR5 or DDR5 behind a well-designed controller can approach 50-60 GB/s, which is adequate for KV page prefetch if the prefetch scheduler has sufficient lead time.
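A back-of-envelope check makes "often sufficient" concrete. The sketch below assumes a hypothetical 80-layer GQA model (8 KV heads, head dim 128, fp16) and asks what read bandwidth the CXL tier must sustain if a fraction of each context's KV pages lives there:

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V per layer: 2 * heads * head_dim elements each token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def cxl_read_demand_gbs(context_tokens, decode_tok_per_s, cxl_fraction,
                        per_token=kv_bytes_per_token()):
    """GB/s the CXL tier must sustain if `cxl_fraction` of each context's
    KV pages live on CXL.mem and every decode step touches all of them."""
    return context_tokens * per_token * cxl_fraction * decode_tok_per_s / 1e9

demand = cxl_read_demand_gbs(131_072, decode_tok_per_s=5, cxl_fraction=0.25)
print(f"{demand:.1f} GB/s needed vs ~55 GB/s device limit")
```

At a 128k context with a quarter of the KV pages on CXL and five decode steps per second, demand lands right at the 50-60 GB/s ceiling, which is why the fraction of KV placed on CXL has to be tuned per deployment rather than assumed.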
4. CXL.cache: bringing accelerator caches into the coherence domain
CXL.cache is architecturally the most interesting and least discussed of the three sub-protocols. It allows a CXL-attached device (an AI accelerator, GPU, or smart NIC) to issue cache-coherent memory accesses to the host CPU's memory — and crucially, those accesses participate in the host's cache coherence protocol.
What this means concretely: if an accelerator has a cache (most modern AI accelerators have substantial on-chip SRAM), and that accelerator uses CXL.cache, the accelerator's cache lines can participate in MESI state transitions alongside CPU cache lines. The host CPU's cache coherence protocol (maintained by the home agent in the CPU uncore) knows about the accelerator's cached copies and will send coherence probes to the accelerator when another agent (another CPU core, or another CXL.cache device) writes to the same data.
CXL.cache coherence transaction — accelerator reads shared data:

```
Scenario: CPU core 0 holds cache line X in Modified state
          AI accelerator issues read to address of X via CXL.cache

Read flow:
  1. Accelerator sends CXL.cache Rd (read) request to host home agent
  2. Home agent snoops CPU core 0 (holds Modified copy)
  3. CPU core 0 writes back cache line X to memory, transitions to Invalid
  4. Home agent sends data to accelerator
  5. Accelerator caches line X in Shared state
  → host memory has clean copy, accelerator has shared copy

Now: CPU core 0 issues write to address of X

Snoop flow:
  1. CPU write requires invalidating all Shared copies
  2. Host home agent sends CXL.cache SnpInv (snoop invalidate) to accelerator
  3. Accelerator must respond: if line is dirty, write it back; then invalidate
  4. Only after accelerator's invalidation response does CPU core 0 proceed

Total coherence overhead: ~200-400 ns for cross-device coherence transaction
```
The coherence overhead — 200-400 ns per cross-device coherence transaction — is significant. It is 3-5× the latency of a local CPU cache snoop (~60-80 ns within a socket). This overhead is acceptable when coherence events are rare (the accelerator and CPU access different data most of the time) and becomes prohibitive when coherence events are frequent (the accelerator and CPU share a hot data structure).
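When "rare" becomes "frequent" can be quantified with a one-line model (assumed numbers: ~100 ns for a coherent host-memory access, 300 ns as the midpoint of the 200-400 ns range above):

```python
def snoop_overhead_factor(snoop_rate, base_ns=100.0, snoop_ns=300.0):
    """Multiplier on average access latency when a fraction `snoop_rate`
    of device accesses incurs a cross-device coherence transaction
    on top of an ordinary coherent host-memory access."""
    return (base_ns + snoop_rate * snoop_ns) / base_ns

# Rare sharing is nearly free; a hot shared structure is not:
for rate in (0.001, 0.01, 0.1, 0.5):
    print(f"snoop rate {rate}: {snoop_overhead_factor(rate):.2f}x")
```

At one coherence event per thousand accesses the tax is a rounding error; at one in two, average latency has 2.5×'d, which is the quantitative version of "don't share a hot data structure across the link."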
4.1 Why this matters for AI accelerator design
CXL.cache is the mechanism that could eliminate the "bounce buffer" problem described in this series' prior essay. Currently, when a CPU needs to hand data to a GPU for processing, it must copy the data to a pinned buffer accessible to the GPU DMA engine. With CXL.cache, the GPU (if it implements CXL.cache) could simply read from the CPU's memory with coherent load operations — no explicit copy needed. The coherence protocol handles the transfer.
The practical barrier today is that most shipping GPU generations implement CXL.io (for device discovery) but not CXL.cache (for coherent access). Integrating CXL.cache into a GPU requires significant die area for the coherence agent logic and changes the GPU's memory model in ways that require software stack modifications. Future accelerator generations — particularly those targeting CPU-GPU shared memory programming models — are more likely to implement CXL.cache.
5. Device types: Type 1, 2, 3 and what each sub-protocol combination means
| CXL Device Type | Protocols Supported | Primary Use Case | Example Devices |
|---|---|---|---|
| Type 1 | CXL.io + CXL.cache | Accelerators that need coherent access to host memory but have no local DRAM | Smart NICs, FPGAs, security accelerators |
| Type 2 | CXL.io + CXL.cache + CXL.mem | Accelerators with local memory that need bidirectional coherent access | Future GPUs, AI training ASICs |
| Type 3 | CXL.io + CXL.mem | Memory expansion devices that add DRAM capacity to a host; no compute function | Samsung CMM-D, SK Hynix AiMM, Micron CZ120 |
The Type 3 memory expander is the device that has reached production first and is driving CXL adoption in AI clusters. Type 3 devices provide CXL.mem access to large DRAM capacities (current devices: 128 GB to 512 GB per module) at the 2-3× latency penalty. They are being deployed as KV cache expansion for long-context LLM inference, where the alternative is NVMe (far slower) or buying more GPUs (far more expensive).
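To see why Type 3 capacity maps well onto KV cache expansion, a quick sizing sketch, assuming a hypothetical 80-layer GQA model at roughly 320 KB of KV per token (8 KV heads, head dim 128, fp16):

```python
GiB = 1 << 30

def sessions_per_module(module_bytes, context_tokens, kv_bytes_per_token=327_680):
    """How many full long-context KV caches fit on one Type 3 module.
    327,680 B/token is an assumed figure for an illustrative 80-layer
    GQA model (2 * 80 layers * 8 KV heads * 128 dims * 2 bytes)."""
    per_session = context_tokens * kv_bytes_per_token
    return module_bytes // per_session, per_session / GiB

n, gib = sessions_per_module(512 * GiB, 32_768)
print(f"{n} sessions of 32k context ({gib:.1f} GiB each)")  # → 51 sessions (10.0 GiB each)
```

Fifty-odd resident long-context sessions per 512 GB module is capacity that neither HBM nor a reasonable DDR5 configuration can spare, which is exactly the gap Type 3 devices fill.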
Type 2 devices — GPUs or AI accelerators with full CXL.cache + CXL.mem — represent the future of CPU-GPU unified memory. They would allow a GPU to load tensors directly from CPU DRAM with coherent access, eliminating the explicit copy operations that currently waste PCIe bandwidth. No major GPU shipping today is a full CXL Type 2 device, though both NVIDIA and AMD have signaled interest in CXL attach for future accelerator generations.
6. The latency reality: what 170-250 ns means for AI workloads
CXL.mem latency of 170-250 ns needs to be contextualized against the workloads that would use it:
| Memory Tier | Access Latency | Bandwidth | AI Use Case |
|---|---|---|---|
| GPU HBM | ~70-100 ns | 3.35 TB/s (H200) | Active weights, hot KV cache, activations |
| Local DDR5 (host) | ~80-100 ns | ~400 GB/s (8-ch) | CPU-side buffers, overflow KV staging |
| CXL.mem (Type 3) | ~170-250 ns | ~50-64 GB/s per device | Cold KV cache, weight staging, context store |
| CXL fabric (3.0) | ~300-500 ns | ~30-50 GB/s effective | Shared KV pool across multiple hosts |
| NVMe SSD | ~50,000-100,000 ns | ~7-12 GB/s | Checkpoint, cold weight storage |
Viewed against the alternatives, CXL.mem's latency is very good. It is 2-3× local DRAM — that sounds bad in isolation, but it is 200-400× better than NVMe. For KV cache overflow that would otherwise spill to NVMe, CXL.mem is a massive improvement. For data that should be in HBM but cannot fit, CXL.mem is a viable second tier.
The workloads where CXL.mem latency is problematic are those with random access patterns to hot data that cannot be predicted and prefetched. A transformer decode step that must access KV cache entries in an unpredictable order will stall on every CXL.mem access. A prefetch engine that can predict KV access patterns and issue prefetch requests 200+ ns in advance can hide the CXL latency entirely. This is why prefetch quality is the critical variable in CXL-augmented KV serving — and why the Memory Intent IR essay's concepts apply directly here.
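The prefetch argument reduces to a simple relation: the stall a decode step sees is whatever part of the CXL latency the prefetcher failed to issue ahead of time. A minimal sketch, assuming a single outstanding access at ~220 ns (mid-range for this section's figures):

```python
def decode_step_stall_ns(prefetch_lead_ns, cxl_latency_ns=220.0):
    """Stall added to a decode step when the KV prefetcher issues its
    request `prefetch_lead_ns` before the data is needed (toy model:
    one outstanding CXL.mem access, no queuing effects)."""
    return max(0.0, cxl_latency_ns - prefetch_lead_ns)

# A predicted sequential scan hides the latency entirely; a
# data-dependent pointer chase exposes all of it.
print(decode_step_stall_ns(prefetch_lead_ns=500.0))  # → 0.0
print(decode_step_stall_ns(prefetch_lead_ns=0.0))    # → 220.0
```

Real prefetchers overlap many outstanding requests, but the break-even condition is the same: predicted access order buys lead time, and lead time converts CXL latency from a stall into background traffic.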
7. CXL 3.0 and fabric switching: from point-to-point to memory mesh
CXL 1.0 and 2.0 are point-to-point protocols: one host connects to one device over a single PCIe link. CXL 2.0 added switching, allowing a single host to connect to multiple CXL devices through a CXL switch — but still with a single root host.
CXL 3.0, released in 2022 and with devices arriving in 2025-2026, changes the architecture fundamentally by adding fabric-level multi-host sharing. A CXL 3.0 memory device can be mapped into the address space of multiple hosts simultaneously. The coherence domain extends across all hosts sharing the device. This enables a rack-scale shared memory pool: a large DRAM device accessible by all servers in the rack, with CXL.cache coherence ensuring consistency.
CXL 3.0 fabric topology — shared memory pool:

```
CXL 3.0 Fabric (example: 4 hosts, 2 TB shared memory pool)

Host 0 ─┐                     ┌─ CXL Memory Device A (512 GB)
Host 1 ─┤   CXL 3.0 Switch    ├─ CXL Memory Device B (512 GB)
Host 2 ─┤   (Fabric Manager)  ├─ CXL Memory Device C (512 GB)
Host 3 ─┘                     └─ CXL Memory Device D (512 GB)

Each host sees all 4 devices in its physical address space
Coherence: multi-host coherence via Back-Invalidation snoops

If Host 0 caches a line from Device A, and Host 2 writes to the same line:
  - Back-Invalidation: Device A sends invalidation to Host 0's cache
  - Host 0 must evict the stale line before Host 2's write completes

Scale: CXL 3.0 spec supports up to 4,096 devices in a fabric pool
Practical 2025-2026: ~8-16 devices per switch due to silicon limits
```
CXL 3.0's shared memory capability is particularly relevant for disaggregated KV serving. Instead of each inference server maintaining its own KV cache pool, multiple servers can access a shared CXL memory fabric. A KV page generated by server A serving a prefill request can be accessed by server B serving the subsequent decode request without any explicit network transfer — the KV page is in shared CXL memory and server B reads it directly with load instructions.
The coherence overhead of multi-host CXL 3.0 — the Back-Invalidation mechanism — adds latency to write operations compared to single-host CXL 2.0. For KV cache workloads that are predominantly read-heavy (KV pages are written once during prefill, then read many times during decode), this overhead is acceptable. The write path pays the coherence cost once; the read path is unaffected.
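The write-once/read-many amortization can be stated in one line (300 ns as an assumed mid-range Back-Invalidation cost; the function is illustrative):

```python
def amortized_coherence_ns(write_coh_ns, reads_per_page):
    """Coherence cost per access for a write-once/read-many KV page:
    the Back-Invalidation cost is paid on the single prefill write and
    amortized across all subsequent decode reads (toy model)."""
    return write_coh_ns / (1 + reads_per_page)

# A KV page written once during prefill and read 1,000 times during decode:
print(f"{amortized_coherence_ns(300.0, 1000):.2f} ns per access")  # → 0.30 ns
```

A sub-nanosecond amortized cost is why read-heavy KV workloads tolerate multi-host coherence that would be ruinous for write-heavy sharing.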
8. The coherence assumption: why .cache is powerful and dangerous
CXL.cache's coherence guarantee comes with an assumption: all participants in the coherence domain behave correctly. The CPU's coherence protocol is implemented in validated silicon with decades of testing. A CXL.cache device from a new vendor may have subtle bugs in its coherence agent implementation that cause incorrect behavior — data corruption, stale reads, or deadlock — in ways that are extraordinarily difficult to debug.
CXL.cache coherence bugs produce silent data corruption — incorrect results that don't trigger exceptions, don't cause crashes, and don't generate error logs. A bug in a CXL.cache accelerator's coherence agent might cause it to serve stale cache lines in a narrow race condition that occurs once every million transactions. Finding this in production inference is nearly impossible without dedicated coherence testing infrastructure. This is the reason CXL.cache deployment should be treated with the same rigor as adding a new socket to a multi-processor system — the coherence agent must be validated before production deployment.
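Why this corruption is silent is easy to demonstrate with a toy litmus harness: a simulated cache that serves a stale line with tiny probability violates a message-passing invariant (flag observed set, data observed stale) without raising any error, so only a checker that knows the invariant ever notices. Everything here is a simulation, not a real coherence test tool:

```python
import random

def run_litmus(trials, stale_prob, seed=0):
    """Message-passing litmus test against a toy cache that serves a
    stale copy of `data` with probability `stale_prob`. A violation is
    observing flag == 1 while data == 0: silent corruption, no exception."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        data_mem, flag_mem = 1, 1   # writer: data = 1, then flag = 1
        data_seen = 0 if rng.random() < stale_prob else data_mem
        if flag_mem == 1 and data_seen == 0:
            violations += 1
    return violations

# A one-in-a-million bug yields roughly one bad read per million
# transactions, and the program otherwise runs to completion normally.
print(run_litmus(1_000_000, stale_prob=1e-6))
```

Real coherence validation (directed litmus suites, randomized stress with invariant checkers) exists precisely because nothing in the normal execution path flags the bad read.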
9. Which CXL sub-protocol matters for which AI use case
| AI Use Case | Relevant Sub-Protocol | Why | Maturity |
|---|---|---|---|
| KV cache overflow (long context) | CXL.mem (Type 3) | Add DRAM capacity CPU-addressable; 2-3× latency acceptable vs NVMe | Production ready (2025) |
| Weight staging (large models) | CXL.mem (Type 3) | Store overflow weights; prefetch to GPU HBM ahead of decode step | Production ready (2025) |
| GPU-CPU unified memory | CXL.cache (Type 2) | GPU cache coherent with CPU; eliminates copy buffers | Future — no shipping Type 2 GPU today |
| Shared KV pool across servers | CXL.mem + CXL 3.0 fabric | Multi-host memory sharing enables KV reuse across server boundaries | Early production (2025-2026) |
| Smart NIC tensor offload | CXL.cache (Type 1) | NIC reads tensors directly from CPU memory without DMA | Emerging — some shipping NICs |
10. CXL in the AI memory hierarchy: where it actually fits
CXL is not a replacement for HBM, DDR5, or NVMe. It is a new tier between local DDR5 and NVMe — faster than NVMe by 200-400×, slower than local DDR5 by 2-3×, and available in much larger capacities than HBM. For AI inference workloads that are fundamentally memory-capacity-bound — long-context KV serving, large model weight storage, session context persistence — CXL Type 3 memory fills a gap that has been genuinely painful with the existing two-tier hierarchy of HBM + NVMe.
The more significant long-term implications come from CXL 3.0 and Type 2 devices. CXL 3.0's shared fabric is a step toward the memory-disaggregated AI cluster — where memory is not physically attached to specific compute nodes but is a shared pool accessible across the rack. Type 2 devices are a step toward unified memory programming models where GPUs and CPUs share a coherent address space without explicit copy operations.
Both of these are three to five years from becoming infrastructure defaults. But the trajectory is clear: CXL is the protocol that will make memory disaggregation operational rather than theoretical. The current hardware — Type 3 devices from Samsung, SK Hynix, and Micron — is the first step. The protocol was designed for a much more ambitious destination.