What Is Disaggregated Memory?
Disaggregated memory decouples memory resources from the compute nodes that use them. Instead of being fixed to a single server's motherboard, memory sits in its own pool that many hosts can reach over a high-speed interconnect.
In a traditional server, memory (DRAM, HBM) is tightly bound to a specific CPU or GPU socket. It is fast to access, but it is easily stranded: if a workload needs more memory than that one server holds, you're stuck, and if a workload uses only 30% of the installed memory, the rest sits idle. You can't share it, loan it, or repurpose it without physically reconfiguring the system.
Disaggregated memory breaks this 1:1 binding. Memory is hosted in dedicated memory nodes (often called memory expanders or memory blades) connected to compute hosts over a fabric such as CXL (Compute Express Link, which runs on the PCIe 5.0 and later physical layer) or NVLink. Any host on the fabric can access any memory node, subject to permissions and bandwidth. Memory can be provisioned dynamically, shared across workloads, and scaled independently of compute.
Traditional architecture strands memory in per-server silos. Disaggregated memory pools it over a high-speed fabric, improving utilisation from ~35% to ~80% in typical deployments.
Key insight: Disaggregated memory does not make individual memory accesses faster — local HBM or DDR5 will always be lower latency than a networked pool. The value is in capacity, flexibility, utilisation, and total cost of ownership at scale.
Why This Matters Right Now
Three converging forces have pushed disaggregated memory from a research concept to a production priority between 2022 and 2025.
1. AI models have outgrown individual servers
GPT-3 (175B parameters) required roughly 350 GB of memory to hold the model weights in FP16. GPT-4 class models are estimated at 1+ trillion parameters — requiring multiple terabytes. No single server, even with 8× HBM3e-stacked GPUs, holds that. Inference serving must shard models across many nodes, and the interconnect between those nodes becomes the bottleneck. Disaggregated memory gives the cluster a larger, unified view of memory that individual workloads can tap without explicit sharding.
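The arithmetic behind these figures is simple enough to sketch. The helper below is a minimal illustration that counts weights only (no activations, KV cache, or optimizer state); the function name and the dtype table are assumptions made for this example, and 1 GB is taken as 10^9 bytes.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="fp16"):
    """Memory for model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(f"175B params in FP16: {weight_memory_gb(175e9):.0f} GB")
print(f"1T params in FP16:   {weight_memory_gb(1e12):.0f} GB")
```

A 175B-parameter model at 2 bytes per parameter comes to 350 GB, and a trillion-parameter model to about 2 TB, before any runtime state is added.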
2. Memory efficiency is the new compute efficiency
Studies consistently show that for large language model inference, memory bandwidth, not raw FLOPS, is the dominant constraint on throughput and latency. A GPU whose memory bandwidth is saturated while its arithmetic units run at only 40% of theoretical FLOPS is a partly wasted investment. Disaggregated memory lets operators provision exactly the memory a workload needs, scale it independently of GPU count, and avoid the common pattern of over-buying compute to get more attached memory.
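A back-of-the-envelope sketch shows why bandwidth rather than FLOPS tends to dominate single-stream decoding. It assumes every weight is read from memory once per generated token and used in one multiply-add. The model size (70B) and the hardware figures (3 TB/s, 1 PFLOP/s) are illustrative placeholders, not measurements of any specific GPU.

```python
def decode_step_bounds(num_params, bytes_per_param, mem_bw_bytes_s, peak_flops):
    """Lower bounds on the time to generate one token at batch size 1.

    Assumes each weight is read once and used in one multiply-add (2 FLOPs).
    Ignores KV cache traffic, activations, and kernel launch overheads.
    """
    bytes_moved = num_params * bytes_per_param
    flops = 2 * num_params
    t_memory = bytes_moved / mem_bw_bytes_s   # time if limited by bandwidth
    t_compute = flops / peak_flops            # time if limited by arithmetic
    return t_memory, t_compute


# Illustrative placeholder numbers, not a specific product.
t_mem, t_cmp = decode_step_bounds(
    num_params=70e9, bytes_per_param=2, mem_bw_bytes_s=3e12, peak_flops=1e15
)
print(f"bandwidth-limited time: {t_mem * 1e3:.2f} ms per token")
print(f"compute-limited time:   {t_cmp * 1e3:.2f} ms per token")
print(f"arithmetic units busy at most {t_cmp / t_mem:.1%} of the time")
```

Under these assumptions the bandwidth-limited time is hundreds of times larger than the compute-limited time, which is why batching (reusing each weight read across many requests) matters so much for inference economics.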
3. CXL has made it practical
Earlier memory disaggregation attempts (Gen-Z, OpenCAPI, CCIX) struggled with latency penalties, limited ecosystem adoption, and immature tooling. CXL (Compute Express Link), built on the PCIe physical layer, standardized memory-semantic access over a high-speed fabric. CXL 1.1 (2019) enabled memory expansion. CXL 2.0 (2020) added memory pooling and switching. CXL 3.0 (2022) and 3.1 (2023) added peer-to-peer memory access and fabric topologies that make true memory disaggregation practical at data-center scale.
CXL's rapid generational cadence (3 major versions in 3 years) is a strong signal of industry urgency — hyperscalers and CPU vendors are all aligned on the protocol direction.
How Disaggregated Memory Works
At its core, disaggregated memory requires three components: memory devices (the actual DRAM), a transport fabric (the interconnect), and a management layer (software that maps virtual memory addresses to physical pool locations). The challenge is that each layer adds latency compared to local memory; the goal is to minimize that overhead while maximizing the flexibility gained.
Memory devices
Memory expander modules (sometimes called CMM — CXL Memory Modules) sit in dedicated memory drawers or blades. They commonly use DDR5 DRAM with a CXL controller ASIC that exposes the memory semantics over the CXL fabric. Some products incorporate HBM (High Bandwidth Memory) for bandwidth-intensive workloads. Emerging designs integrate compute-near-memory (CNM) elements — small processing units in the memory device that can pre-process data before sending it to the host.
The CXL fabric
CXL defines three protocols that are multiplexed over the same link: CXL.io (PCIe semantics, for device discovery and control), CXL.cache (a coherency protocol that lets devices cache host memory), and CXL.mem (the critical one for disaggregation: host-managed device memory accessible via ordinary load/store operations). For pooled memory, a CXL switch sits between compute hosts and memory endpoints, routing memory traffic and enabling one-to-many and many-to-many topologies.
Software and memory management
The operating system (Linux kernel 5.18+ has initial CXL support, with substantive improvements through 6.x) exposes disaggregated memory as a NUMA node, typically one with no CPUs attached. Applications can allocate from this memory region, though with higher access latency than local memory. Frameworks like DAMON (Data Access MONitor), memory-tiering strategies, and AI-specific memory managers (NVIDIA's Unified Memory, PyTorch's expandable segments) can automatically migrate hot data to fast local memory and cooler data to the disaggregated pool.
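As a rough illustration of how this surfaces on a running system, the sketch below scans the Linux sysfs NUMA directory for nodes whose `cpulist` file is empty. The function name is made up for this example. A CPU-less node is only a hint, since other memory types can also appear this way, so treat the result as a starting point rather than a definitive CXL detector.

```python
import re
from pathlib import Path


def cpuless_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the IDs of NUMA nodes that have no CPUs attached.

    Reads node<N>/cpulist under the given sysfs directory; an empty
    cpulist means the node has no CPUs. Returns [] if the directory
    does not exist (for example on a non-Linux system).
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    found = []
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        if match and cpulist.is_file() and cpulist.read_text().strip() == "":
            found.append(int(match.group(1)))
    return sorted(found)


print("CPU-less NUMA nodes:", cpuless_numa_nodes())
```

Once such a node is identified, a process can be steered toward it with standard NUMA tooling, for example numactl's `--membind` option.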
The critical design insight: CXL pool memory (~200–350 ns) is slower than local DDR5 (~80 ns), but dramatically faster than RDMA-based disaggregation (on the order of microseconds). For workloads that can tier their data, this is an acceptable tradeoff for massive capacity gains.
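The tradeoff can be made concrete with a weighted average. The sketch below assumes a simple two-tier model in which a fixed fraction of accesses hits local DDR5 and the rest goes to the pool. The 80 ns figure comes from the text above, 275 ns is simply the midpoint of the quoted 200–350 ns range, and the function name is an assumption for this example.

```python
def effective_latency_ns(local_hit_fraction, local_ns=80.0, pool_ns=275.0):
    """Average access latency for a two-tier memory system.

    local_hit_fraction is the share of accesses served from local DRAM;
    the remainder is served from the CXL pool.
    """
    if not 0.0 <= local_hit_fraction <= 1.0:
        raise ValueError("local_hit_fraction must be between 0 and 1")
    return local_hit_fraction * local_ns + (1.0 - local_hit_fraction) * pool_ns


for fraction in (1.0, 0.9, 0.5):
    print(f"{fraction:.0%} local -> {effective_latency_ns(fraction):.1f} ns average")
```

With 90% of accesses kept local, the average rises from 80 ns to roughly 100 ns, which is why effective tiering matters more than the raw pool latency.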
Primary Use Cases
KV Cache Offloading
The KV cache for long-context LLM inference grows linearly with sequence length and batch size. Storing it in a disaggregated pool rather than GPU HBM allows larger batch sizes without adding GPU nodes — directly improving throughput and cost per token.
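The linear growth is easy to quantify. The sketch below uses the usual sizing formula for a transformer that stores one key and one value vector per layer, per KV head, per token. The model shape in the example (80 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration, not a specific published model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Size of the KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape and workload, FP16 cache entries.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32768, batch=16)
print(f"KV cache: {size / 2**30:.0f} GiB")
```

Doubling either the batch size or the context length doubles the result, which is how the cache can quickly exceed the HBM left over after the weights are loaded.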
Optimizer State & Activations
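Optimizer states (Adam keeps two moment estimates per parameter, roughly 2× the model size at equal precision) sit idle during the forward and backward passes, and intermediate activations sit idle between the forward pass that produces them and the backward pass that consumes them. Both can be tiered to disaggregated memory in the meantime, freeing HBM for weights and gradients. ZeRO-Infinity-style memory offload becomes viable at larger model sizes.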
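To give a sense of scale, here is a minimal sketch of per-parameter training memory under a common mixed-precision Adam setup: FP16 weights and gradients, plus an FP32 master copy and two FP32 moment estimates held by the optimizer. That breakdown is an assumption about the training recipe, and the function name and the 7B example are illustrative.

```python
def training_memory_gb(num_params, weight_bytes=2, grad_bytes=2,
                       master_bytes=4, moment_bytes=4):
    """Rough training-state footprint in decimal GB, excluding activations.

    Optimizer state = FP32 master weights + two Adam moments per parameter.
    """
    to_gb = lambda n_bytes: n_bytes / 1e9
    return {
        "weights": to_gb(num_params * weight_bytes),
        "gradients": to_gb(num_params * grad_bytes),
        "optimizer": to_gb(num_params * (master_bytes + 2 * moment_bytes)),
    }


footprint = training_memory_gb(7e9)  # hypothetical 7B-parameter model
for name, gb in footprint.items():
    print(f"{name:>10}: {gb:.0f} GB")
```

Under these assumptions the optimizer state is several times larger than the weights themselves, and it is only touched during the optimizer step, which makes it a natural candidate for the slower tier.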
Memory as a Service
Cloud operators can pool memory across tenants dynamically — a VM that needs 512 GB for 2 hours does not force buying a server with 512 GB permanently attached. Memory utilization across the fleet improves dramatically.
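The leasing model can be illustrated with a toy in-process allocator. This is a sketch of the bookkeeping only: the class and method names are invented for this example, and a real fabric manager would also handle address mapping, permissions, and failure cases.

```python
class MemoryPool:
    """Toy model of a shared memory pool that tenants lease capacity from."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.leases = {}  # tenant name -> leased GB

    @property
    def free_gb(self):
        return self.capacity_gb - sum(self.leases.values())

    @property
    def utilisation(self):
        return 1.0 - self.free_gb / self.capacity_gb

    def lease(self, tenant, size_gb):
        """Grant the lease if capacity allows; return True on success."""
        if size_gb <= 0:
            raise ValueError("size_gb must be positive")
        if tenant in self.leases:
            raise ValueError(f"{tenant} already holds a lease")
        if size_gb > self.free_gb:
            return False
        self.leases[tenant] = size_gb
        return True

    def release(self, tenant):
        """Return a tenant's capacity to the pool; returns the GB freed."""
        return self.leases.pop(tenant, 0)


pool = MemoryPool(capacity_gb=1024)
pool.lease("vm-a", 512)   # burst job borrows 512 GB
pool.lease("vm-b", 256)
print(f"utilisation while both run: {pool.utilisation:.0%}")
pool.release("vm-a")      # capacity returns to the pool when the job ends
print(f"free after vm-a finishes: {pool.free_gb} GB")
```

The point of the sketch is that capacity released by one tenant is immediately available to the next, rather than sitting idle inside a server that happens to have it installed.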
In-memory Databases & Graphs
Graph analytics, genomics, and financial simulation workloads often require holding entire datasets in memory for random access. Disaggregated memory allows provisioning multi-TB memory domains for burst jobs without permanent allocation.
Emerging use case: Speculative decoding for LLM inference uses a small draft model to generate token candidates that the large model then verifies. Disaggregated memory lets both models coexist in a shared memory pool on fewer physical nodes, improving hardware utilisation during speculative decode.
Key Engineering Challenges
Latency gap
The fundamental challenge is that CXL pool memory adds ~120–270 ns of latency compared to local DDR5. For latency-sensitive code paths (synchronous cache lookups, hot data structures), this can reduce throughput significantly. The question for every workload is whether data can be tiered such that hot paths stay local.
Software complexity
Getting meaningful benefit from disaggregated memory requires the software stack (OS, runtime, ML framework) to be aware of memory topology. Applications that simply malloc() from the default allocator are served from local memory first under Linux's default policy and spill to the pool only when local memory runs short, with no say over which data ends up on the slower tier. Deliberate tiering requires profiling data access patterns, implementing page migration policies, and tuning NUMA affinity, none of which is trivial in production systems.
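The core idea behind a tiering policy can be shown in a few lines. The sketch below places the most frequently accessed pages in local memory and the rest in the pool, then reports what share of accesses stays local. The hottest-first rule is a deliberately simple stand-in chosen for illustration, not how DAMON or the kernel's tiering code actually decides, and the access counts are made up.

```python
def place_pages(access_counts, local_capacity_pages):
    """Split pages into (local, pool) sets, hottest pages first.

    access_counts maps page ID -> number of observed accesses.
    Ties are broken by page ID so the result is deterministic.
    """
    ranked = sorted(access_counts, key=lambda page: (-access_counts[page], page))
    local = set(ranked[:local_capacity_pages])
    pool = set(ranked[local_capacity_pages:])
    return local, pool


def local_hit_fraction(access_counts, local_pages):
    """Share of all accesses that land on pages held in local memory."""
    total = sum(access_counts.values())
    if total == 0:
        return 0.0
    return sum(access_counts[page] for page in local_pages) / total


# Made-up profile: one very hot page and a long cold tail.
counts = {0: 900, 1: 50, 2: 30, 3: 15, 4: 5}
local, pool = place_pages(counts, local_capacity_pages=2)
print("local pages:", sorted(local), "pool pages:", sorted(pool))
print(f"accesses served locally: {local_hit_fraction(counts, local):.0%}")
```

In this skewed profile, keeping two of five pages local serves 95% of accesses. Real workloads are harder because access patterns shift over time, so the profile has to be refreshed and pages migrated, which is where most of the production complexity lives.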
Coherency at scale
CXL 2.0 lets multiple hosts draw on one pool, with each memory region assigned to a single host at a time; CXL 3.0 adds true sharing, where several hosts access the same region with hardware-managed coherency. Coherency becomes increasingly complex as the number of hosts sharing a region grows: keeping caches coherent across dozens of hosts accessing the same physical memory requires careful protocol design and adds latency overhead. The back-invalidation mechanism introduced in CXL 3.0 helps, but the ecosystem is still maturing.
Reliability and serviceability
A memory pool failure in a disaggregated architecture can affect multiple compute hosts simultaneously — unlike a local DIMM failure that affects only one server. This makes reliability, fault isolation, and hot-swap capabilities critical design requirements. DRAM error correction (DDR5's on-die ECC, chipkill), fabric redundancy, and graceful failover are non-trivial engineering challenges.
Bandwidth contention
When multiple compute hosts share a pool, they share the CXL switch bandwidth. Heavy memory traffic from one host can affect latency for others. Bandwidth allocation, QoS policies, and traffic shaping at the switch level are required — and the standards and tooling for this are still evolving.
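One common way to reason about sharing a bottleneck link is max-min fairness: hosts that ask for less than an equal share get what they ask for, and the leftover is split among the rest. The sketch below applies that general technique to a hypothetical switch. It is not a policy defined by the CXL specification, and the 64 GB/s capacity and per-host demands are invented for the example.

```python
def max_min_fair_share(demands, capacity):
    """Allocate capacity across hosts using max-min fairness.

    demands maps host name -> requested bandwidth. Hosts are satisfied in
    order of increasing demand; once a demand exceeds the equal share of
    what remains, all remaining hosts receive that equal share.
    """
    allocation = {}
    remaining = capacity
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        fair = remaining / len(pending)
        host, demand = pending[0]
        if demand <= fair:
            allocation[host] = demand
            remaining -= demand
            pending.pop(0)
        else:
            for other_host, _ in pending:
                allocation[other_host] = fair
            break
    return allocation


# Hypothetical demands in GB/s against a 64 GB/s switch port.
shares = max_min_fair_share({"host-a": 10, "host-b": 40, "host-c": 60}, capacity=64)
print(shares)
```

Here the light user keeps its full 10 GB/s and the two heavy users split the remaining 54 GB/s evenly, so a single noisy host cannot starve the others.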
| Challenge | Severity | Current mitigation | Outlook |
|---|---|---|---|
| CXL pool latency vs. local DDR5 | Medium-High | Data tiering; hot data stays local | Improving with CXL 3.1+ |
| Software stack immaturity | High | Linux 6.x NUMA improvements, DAMON | Active but multi-year work |
| Coherency complexity at scale | Medium | CXL 3.x back-invalidation | Protocol improving rapidly |
| Bandwidth contention (shared pool) | Medium | CXL switch QoS; workload isolation | Needs better tooling |
| Reliability / fault isolation | High | DDR5 on-die ECC; fabric redundancy | Production hardening ongoing |
| Standards ecosystem maturity | Medium | CXL Consortium, JEDEC CMM spec | 2025–2026 mainstream expected |
Vendor Landscape
The disaggregated memory ecosystem spans memory module makers, CXL switch ASIC vendors, CPU/GPU platform providers, and hyperscalers building their own solutions.
Memory Module & Controller Vendors
CXL Switch & Fabric Silicon
Platform & CPU Vendors
Hyperscalers & Systems Integrators
Where This Is Headed
The long arc: Disaggregated memory is the memory analog of what virtualization did to compute in the 2000s — it turns a fixed physical resource into a flexible, software-defined service. The technical path is harder (memory is lower latency and higher bandwidth than storage), but the economic logic is identical: better utilization, lower stranded capacity, and workload flexibility.
The Bottom Line
Disaggregated memory is the answer to a problem that will only grow more acute: AI models demand more memory than any single server can hold, and memory utilization in traditional architectures is chronically low. By breaking the fixed 1:1 bond between compute and memory, the industry gains the flexibility to scale memory and compute independently, share resources across workloads, and build systems whose economics improve as they grow.
The technology is no longer speculative. CXL is standardized, production CPUs support it, memory vendors are shipping modules, and hyperscalers are deploying. The challenges are real — latency overhead, software complexity, coherency at scale — but they are engineering challenges with tractable solutions, not fundamental physical limits.
The vendors positioned to win are those that solve the full stack: silicon (Samsung, Micron, SK hynix for memory; Astera Labs, XConn for switches), platform (Intel, AMD for CXL-native CPUs; NVIDIA for NVLink-based memory coherency), and software (MemVerge, Linux kernel, ML framework integrations). The hyperscalers (Meta, Google, AWS) will continue to co-design and pressure the ecosystem to move faster.
Disaggregated memory is infrastructure's next big shift — and the window to build expertise and ecosystem position in it is now.