Disaggregated Memory — The Architecture Reshaping AI Infrastructure

What Is Disaggregated Memory?

Disaggregated memory decouples memory resources from the compute nodes that use them, placing memory in its own separate pool accessible by many hosts over a high-speed interconnect, rather than fixing it permanently to a single server's motherboard.

In a traditional server, memory (DRAM, HBM) is tightly bound to a specific CPU or GPU socket. It is fast to access, but it is stranded: if a workload needs more memory than that one server holds, you're stuck. If a workload uses only 30% of the server's memory, the rest sits idle. You can't share it, lend it, or repurpose it without physically reconfiguring the system.

Disaggregated memory breaks this 1:1 binding. Memory is hosted in dedicated memory nodes — often called memory expanders or memory blades — connected to compute hosts via a fabric protocol like CXL (Compute Express Link), NVLink, or PCIe Gen5. Any host on the fabric can access any memory node, subject to permissions and bandwidth. Memory can be provisioned dynamically, shared across workloads, and scaled independently of compute.

Figure: Architecture comparison, traditional vs. disaggregated

Traditional architecture strands memory in per-server silos. Disaggregated memory pools it over a high-speed fabric, improving utilisation from ~35% to ~80% in typical deployments.

Key insight: Disaggregated memory does not make individual memory accesses faster — local HBM or DDR5 will always be lower latency than a networked pool. The value is in capacity, flexibility, utilisation, and total cost of ownership at scale.

Why This Matters Right Now

Three converging forces have pushed disaggregated memory from a research concept to a production priority between 2022 and 2025.

1. AI models have outgrown individual servers

GPT-3 (175B parameters) required roughly 350 GB of memory to hold the model weights in FP16. GPT-4 class models are estimated at 1+ trillion parameters — requiring multiple terabytes. No single server, even with 8× HBM3e-stacked GPUs, holds that. Inference serving must shard models across many nodes, and the interconnect between those nodes becomes the bottleneck. Disaggregated memory gives the cluster a larger, unified view of memory that individual workloads can tap without explicit sharding.
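The 350 GB figure follows directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic; the function name and the use of decimal gigabytes are choices made here for illustration, and the trillion-parameter model is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold model weights, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# GPT-3: 175 billion parameters stored as FP16 (2 bytes each).
gpt3_fp16_gb = weight_memory_gb(175e9, 2)

# A hypothetical 1-trillion-parameter model, also in FP16.
trillion_fp16_gb = weight_memory_gb(1e12, 2)

print(f"GPT-3 FP16 weights: {gpt3_fp16_gb:.0f} GB")
print(f"1T-parameter FP16 weights: {trillion_fp16_gb:.0f} GB")
```

This counts weights only. KV cache, activations, and runtime buffers come on top.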

2. Memory efficiency is the new compute efficiency

Studies consistently show that for large language model inference, memory bandwidth, not raw FLOPS, is the dominant constraint on throughput and latency. A GPU whose memory system is saturated while its compute units sit at 40% utilization is a poorly used investment. Disaggregated memory lets operators provision exactly the memory a workload needs, scale it independently of GPU count, and avoid the common pattern of over-buying compute to get more attached memory.
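One way to see why bandwidth dominates: when a dense model decodes a single sequence, each generated token requires reading every weight once, so memory bandwidth caps tokens per second. The sketch below is a back-of-the-envelope ceiling, not a benchmark. The 70B-parameter model and the 3.35 TB/s aggregate bandwidth are illustrative assumptions, and the estimate ignores KV cache reads, batching, and kernel overheads.

```python
def decode_tokens_per_sec_upper_bound(
    num_params: float, bytes_per_param: int, mem_bandwidth_bytes_per_sec: float
) -> float:
    """Ceiling on single-sequence decode rate when each token reads all weights once."""
    weight_bytes = num_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / weight_bytes


# Assumed example: 70B parameters in FP16, 3.35 TB/s of aggregate memory bandwidth.
bound = decode_tokens_per_sec_upper_bound(70e9, 2, 3.35e12)
print(f"Bandwidth-bound decode ceiling: {bound:.1f} tokens/s")
```

Batching raises throughput because one pass over the weights serves many sequences, which is why memory capacity for larger batches translates into tokens per second.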

3. CXL has made it practical

Earlier memory disaggregation attempts (Gen-Z, OpenCAPI, CCIX) struggled with latency penalties, limited ecosystem adoption, and immature tooling. CXL (Compute Express Link), built on the PCIe physical layer, standardized memory-semantic access over a high-speed fabric. CXL 1.1 (2019) enabled memory expansion. CXL 2.0 (2020) added memory pooling and switching. CXL 3.0 (2022) and 3.1 (2023) added peer-to-peer memory access and fabric topologies that make true memory disaggregation practical at data-center scale.

Figure: CXL protocol evolution, capability by version

CXL's rapid generational cadence (three major versions between 2019 and 2022) is a strong signal of industry urgency: hyperscalers and CPU vendors are all aligned on the protocol direction.

How Disaggregated Memory Works

At its core, disaggregated memory requires three components: memory devices (the actual DRAM), a transport fabric (the interconnect), and a management layer (software that maps virtual memory addresses to physical pool locations). Each layer adds latency relative to local memory, so the goal is to minimize that overhead while maximizing the flexibility gained.

Memory devices

Memory expander modules (sometimes called CMM — CXL Memory Modules) sit in dedicated memory drawers or blades. They commonly use DDR5 DRAM with a CXL controller ASIC that exposes the memory semantics over the CXL fabric. Some products incorporate HBM (High Bandwidth Memory) for bandwidth-intensive workloads. Emerging designs integrate compute-near-memory (CNM) elements — small processing units in the memory device that can pre-process data before sending it to the host.

The CXL fabric

CXL defines three sub-protocols: CXL.io (PCIe semantics, for device discovery and control), CXL.cache (a coherency protocol that lets devices cache host memory), and CXL.mem (the critical one for disaggregation: host-managed device memory accessible via load/store operations). For pooled memory, a CXL switch sits between compute hosts and memory endpoints, routing memory traffic and enabling one-to-many and many-to-many topologies.
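The CXL specification groups devices into three types according to which of these protocols they implement. The mapping below is a small illustrative model of that classification; the Python names (`Protocol`, `DEVICE_TYPES`, `exposes_host_addressable_memory`) are invented here and are not part of any CXL software API.

```python
from enum import Flag, auto


class Protocol(Flag):
    IO = auto()     # CXL.io: PCIe-style discovery, configuration, control
    CACHE = auto()  # CXL.cache: device coherently caches host memory
    MEM = auto()    # CXL.mem: host load/store access to device memory


# Device types and the protocols each one implements.
DEVICE_TYPES = {
    "Type 1": Protocol.IO | Protocol.CACHE,                 # caching accelerators
    "Type 2": Protocol.IO | Protocol.CACHE | Protocol.MEM,  # accelerators with own memory
    "Type 3": Protocol.IO | Protocol.MEM,                   # memory expanders, pooled memory
}


def exposes_host_addressable_memory(device_type: str) -> bool:
    """True if the host can load/store into the device's memory via CXL.mem."""
    return bool(DEVICE_TYPES[device_type] & Protocol.MEM)


for name in DEVICE_TYPES:
    print(name, "supports CXL.mem:", exposes_host_addressable_memory(name))
```

Memory expanders and pooled memory devices are Type 3: they need CXL.mem but have no reason to cache host memory, so they omit CXL.cache.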

Software and memory management

The operating system exposes disaggregated memory as a NUMA node: a memory region that applications can allocate from, though with higher access latency than local memory. Linux gained its first CXL driver code in kernel 5.12, with substantive improvements through 5.18 and the 6.x series. Kernel facilities such as DAMON (Data Access MONitor) and memory-tiering page promotion and demotion can automatically migrate hot data to fast local memory and cooler data to the disaggregated pool, and NVIDIA's Unified Memory performs a comparable migration between host and GPU memory. Allocator features such as PyTorch's expandable segments reduce fragmentation within a tier but do not move data between tiers.
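On Linux, NUMA topology is visible under `/sys/devices/system/node`, and expander or pool memory usually shows up as a node that has memory but no CPUs. The sketch below reads that directory to find such nodes. It assumes the standard sysfs layout (`nodeN/cpulist`, `nodeN/meminfo`); the function names are made up for this example, and a CPU-less node is only a hint, not proof, that the memory is CXL-attached.

```python
from pathlib import Path


def list_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict:
    """Return {node_id: {"cpus": str, "mem_total_kb": int or None}} read from sysfs.

    Returns an empty dict when the directory is missing (for example on non-Linux hosts).
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for node_dir in sorted(root.glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        cpulist_file = node_dir / "cpulist"
        cpus = cpulist_file.read_text().strip() if cpulist_file.exists() else ""
        mem_total_kb = None
        meminfo_file = node_dir / "meminfo"
        if meminfo_file.exists():
            for line in meminfo_file.read_text().splitlines():
                # Lines look like: "Node 1 MemTotal:       131072000 kB"
                if "MemTotal:" in line:
                    mem_total_kb = int(line.split()[-2])
                    break
        nodes[node_id] = {"cpus": cpus, "mem_total_kb": mem_total_kb}
    return nodes


def cpuless_memory_nodes(nodes: dict) -> list:
    """Node ids that report memory but an empty CPU list."""
    return sorted(
        node_id
        for node_id, info in nodes.items()
        if info["cpus"] == "" and info["mem_total_kb"]
    )


found = list_numa_nodes()
print("NUMA nodes:", found)
print("CPU-less memory nodes:", cpuless_memory_nodes(found))
```

Once a node id is known, tools such as `numactl` or a NUMA-aware allocator can bind or interleave allocations across it.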

"Disaggregated memory is most powerful not as a drop-in DRAM replacement, but as a tiered memory architecture where software intelligence keeps hot data close and cold data cheap."

Figure: Disaggregated memory system, full stack view

The critical design insight: CXL pool memory (~200–350 ns) is slower than local DDR5 (~80 ns), but dramatically faster than RDMA-based disaggregation (on the order of microseconds). For workloads that can tier their data, this is an acceptable tradeoff for massive capacity gains.
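The tradeoff can be made concrete with a weighted average: if a fraction of accesses hits local memory and the rest goes to the pool, the expected latency is the blend of the two. A minimal sketch, assuming the 80 ns local figure from the text and an arbitrary 300 ns point inside the quoted pool range. It ignores caches, prefetching, and queueing, all of which matter in practice.

```python
def average_access_latency_ns(
    local_hit_fraction: float, local_ns: float, pool_ns: float
) -> float:
    """Expected latency when some accesses hit local memory and the rest hit the pool."""
    if not 0.0 <= local_hit_fraction <= 1.0:
        raise ValueError("local_hit_fraction must be between 0 and 1")
    return local_hit_fraction * local_ns + (1.0 - local_hit_fraction) * pool_ns


LOCAL_DDR5_NS = 80.0  # figure quoted in the text
CXL_POOL_NS = 300.0   # assumed point inside the quoted 200-350 ns range

for fraction in (0.5, 0.9, 0.99):
    avg = average_access_latency_ns(fraction, LOCAL_DDR5_NS, CXL_POOL_NS)
    print(f"{fraction:.0%} local hits -> {avg:.1f} ns average")
```

With 90% of accesses served locally the average is about 102 ns, which is why the quality of the tiering policy matters more than the raw pool latency.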

Primary Use Cases

AI Inference

KV Cache Offloading

The KV cache for long-context LLM inference grows linearly with sequence length and batch size. Storing it in a disaggregated pool rather than GPU HBM allows larger batch sizes without adding GPU nodes — directly improving throughput and cost per token.
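The linear growth is easy to verify: the cache stores one key vector and one value vector per layer, per KV head, per token, per sequence. The sketch below computes that size. The model shape used (80 layers, 8 KV heads, head dimension 128, FP16) is a hypothetical configuration chosen for illustration, not a description of any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,
) -> int:
    """KV cache size in bytes: K and V, per layer, per KV head, per token, per sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape; FP16 cache entries (2 bytes each).
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(**shape, seq_len=1, batch_size=1)
long_context = kv_cache_bytes(**shape, seq_len=32_768, batch_size=16)

print(f"Per token, per sequence: {per_token / 1024:.0f} KiB")
print(f"32K context, batch 16:   {long_context / 2**30:.0f} GiB")
```

For this shape, a batch of 16 sequences at 32K tokens needs 160 GiB of cache, more than the HBM on a single current GPU, which is the pressure that makes offloading attractive.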

AI Training

Optimizer State & Activations

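Optimizer states (Adam keeps two moment estimates per parameter, so 2× the model size at equal precision, and more when the moments are held in FP32 alongside FP16 weights) and intermediate activations can be tiered to disaggregated memory, freeing HBM for weights and gradients. Activations are written during the forward pass and read back during the backward pass, while optimizer states are needed only at the update step. ZeRO-Infinity-style memory offload becomes viable at larger model sizes.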
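To get a feel for how much state is a candidate for offload, the sketch below uses the common accounting for mixed-precision Adam training: FP16 weights and gradients, plus FP32 master weights and two FP32 Adam moments, for 16 bytes per parameter. The 7B-parameter model is hypothetical, activations are excluded because they depend on batch size and checkpointing, and which components are actually offloaded is a framework decision, not something this sketch settles.

```python
def mixed_precision_adam_bytes(num_params: int) -> dict:
    """Per-component memory for mixed-precision Adam training, excluding activations."""
    return {
        "fp16_weights": 2 * num_params,
        "fp16_gradients": 2 * num_params,
        "fp32_master_weights": 4 * num_params,
        "fp32_adam_first_moment": 4 * num_params,
        "fp32_adam_second_moment": 4 * num_params,
    }


# Components touched only at the optimizer step: natural candidates for a slower tier.
OFFLOADABLE = (
    "fp32_master_weights",
    "fp32_adam_first_moment",
    "fp32_adam_second_moment",
)

num_params = 7_000_000_000  # hypothetical 7B-parameter model
breakdown = mixed_precision_adam_bytes(num_params)
total_gb = sum(breakdown.values()) / 1e9
offloadable_gb = sum(breakdown[name] for name in OFFLOADABLE) / 1e9

print(f"Total training state:        {total_gb:.0f} GB")
print(f"Offloadable optimizer state: {offloadable_gb:.0f} GB")
```

Under these assumptions three quarters of the training state is read and written once per step, which is the access pattern a higher-latency tier tolerates best.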

Multi-tenant Cloud

Memory as a Service

Cloud operators can pool memory across tenants dynamically — a VM that needs 512 GB for 2 hours does not force buying a server with 512 GB permanently attached. Memory utilization across the fleet improves dramatically.
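The utilization argument comes down to sizing for the sum of individual peaks versus sizing for the peak of the combined demand. The sketch below compares the two on synthetic demand traces. The host names and numbers are invented, and the model is deliberately simplified: it treats all memory as poolable and leaves out headroom, whereas real hosts keep some local DRAM and operators reserve spare capacity.

```python
def compare_provisioning(demand_gb_by_host: dict) -> dict:
    """Compare per-server (siloed) and pooled provisioning for the same demand traces."""
    traces = list(demand_gb_by_host.values())
    if not traces:
        raise ValueError("at least one host trace is required")
    steps = len(traces[0])
    if steps == 0 or any(len(trace) != steps for trace in traces):
        raise ValueError("all traces must be non-empty and the same length")

    total_per_step = [sum(trace[i] for trace in traces) for i in range(steps)]
    mean_demand = sum(total_per_step) / steps

    # Siloed: every server must be sized for its own peak.
    siloed_capacity = sum(max(trace) for trace in traces)
    # Pooled: the pool must be sized for the peak of the combined demand.
    pooled_capacity = max(total_per_step)
    if pooled_capacity == 0:
        raise ValueError("demand is zero everywhere; utilization is undefined")

    return {
        "siloed_capacity_gb": siloed_capacity,
        "pooled_capacity_gb": pooled_capacity,
        "siloed_utilization": mean_demand / siloed_capacity,
        "pooled_utilization": mean_demand / pooled_capacity,
    }


# Synthetic demand traces (GB per time step) for four hosts with staggered peaks.
demand = {
    "host-a": [512, 128, 64, 64],
    "host-b": [64, 256, 128, 64],
    "host-c": [128, 64, 384, 64],
    "host-d": [64, 64, 64, 256],
}
result = compare_provisioning(demand)
print(f"Siloed: {result['siloed_capacity_gb']} GB at {result['siloed_utilization']:.0%} utilization")
print(f"Pooled: {result['pooled_capacity_gb']} GB at {result['pooled_utilization']:.0%} utilization")
```

The gain depends entirely on peaks not coinciding. If every host peaks at the same moment, pooled and siloed capacity are identical.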

HPC / Scientific

In-memory Databases & Graphs

Graph analytics, genomics, and financial simulation workloads often require holding entire datasets in memory for random access. Disaggregated memory allows provisioning multi-TB memory domains for burst jobs without permanent allocation.

Emerging use case: Speculative decoding for LLM inference uses a small draft model to generate token candidates that the large model then verifies. Disaggregated memory lets both models coexist in a shared memory pool on fewer physical nodes, improving hardware utilisation during speculative decode.

Key Engineering Challenges

Latency gap

The fundamental challenge is that CXL pool memory adds ~120–270 ns of latency compared to local DDR5. For latency-sensitive code paths (synchronous cache lookups, hot data structures), this can reduce throughput significantly. The question for every workload is whether data can be tiered such that hot paths stay local.

Software complexity

Getting meaningful benefit from disaggregated memory requires the software stack (OS, runtime, ML framework) to be aware of memory topology. Applications that simply malloc() from the default allocator are served from local memory first under Linux's default NUMA policy and spill into the pool only when local memory runs short, with no regard for which data is hot. Tiering requires profiling data access patterns, implementing page migration policies, and tuning NUMA affinity, none of which is trivial in production systems.
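The core of a tiering policy is a placement decision driven by access frequency. The sketch below is a deliberately naive, one-shot version: rank pages by observed access count and keep as many of the hottest as fit in local memory. The page ids, counts, and function names are hypothetical, and production mechanisms sample accesses continuously and migrate pages incrementally instead of re-sorting everything.

```python
def place_pages(access_counts: dict, local_capacity_pages: int) -> tuple:
    """Split pages into (local, pooled) by keeping the most-accessed pages local."""
    if local_capacity_pages < 0:
        raise ValueError("local_capacity_pages must be non-negative")
    # Sort by descending access count; break ties by page id for determinism.
    ranked = sorted(access_counts, key=lambda page: (-access_counts[page], page))
    return set(ranked[:local_capacity_pages]), set(ranked[local_capacity_pages:])


def local_hit_fraction(access_counts: dict, local_pages: set) -> float:
    """Fraction of all accesses that land on pages placed in local memory."""
    total = sum(access_counts.values())
    if total == 0:
        return 0.0
    return sum(access_counts[page] for page in local_pages) / total


# Hypothetical profile: page id -> accesses observed in one sampling window.
profile = {0: 900, 1: 5, 2: 450, 3: 1, 4: 44}
local, pooled = place_pages(profile, local_capacity_pages=2)

print("local pages: ", sorted(local))
print("pooled pages:", sorted(pooled))
print(f"local hit fraction: {local_hit_fraction(profile, local):.3f}")
```

With a skewed access profile like this one, keeping two of five pages local captures over 96% of accesses. Uniform access patterns see far less benefit, which is why profiling comes first.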

Coherency at scale

CXL 2.0 pools memory by assigning each region of a device to one host at a time. CXL 3.0 adds true sharing, where multiple hosts access the same region with hardware coherency, but coherency becomes increasingly complex as the number of hosts sharing a pool grows. Ensuring cache coherency across dozens of hosts accessing the same physical memory requires careful protocol design and adds latency overhead. The back-invalidation mechanism introduced in CXL 3.0 helps, but the ecosystem is still maturing.

Reliability and serviceability

A memory pool failure in a disaggregated architecture can affect multiple compute hosts simultaneously — unlike a local DIMM failure that affects only one server. This makes reliability, fault isolation, and hot-swap capabilities critical design requirements. DRAM error correction (DDR5's on-die ECC, chipkill), fabric redundancy, and graceful failover are non-trivial engineering challenges.

Bandwidth contention

When multiple compute hosts share a pool, they share the CXL switch bandwidth. Heavy memory traffic from one host can affect latency for others. Bandwidth allocation, QoS policies, and traffic shaping at the switch level are required — and the standards and tooling for this are still evolving.
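One classic way to divide a shared link is max-min fairness: hosts that ask for less than an equal share get what they ask for, and the leftover is split among the rest. The sketch below implements that policy as an illustration of what switch-level bandwidth allocation has to decide. It is not a mechanism defined by the CXL specification, and the host names and GB/s figures are invented.

```python
def max_min_fair_share(demands: dict, capacity: float) -> dict:
    """Allocate capacity across hosts using max-min fairness."""
    if capacity < 0 or any(d < 0 for d in demands.values()):
        raise ValueError("capacity and demands must be non-negative")
    allocation = {host: 0.0 for host in demands}
    unsatisfied = dict(demands)
    remaining = float(capacity)

    while unsatisfied and remaining > 0:
        equal_share = remaining / len(unsatisfied)
        # Hosts whose demand fits within an equal share are fully satisfied.
        fits = {h: d for h, d in unsatisfied.items() if d <= equal_share}
        if not fits:
            # Everyone left wants more than an equal share: split what remains evenly.
            for host in unsatisfied:
                allocation[host] = equal_share
            return allocation
        for host, demand in fits.items():
            allocation[host] = float(demand)
            remaining -= demand
            del unsatisfied[host]
    return allocation


# Hypothetical demands in GB/s against a 64 GB/s shared switch link.
demands = {"host-a": 10.0, "host-b": 40.0, "host-c": 60.0}
shares = max_min_fair_share(demands, capacity=64.0)
for host, share in shares.items():
    print(f"{host}: wants {demands[host]:.0f} GB/s, gets {share:.0f} GB/s")
```

Here the light user is untouched while the two heavy users are capped equally. Weighted variants of the same idea would let operators give priority tenants a larger share.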

Challenge	Severity	Current mitigation	Outlook
CXL pool latency vs. local DDR5	Medium-High	Data tiering; hot data stays local	Improving with CXL 3.1+
Software stack immaturity	High	Linux 6.x NUMA improvements, DAMON	Active but multi-year work
Coherency complexity at scale	Medium	CXL 3.0 back-invalidation	Protocol improving rapidly
Bandwidth contention (shared pool)	Medium	CXL switch QoS; workload isolation	Needs better tooling
Reliability / fault isolation	High	DDR5 on-die ECC; fabric redundancy	Production hardening ongoing
Standards ecosystem maturity	Medium	CXL Consortium, JEDEC CMM spec	2025–2026 mainstream expected

Vendor Landscape

The disaggregated memory ecosystem spans memory module makers, CXL switch ASIC vendors, CPU/GPU platform providers, and hyperscalers building their own solutions.

Memory Module & Controller Vendors

Samsung

KRX: 005930

Memory

CMM-H (CXL hybrid DRAM + NAND), CMM-D (DDR5), CXL 2.0 modules

Leading DRAM vendor with the broadest CXL portfolio. CMM-H pairs DRAM with NAND flash behind a CXL controller to offer large capacity at lower cost per gigabyte. Sampling CXL 3.0 modules for AI datacenter deployments since 2024.

SK Hynix

KRX: 000660

Memory

AiMX (Processing-in-Memory), CXL DRAM modules

Introduced AiMX, an accelerator card built on GDDR6-AiM (Accelerator-in-Memory) chips that embed AI processing capability in the memory itself, reducing data movement. Also offers standard CXL DDR5 expander modules for capacity-focused deployments.

Micron Technology

NASDAQ: MU

Memory

CXL DRAM (Type 3), HBM3E, LPDDR5X

Shipping CXL 2.0 memory expansion modules with DDR5 backend. Focus on capacity tiers for cloud deployments. HBM3E roadmap targets 1.2 TB/s bandwidth for AI accelerator memory.

Montage Technology

SSE: 688008

Memory

MXC (Memory eXpansion Controller) CXL ASIC

Designs the CXL controller ASIC that sits between DDR5 DRAM and the CXL fabric, used by multiple memory module vendors. Strong position in the China market for CXL disaggregation deployments.

CXL Switch & Fabric Silicon

Astera Labs

NASDAQ: ALAB

Fabric Silicon

Aries PCIe/CXL Smart DSP Retimers, Leo CXL Smart Memory Controllers, Scorpio Smart Fabric Switches

Purpose-built connectivity silicon for memory disaggregation. Leo provides an intelligent memory controller for CXL memory expansion and pooling. Scorpio is the company's fabric switch family, aimed at rack-scale AI interconnect. Major design wins with Meta, Google, and AWS.

Microchip Technology

NASDAQ: MCHP

Fabric Silicon

Switchtec PCIe/CXL switches, PFX CXL fabric switches

Acquired Microsemi to strengthen PCIe/CXL switching portfolio. PFX series targets data-center CXL fabric deployments with low-latency switching between compute hosts and memory pools.

Xconn Technologies

Private

Startup

XC50256 CXL 2.0 switch (256 lanes), CXL fabric management software

CXL switch ASIC startup with one of the highest lane counts available in a single switch chip. Targeting hyperscale deployments. Closed Series B in 2024 backed by Samsung and SK Hynix, a strong signal of memory vendor conviction in the CXL switching market.

IntelliProp

Private

Startup

CXL fabric switch IP and ASIC designs

Semiconductor IP and custom ASIC design for CXL fabric switching. Working with system vendors to build disaggregated memory platforms. Focus on low-latency switch designs optimized for HPC workloads.

Platform & CPU Vendors

Intel

NASDAQ: INTC

CPU Platform

Sapphire Rapids (CXL 1.1), Granite Rapids (CXL 2.0), Gaudi AI accelerators with CXL

Intel co-invented CXL and shipped the first server CPUs with native CXL support (Sapphire Rapids, 2023). Granite Rapids (2024) adds CXL 2.0 support enabling memory pooling. Strong investment in CXL ecosystem development and reference designs.

AMD

NASDAQ: AMD

CPU Platform

EPYC Genoa (CXL 1.1), EPYC Turin (CXL 2.0), Instinct MI300X with CXL coherency

EPYC Genoa (2022) shipped with CXL 1.1 support. Turin (Zen 5, 2024) advances to CXL 2.0, enabling memory pooling. The MI300A APU integrates CPU and GPU in one package with unified HBM (the MI300X is the GPU-only variant), a complementary approach to disaggregation at the chip level.

NVIDIA

NASDAQ: NVDA

GPU Platform

NVLink Switch, Grace Hopper Superchip, Unified Memory, NVLink-C2C

NVIDIA's memory disaggregation strategy centers on NVLink (not CXL) for GPU-to-GPU memory access. The Grace Hopper Superchip integrates CPU and GPU over NVLink-C2C, achieving 900 GB/s of bandwidth. For disaggregated pools, NVIDIA supports CXL as a host memory expansion path on the Grace CPU side.

Arm / Ampere

NASDAQ: ARM

CPU Platform

Ampere Altra Max (CXL-aware), Arm SystemReady CXL profiles

Arm-based server CPUs are adopting CXL as a first-class memory fabric. Ampere Computing's cloud-native processors support CXL memory expansion, targeting the large cloud-native market where memory efficiency is a key economic lever.

Hyperscalers & Systems Integrators

Meta (Facebook)

NASDAQ: META

Hyperscaler

Grand Teton AI server, custom CXL memory blade designs

One of the most aggressive deployers of CXL disaggregated memory. Partnered with Astera Labs for CXL retimers and switches. Actively deploying CXL memory expansion in AI training clusters for Llama model development.

Google

NASDAQ: GOOGL

Hyperscaler

TPU v5 host memory expansion via CXL; custom memory fabric for GKE clusters

Google contributes heavily to Linux CXL drivers (upstream in kernel 5.18+) and is deploying CXL memory expansion in TPU host nodes. Their Titanium SmartNIC approach also influences how memory disaggregation integrates with networking.

Samsung SAIT / MemVerge

—

Software

MemVerge Memory Machine, CXL memory OS

MemVerge provides the software layer for CXL memory management — abstracting pool memory as a NUMA resource, enabling tiering policies, snapshotting memory state, and providing live migration for disaggregated memory workloads.

Enfabrica

Private

Startup

ACF (Accelerated Compute Fabric) chip, unified compute-memory-network fabric

Building a unified fabric chip that collapses memory disaggregation, networking, and storage into a single coherent fabric layer for AI clusters. Backed by Qualcomm Ventures, NVIDIA, and major hyperscalers. Represents the next architectural evolution beyond CXL alone.

Where This Is Headed

2023–2024 — EARLY DEPLOYMENT

CXL 1.1/2.0 memory expansion modules ship in volume. Intel Sapphire Rapids and AMD Genoa provide first-generation native CXL host support. Hyperscalers (Meta, Google) begin early production deployments for AI workloads. MemVerge and Astera Labs build software/silicon ecosystem.

2025–2026 — MAINSTREAM ADOPTION

CXL 3.0/3.1 switches enable rack-scale pooling with multiple compute hosts sharing TB-scale memory pools. Intel Granite Rapids and AMD Turin drive CXL 2.0 volume. Linux kernel memory tiering matures. AI frameworks (PyTorch, JAX) add native disaggregated memory support. First cloud instance types offering disaggregated memory as a service appear.

2027+ — COMPOSABLE INFRASTRUCTURE

Memory, compute, and storage become fully composable fabric resources. CXL 4.0+ enables global coherency across racks. Processing-in-memory (PIM) within disaggregated pools reduces data movement. AI clusters are dynamically reconfigured per workload, with memory pools growing and shrinking in minutes rather than requiring server provisioning. Near-memory compute (Enfabrica-style fabrics) blurs the compute/memory boundary further.

The long arc: Disaggregated memory is the memory analog of what virtualization did to compute in the 2000s. It turns a fixed physical resource into a flexible, software-defined service. The technical path is harder (memory operates at far lower latency and higher bandwidth than storage or networking, so any added overhead is proportionally more visible), but the economic logic is identical: better utilization, lower stranded capacity, and workload flexibility.

The Bottom Line

Disaggregated memory is the answer to a problem that will only grow more acute: AI models demand more memory than any single server can hold, and memory utilization in traditional architectures is chronically low. By breaking the fixed 1:1 bond between compute and memory, the industry gains the flexibility to scale memory and compute independently, share resources across workloads, and build systems whose economics improve as they grow.

The technology is no longer speculative. CXL is standardized, production CPUs support it, memory vendors are shipping modules, and hyperscalers are deploying. The challenges are real — latency overhead, software complexity, coherency at scale — but they are engineering challenges with tractable solutions, not fundamental physical limits.

The vendors positioned to win are those that solve the full stack: silicon (Samsung, Micron, SK Hynix for memory; Astera Labs, Xconn for switches), platform (Intel, AMD for CXL-native CPUs; NVIDIA for NVLink-based memory coherency), and software (MemVerge, Linux kernel, ML framework integrations). The hyperscalers (Meta, Google, AWS) will continue to co-design and pressure the ecosystem to move faster.

Disaggregated memory is infrastructure's next big shift — and the window to build expertise and ecosystem position in it is now.