The Memory Problem Behind Modern AI Inference
Modern AI inference is increasingly constrained not just by compute, but by memory. GPUs can perform enormous numbers of operations per second, but large language models require fast access to model weights, activations, and, most importantly for long-context inference, the KV cache.
The KV cache stores the attention keys and values generated from previous tokens. It allows the model to avoid recomputing the full history every time it produces a new token. This is essential for efficient inference. But the cache grows with sequence length, number of layers, number of KV heads, head dimension, batch size, and precision.
As context windows expand and many users are served concurrently, the KV cache can consume a massive amount of memory. This creates a practical infrastructure question:
Where should the KV cache live?
For a transformer using grouped-query attention, the practical KV cache estimate is as follows, where the leading 2 counts keys and values:
KV cache per sequence =
2 x layers x KV_heads x head_dim x seq_len x bytes_per_value
For a Llama-3-class 70B model with 80 layers, 8 KV heads, a head dimension of 128, a 128K-token context, and FP16 KV storage, that is 40 GiB (about 43 GB) per active sequence. At 32 concurrent users, each holding a full-length context, the KV cache alone is 1.25 TiB, before model weights, runtime buffers, fragmentation, and scheduling headroom.
The simpler full-attention shortcut, 2 x layers x hidden_dim x seq_len x bytes_per_value, produces a much larger estimate of 320 GiB (about 344 GB) per sequence with hidden_dim = 8192, because it assumes each of the 64 query heads has its own KV state. That is useful as a worst-case intuition, but GQA/MQA is exactly why modern LLMs avoid that full blow-up.
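The arithmetic behind both estimates can be checked with a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` is illustrative, and the full-attention case assumes hidden_dim 8192 split into 64 heads of 128.

```python
GIB = 1024**3
TIB = 1024**4

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """KV cache bytes for one sequence; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Figures from the text: 80 layers, 8 KV heads, head_dim 128, 128K context, FP16.
seq_len = 128 * 1024
gqa = kv_cache_bytes(80, 8, 128, seq_len, 2)

# Full-attention shortcut: hidden_dim 8192 treated as 64 KV heads x 128 (assumed).
mha = kv_cache_bytes(80, 64, 128, seq_len, 2)

print(f"GQA per sequence: {gqa / GIB:.0f} GiB ({gqa / 1e9:.1f} GB)")
print(f"32 sequences:     {32 * gqa / TIB:.2f} TiB")
print(f"Full attention:   {mha / GIB:.0f} GiB ({mha / 1e9:.1f} GB)")
```

The two results differ by exactly the ratio of query heads to KV heads, 64 / 8 = 8.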
The fastest place is GPU HBM. The cheapest and largest place is storage. But neither extreme is perfect. HBM is fast but expensive and limited. Storage is abundant but too slow for many real-time inference paths.
Where CXL Fits
CXL, or Compute Express Link, is an open interconnect standard that lets CPUs, accelerators, and memory devices communicate with low latency, using cache-coherent and memory semantics over PCIe-based physical links. In simpler terms, it can make additional memory appear closer to the compute system than traditional storage, while allowing more flexible memory expansion and pooling than conventional server DRAM.
For memory expansion, the relevant hardware category is usually a CXL Type 3 device. Type 3 devices expose device-attached memory to a host through CXL.mem, alongside CXL.io for discovery, configuration, and management. That is the specification domain behind most CXL memory expanders, persistent memory modules, and memory pooling appliances.
| Memory Tier | Bandwidth | Latency | Capacity per Server | Relative $/GB | Best Use Case |
|---|---|---|---|---|---|
| GPU HBM3e | ~3-5 TB/s per high-end GPU class | ~100 ns class, closest to GPU compute | 80-192 GB per GPU | Highest | Active weights, hottest KV cache, latency-critical decode path. |
| CXL 3.0 Memory | ~64-128 GB/s class with link aggregation | ~250-500 ns class depending on topology | 1-8 TB+ per server or appliance class | Middle | Warm KV cache, context spill, memory expansion, model-serving headroom. |
| DDR5 DRAM | ~200-400 GB/s aggregate server class | ~80-120 ns CPU-local class | 1-2 TB common, higher in dense servers | Baseline | CPU host memory, preprocessing, retrieval, orchestration, staging. |
| NVMe SSD | ~7-14 GB/s per high-end Gen4/Gen5 drive | ~10,000-50,000 ns through storage path | 30-60 TB+ per server | Lowest | Cold cache, checkpointing, persistence, archival context. |
For AI inference, CXL is attractive because it may create a new memory tier between CPU-local DRAM and storage:
GPU HBM -> CPU DRAM -> CXL memory -> NVMe / storage
The goal is to reduce pressure on HBM by moving some memory-heavy data, such as less immediately needed KV cache blocks, into a larger external memory pool that is still much faster and more memory-like than SSD storage.
[Figure: KV cache tiering path]
Alternatives to CXL
CXL is only one answer to the memory bottleneck. It competes with several other approaches, each with a different balance of speed, cost, capacity, maturity, and operational complexity.
1. Keeping Everything in GPU HBM
The simplest and highest-performance approach is to keep the model and KV cache inside GPU HBM.
model weights + active KV cache = GPU HBM
HBM offers extremely high bandwidth and low latency. For latency-sensitive workloads, this is ideal. There is no need to fetch KV blocks from slower external memory tiers, and system complexity is reduced.
The drawback is cost and capacity. GPU HBM is expensive and limited. If a workload mainly needs more memory capacity rather than more compute, buying more GPUs just to get more HBM can be inefficient. A cluster may end up with underutilized compute simply because memory is the bottleneck.
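A rough capacity check shows how quickly HBM runs out. The sketch below rests on stated assumptions, not measurements: 80 GB of HBM per GPU, about 140 GB of FP16 weights for a 70B model, about 42.9 GB of KV cache per 128K-token sequence, and a hypothetical 10% reserve for runtime buffers and fragmentation.

```python
def sequences_that_fit(num_gpus, hbm_gb_per_gpu, weights_gb, kv_gb_per_seq,
                       reserve_frac=0.1):
    """Whole sequences whose KV cache fits in HBM after weights and a reserve."""
    usable = num_gpus * hbm_gb_per_gpu * (1 - reserve_frac)
    free = usable - weights_gb
    return max(0, int(free // kv_gb_per_seq))

# Assumed inputs: 80 GB per GPU, 140 GB of weights, 42.9 GB of KV per sequence.
for gpus in (2, 4, 8):
    fit = sequences_that_fit(gpus, 80, 140, 42.9)
    print(f"{gpus} GPUs -> {fit} full-length sequences")
```

Under these assumptions an 8-GPU server holds about ten full-length sequences, so serving more of them in HBM alone means adding GPUs for their memory rather than their compute.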
2. CPU DRAM Offload
A common alternative is to offload KV cache from GPU memory into regular server DRAM.
hot KV cache = GPU HBM
warm KV cache = CPU DRAM
CPU DRAM is far cheaper and more abundant than GPU HBM. It is also already present in conventional servers, which makes this approach practical and accessible.
The problem is bandwidth and latency. Moving data between GPU and CPU memory usually involves PCIe or a similar interconnect. A PCIe Gen5 x16 link tops out around 64 GB/s per direction, more than an order of magnitude below the TB/s-class bandwidth of local HBM. If the inference runtime frequently needs KV blocks that have been offloaded to CPU memory, performance can degrade.
3. NVMe and SSD Offload
Another alternative is to spill KV cache to fast SSDs. This offers much greater capacity and lower cost per gigabyte.
hot KV cache = GPU HBM
warm KV cache = CPU DRAM or CXL memory
cold KV cache = NVMe SSD
NVMe storage is dramatically cheaper per gigabyte than HBM or DRAM. It can support very large context windows and large numbers of stored sessions. But latency is the key limitation. SSDs are fast compared with older storage, but they are slow compared with memory.
Therefore, NVMe is better thought of as a cold tier. It may be useful for very long-running sessions, archival context, checkpointing, or workloads where latency is less strict. It is usually not the preferred place for active KV cache in high-performance inference.
4. CXL Memory Expansion and Pooling
CXL attempts to occupy the middle ground. It is not as fast as GPU HBM, but it can be much larger and cheaper. It is not as cheap as SSD storage, but it is far faster and more memory-like.
GPU HBM = hottest KV blocks and model-critical data
CXL memory = large shared pool for warm KV cache
NVMe = cold spillover and persistence
The main promise is better utilization. Instead of buying more GPUs just to obtain more HBM, an operator can attach or pool additional memory. This can help serve longer contexts, more concurrent users, or agentic workflows that maintain larger working histories.
CXL also matters because it is a standard. Proprietary memory fabrics can be powerful, but standards tend to encourage broader ecosystems: memory modules, switches, CPUs, accelerators, servers, and software stacks that interoperate.
However, CXL is not magic. It still has latency, bandwidth, software, topology, and adoption constraints. Performance depends heavily on how the inference runtime decides what data should live in HBM versus CXL memory. Poor placement can erase the benefits. Good placement can make CXL a useful pressure valve for memory-bound inference.
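To make the placement question concrete, here is a toy block manager that keeps the most recently used KV blocks in the fastest tier and demotes the least recently used ones downward. It is a sketch of one possible policy (plain LRU with promote-on-access), not how any particular inference runtime works. The class name `TieredKVCache`, the tier names, and the capacities are all hypothetical.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy LRU placement across ordered tiers, fastest tier first."""

    def __init__(self, capacities):
        # capacities: mapping of tier name -> max blocks, in fastest-first order.
        self.capacities = dict(capacities)
        self.order = list(self.capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, level, block_id, data):
        name = self.order[level]
        tier = self.tiers[name]
        tier[block_id] = data
        tier.move_to_end(block_id)
        if len(tier) > self.capacities[name]:
            victim, victim_data = tier.popitem(last=False)  # least recently used
            if level + 1 < len(self.order):
                self._insert(level + 1, victim, victim_data)  # demote one tier
            # A block evicted from the last tier is dropped (recompute on miss).

    def put(self, block_id, data):
        for tier in self.tiers.values():
            tier.pop(block_id, None)  # avoid duplicates across tiers
        self._insert(0, block_id, data)

    def get(self, block_id):
        """Return (data, tier it was found in) and promote it to the top tier."""
        for name in self.order:
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self._insert(0, block_id, data)
                return data, name
        return None, None

    def where(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None

cache = TieredKVCache({"hbm": 2, "cxl": 2, "nvme": 4})
for i in range(5):
    cache.put(i, f"kv-block-{i}")
data, found_in = cache.get(0)  # oldest block has been demoted twice
print(found_in, "->", cache.where(0))
```

A production policy would likely weigh recency against the cost of refetching and prefetch blocks ahead of decode, but even this toy version shows that every access to a demoted block triggers a cascade of moves between tiers.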
5. Distributed KV Cache Across GPUs or Nodes
Another strategy is to distribute KV cache across multiple GPUs or servers.
GPU 1 stores part of the KV cache
GPU 2 stores part of the KV cache
GPU 3 stores part of the KV cache
This can work well in large-scale systems, especially when paired with high-speed interconnects such as NVLink, NVSwitch, InfiniBand, or high-performance Ethernet with RDMA.
The upside is that the system uses memory already present across the cluster. The downside is complexity. Distributed cache management, network traffic, synchronization, and failure handling all become harder. Performance also depends on the quality of the interconnect.
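One simple way to decide which GPU holds which part of the cache is to hash each fixed-size KV block to an owner. The sketch below is illustrative only. The helper names, the 16-token block size, and the use of SHA-256 are assumptions, and real systems often shard by attention head or layer instead.

```python
import hashlib

def owner_gpu(sequence_id: str, block_index: int, num_gpus: int) -> int:
    """Stable owner for a KV block (hashlib, since built-in hash() varies per run)."""
    key = f"{sequence_id}:{block_index}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_gpus

def placement(sequence_id, num_tokens, block_tokens, num_gpus):
    """Map every block of a sequence to the GPU that stores it."""
    num_blocks = -(-num_tokens // block_tokens)  # ceiling division
    shards = {gpu: [] for gpu in range(num_gpus)}
    for block in range(num_blocks):
        shards[owner_gpu(sequence_id, block, num_gpus)].append(block)
    return shards

shards = placement("session-a", num_tokens=1000, block_tokens=16, num_gpus=4)
for gpu, blocks in shards.items():
    print(f"GPU {gpu}: {len(blocks)} blocks")
```

Modulo hashing keeps lookups trivial, but it remaps most blocks whenever the GPU count changes, which is one example of the failure-handling complexity described above.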
6. KV Cache Compression
Instead of expanding memory, another path is to reduce how much memory the KV cache consumes.
FP16 KV cache -> INT8 KV cache -> INT4 KV cache
Lower precision reduces the number of bytes per cached key/value entry. Other methods selectively evict less important tokens, keep only recent context, compress older context, or use approximate attention techniques.
The advantage is obvious: if the KV cache becomes smaller, the system needs less HBM, less DRAM, less CXL memory, and less storage. The tradeoff is quality and complexity. More aggressive compression may reduce model accuracy, degrade recall over long contexts, or introduce edge-case behavior.
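The size-versus-error tradeoff can be illustrated with a minimal symmetric INT8 quantizer in plain Python. This is a sketch of one common scheme (one scale per vector), not the method of any specific runtime. The function names and the 2-byte FP16 scale overhead are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-vector INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def bytes_per_vector(head_dim, bits, scale_bytes=2):
    """Storage for one key or value vector, plus one scale when quantized."""
    payload = head_dim * bits // 8
    return payload + (scale_bytes if bits < 16 else 0)

vec = [0.02 * i - 1.0 for i in range(128)]  # synthetic 128-dim vector
codes, scale = quantize_int8(vec)
err = max(abs(a - b) for a, b in zip(vec, dequantize(codes, scale)))
print(f"max abs error {err:.4f} with scale {scale:.4f}")

for bits in (16, 8, 4):
    print(f"{bits:2d} bits -> {bytes_per_vector(128, bits)} bytes per vector")
```

The rounding error is bounded by half the scale, so vectors with large outliers lose more precision. That is one reason aggressive KV quantization can hurt quality.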
7. Model Architecture Changes
Another way to reduce KV pressure is to change the model itself. Standard transformer attention creates large KV caches. Newer or modified architectures can reduce this burden.
Multi-Query Attention
Grouped-Query Attention
sliding-window attention
sparse attention
state-space models
linear attention variants
hybrid architectures
These architectural changes are powerful because they attack the root cause. But they also require model-level choices. Infrastructure providers cannot always assume customers will use a specific architecture. Many high-value workloads will continue to run transformer models with large KV caches, especially when long-context accuracy matters.
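The effect of these choices on cache size is easy to quantify. The sketch below reuses the 80-layer, 128-dimension, 128K-token example with FP16 storage. The 4K sliding window is a hypothetical setting chosen only for illustration.

```python
def kv_bytes(layers, kv_heads, head_dim, cached_tokens, bytes_per_value=2):
    """KV cache bytes for one sequence; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * head_dim * cached_tokens * bytes_per_value

layers, head_dim, seq_len = 80, 128, 128 * 1024
window = 4096  # hypothetical sliding-window size

variants = {
    "MHA (64 KV heads)": kv_bytes(layers, 64, head_dim, seq_len),
    "GQA (8 KV heads)": kv_bytes(layers, 8, head_dim, seq_len),
    "MQA (1 KV head)": kv_bytes(layers, 1, head_dim, seq_len),
    # Sliding window caps cached tokens regardless of total context length.
    "GQA + 4K sliding window": kv_bytes(layers, 8, head_dim, min(seq_len, window)),
}
for name, size in variants.items():
    print(f"{name:26s} {size / 1024**3:7.2f} GiB")
```

Head sharing shrinks the cache by a constant factor, while a sliding window makes it independent of context length, at the cost of discarding older tokens.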
8. Photonic and Optical Memory Interconnects
A more future-facing alternative is photonic memory connectivity. Photonics uses light rather than electrical signaling for data movement. In AI infrastructure, the most relevant near-term use is not necessarily all-optical memory, but optical interconnects that move data between compute and memory with higher bandwidth and potentially better power efficiency.
The motivation is clear: as AI clusters scale, moving data becomes one of the dominant constraints. Electrical links face limits in bandwidth, reach, heat, and energy efficiency. Optical links may help connect memory pools, accelerators, and racks more efficiently.
For now, photonic memory systems are less mature than GPU HBM, CPU DRAM offload, NVMe tiering, or CXL memory expansion. But the direction is important. If AI workloads continue to become more memory-bound, optical interconnects may become central to next-generation memory architectures.
The Software Bottleneck: NUMA, Pages, and Jitter
CXL memory expansion is not only a hardware story. It becomes useful for inference only when the software stack can place memory intentionally. If a CXL pool is exposed to Linux as a slower NUMA node and the inference runtime treats it as ordinary anonymous memory, the kernel may migrate pages reactively, compact huge pages, or fault data into the wrong tier at exactly the wrong time.
That matters because token generation is latency-sensitive. A background page migration, a reactive page fault, or a poorly timed huge-page operation can create tail-latency spikes that are much more damaging than a modest average-latency penalty. For high-concurrency serving, jitter can be as important as raw bandwidth.
bad path:
anonymous allocation -> page fault -> kernel migrates into slow tier -> decode jitter
better path:
runtime block manager -> explicit HBM/CXL placement -> predictable token latency
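A first step toward explicit placement is knowing which NUMA nodes are memory-only, since CXL-attached memory is commonly exposed to Linux as a NUMA node with no CPUs. The sketch below reads the standard sysfs files under `/sys/devices/system/node`. Treating every CPU-less node as CXL is an assumption to verify on each platform, and the function names are illustrative.

```python
import re
from pathlib import Path

def list_numa_nodes(root="/sys/devices/system/node"):
    """Return [{'node', 'cpus', 'mem_total_kb'}] parsed from Linux sysfs."""
    base = Path(root)
    if not base.is_dir():
        return []  # not Linux, or sysfs not mounted
    nodes = []
    for node_dir in base.glob("node[0-9]*"):
        cpus = (node_dir / "cpulist").read_text().strip()
        meminfo = (node_dir / "meminfo").read_text()
        match = re.search(r"MemTotal:\s+(\d+)\s*kB", meminfo)
        nodes.append({
            "node": int(node_dir.name[len("node"):]),
            "cpus": cpus,
            "mem_total_kb": int(match.group(1)) if match else 0,
        })
    return sorted(nodes, key=lambda n: n["node"])

def cpuless_memory_nodes(nodes):
    """Nodes with memory but no CPUs: candidate slow tiers (assumption)."""
    return [n["node"] for n in nodes if n["cpus"] == "" and n["mem_total_kb"] > 0]

nodes = list_numa_nodes()
print("memory-only NUMA nodes:", cpuless_memory_nodes(nodes))
```

With the node IDs known, a runtime can pin its hot allocations to CPU-local nodes and direct only its warm block pool at the memory-only node, for example by launching under `numactl --membind`, rather than leaving the choice to reactive kernel migration.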
Vendor Reality Check
The CXL ecosystem is no longer purely theoretical, but it is also not yet as mature as PCIe SSDs or conventional DDR memory. The market is in the early deployment phase: modules, controllers, platform validation, and software stacks are converging, with adoption led by hyperscale, cloud, HPC, in-memory database, and AI inference use cases.
| Company | Representative CXL Offering | Status / Signal | Relevant Use Case |
|---|---|---|---|
| Samsung | CMM-D and CMM-H memory modules | CMM-D sampling; CMM-H positioned as a DRAM + NAND CXL Type 3 device for tiered memory and persistence. | Memory expansion, tiered memory, AI inference, analytics, in-memory databases. |
| Micron | CZ120 CXL memory expansion module | Publicly positioned for Supermicro Petascale platforms and CXL 2.0 adoption. | Server memory expansion for AI, HPC, and memory-intensive platforms. |
| Astera Labs | Leo CXL Smart Memory Controllers | Controller and add-in-card platform for memory expansion, sharing, and pooling; public cloud validation signals emerging platform adoption. | CXL controller layer, pooling, sharing, diagnostics, cloud server memory scaling. |
| Marvell | Structera CXL controllers | Interoperability-focused memory expansion controller platform aimed at high-capacity server memory. | CXL controller silicon, memory expansion, near-memory compute direction. |
The Tradeoff Map
The memory hierarchy for AI inference can be understood as a tradeoff between speed, capacity, cost, and complexity.
| Memory Tier | Typical Bandwidth | Approximate Latency Class | Engineering Interpretation |
|---|---|---|---|
| GPU HBM3e | Several TB/s per GPU, roughly 1 TB/s or more per HBM stack | Lowest practical tier for GPU-resident data | Best for model-critical state and hottest KV blocks. |
| Local CPU DDR5 | ~40-60 GB/s per channel, platform aggregate in the hundreds of GB/s | ~tens to low hundreds of ns CPU-local access class | Cheap and mature, but crossing into GPU use adds interconnect cost. |
| CXL 2.0 / 3.0 Type 3 | ~32-64 GB/s per direction on a PCIe Gen5 x8 to x16 link (CXL 2.0); CXL 3.0 moves to the PCIe Gen6 physical layer; higher with aggregation | Often modeled as a slower NUMA memory tier, roughly hundreds of ns depending on topology | A warm tier: much slower than HBM, much more memory-like than SSD. |
| NVMe Gen5 SSD | ~10-14 GB/s sequential class per high-end drive | Microseconds to tens of microseconds through the block/storage path | Excellent cold capacity tier, dangerous for active token-path KV reuse. |
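Because each decoded token attends over the whole cached history, bandwidth sets a floor on per-token time for any tier that holds the KV cache. The sketch below divides one 40 GiB sequence by a single representative bandwidth per tier. The specific values are illustrative picks from the ranges in the table, not measurements, and the calculation ignores latency, overlap, and batching.

```python
def min_read_ms(num_bytes, bandwidth_gb_s):
    """Lower bound (ms) to stream num_bytes once at a sustained bandwidth."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3

kv_bytes = 40 * 1024**3  # one 128K-token sequence from the GQA estimate

# Representative bandwidths in GB/s, chosen from the table's ranges (assumed).
tiers = {"GPU HBM3e": 4000, "CPU DDR5": 300, "CXL x16": 64, "NVMe Gen5": 12}

for name, bandwidth in tiers.items():
    print(f"{name:10s} {min_read_ms(kv_bytes, bandwidth):9.1f} ms per full pass")
```

At roughly 11 ms for HBM versus several hundred milliseconds over a single CXL link, the numbers suggest why lower tiers suit warm blocks that are prefetched or rarely touched, not the cache read on every decode step. Data in CPU DRAM also has to cross the GPU interconnect, so its effective bandwidth from the GPU is lower than the DDR5 figure.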
When HBM Wins
Shorter contexts, strict latency targets, premium workloads, and deployments where cost is secondary to speed.
When CXL Becomes Interesting
Long context, high concurrency, large active sessions, and cases where buying more GPUs just for memory is inefficient.
When NVMe Helps
Cold cache, persistence, archival context, checkpointing, and workloads that can tolerate higher access latency.
When Software Wins
Compression, better attention patterns, and new model architectures can reduce the memory burden before hardware is added.
Conclusion
There is no single universal winner. The right architecture depends on workload. For a low-latency chatbot with short prompts, GPU HBM may be enough. For enterprise agents reading long documents while serving many users, HBM alone may become too expensive. For long-context workloads with moderate latency tolerance, DRAM or NVMe offload may be acceptable.
For high-concurrency enterprise inference, CXL is compelling because it offers a larger memory tier without falling all the way down to storage-like latency. It represents a serious attempt to introduce a new layer into the AI memory hierarchy: larger than HBM, faster than storage, more flexible than conventional server DRAM, and more standardized than proprietary memory fabrics.
The sharper stance is this: if context windows keep growing, if agent loops retain longer working state, and if enterprises expect many concurrent users with low latency, a middle memory tier becomes very hard to avoid. CXL is the most credible open-standard candidate for that tier in the near term.
The counter-case is equally clear. If model architectures reduce KV cache by an order of magnitude, if compression preserves quality at very low precision, or if GPU HBM capacity expands faster than inference demand, CXL remains important but more specialized. The market will be decided by the gap between model memory appetite and affordable HBM supply.
Sources and Notes
Performance figures are approximate engineering ranges, not guaranteed product specifications. They are included to establish order-of-magnitude boundaries between HBM, DRAM, CXL memory, and NVMe storage.
- Linux kernel CXL device type documentation for CXL device categories and protocol framing.
- Samsung CMM-H technical blog for CXL Type 3, DRAM + NAND tiered memory, and CXL.mem positioning.
- Micron CZ120 and Supermicro Petascale announcement for CXL memory expansion module status.
- Astera Labs Leo CXL Memory Connectivity Platform for CXL controller, expansion, sharing, and pooling positioning.
- Marvell Structera interoperability announcement for CXL controller and near-memory compute ecosystem status.
- Optimizing System Memory Bandwidth with Micron CXL Memory Expansion Modules on Intel Xeon 6 Processors for CXL Type 3 bandwidth and system-level evaluation context.