MAN\SH AI / Writings

AI Memory Systems · 16 min read

AI Infrastructure Essay

CXL and the Search for a New Memory Tier in AI Inference

As AI inference shifts from short prompts to long-context, high-concurrency, agentic workloads, the limiting factor is no longer just compute. Increasingly, the decisive constraint is memory: how much is available, how fast it can be accessed, and how intelligently it can be tiered.

Topic: CXL memory, KV cache, AI inference

The Memory Problem Behind Modern AI Inference

Modern AI inference is increasingly constrained not just by compute, but by memory. GPUs can perform enormous numbers of operations per second, but large language models require fast access to model weights, activations, and, most importantly for long-context inference, the KV cache.

The KV cache stores the attention keys and values generated from previous tokens. It allows the model to avoid recomputing the full history every time it produces a new token. This is essential for efficient inference. But the cache grows linearly with sequence length, number of layers, number of key/value heads, head dimension, batch size, and bytes per value (precision).

As context windows expand and many users are served concurrently, the KV cache can consume a massive amount of memory. This creates a practical infrastructure question:

Where should the KV cache live?

Do the math: why HBM alone breaks

For a transformer using grouped-query attention, the practical KV cache estimate is:

KV cache per sequence =
2 x layers x KV_heads x head_dim x seq_len x bytes_per_value

For a Llama-3-class 70B model with 80 layers, 8 KV heads, a head dimension of 128, a 128K-token context (131,072 tokens), and FP16 KV storage, that is 320 KiB per token, or roughly 40 GiB (about 43 GB) per active sequence. At 32 concurrent full-length sequences, the KV cache alone is about 1.25 TiB, before model weights, runtime buffers, fragmentation, and scheduling headroom.

The simpler full-attention shortcut, 2 x layers x hidden_dim x seq_len x bytes_per_value, produces a much larger estimate of ~320 GiB per sequence for a hidden dimension of 8,192. That is eight times the GQA figure, because the shortcut assumes each of the 64 query heads has its own KV state. It is useful as a worst-case intuition, but GQA/MQA is exactly why modern LLMs avoid that full blow-up.
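The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a sizing tool: the function name `kv_cache_bytes` is made up for illustration, "128K" is assumed to mean 131,072 tokens, and the 64-head full-attention case assumes a hidden dimension of 8,192 split into heads of 128.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int) -> int:
    """KV cache bytes for one sequence; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # assumption: "128K" means 131,072 tokens

# Grouped-query attention: 8 KV heads shared by the query heads.
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                     seq_len=SEQ_LEN, bytes_per_value=2)

# Full multi-head attention: one KV head per query head (assumed 8192 / 128 = 64).
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128,
                     seq_len=SEQ_LEN, bytes_per_value=2)

print(f"GQA, one sequence:  {gqa / GIB:.0f} GiB")
print(f"MHA, one sequence:  {mha / GIB:.0f} GiB")
print(f"GQA, 32 sequences:  {32 * gqa / GIB / 1024:.2f} TiB")
```

Counting 128,000 tokens and decimal gigabytes instead shifts the results by a few percent, which is why published estimates for the same model often differ slightly.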

The fastest place is GPU HBM. The cheapest and largest place is storage. But neither extreme is perfect. HBM is fast but expensive and limited. Storage is abundant but too slow for many real-time inference paths.

Where CXL Fits

CXL, or Compute Express Link, is an open interconnect standard that allows CPUs, accelerators, and memory devices to communicate with low latency and cache or memory semantics over PCIe-based physical links. In simpler terms, it can make additional memory appear closer to the compute system than traditional storage, while allowing more flexible memory expansion and pooling than conventional server DRAM.

For memory expansion, the relevant hardware category is usually a CXL Type 3 device. Type 3 devices expose device-attached memory to a host through CXL.mem, alongside CXL.io for discovery, configuration, and management. That is the specification domain behind most CXL memory expanders, persistent memory modules, and memory pooling appliances.

CXL sits in the missing middle: materially cheaper and larger than HBM, much lower latency and more memory-like than NVMe. Values are approximate, platform-dependent ranges meant for tier comparison rather than procurement.

| Memory Tier | Bandwidth | Latency | Capacity per Server | Relative $/GB | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| GPU HBM3e | ~3-5 TB/s per high-end GPU class | ~100 ns class, closest to GPU compute | 80-192 GB per GPU | Highest | Active weights, hottest KV cache, latency-critical decode path. |
| CXL 3.0 Memory | ~64-128 GB/s class with link aggregation | ~250-500 ns class depending on topology | 1-8 TB+ per server or appliance class | Middle | Warm KV cache, context spill, memory expansion, model-serving headroom. |
| DDR5 DRAM | ~200-400 GB/s aggregate server class | ~80-120 ns CPU-local class | 1-2 TB common, higher in dense servers | Baseline | CPU host memory, preprocessing, retrieval, orchestration, staging. |
| NVMe SSD | ~7-14 GB/s per high-end Gen4/Gen5 drive | ~10,000-50,000 ns (10-50 µs) through storage path | 30-60 TB+ per server | Lowest | Cold cache, checkpointing, persistence, archival context. |

For AI inference, CXL is attractive because it may create a new memory tier:

GPU HBM  ->  CPU DRAM  ->  CXL memory  ->  NVMe / storage

The key idea: CXL is not the premium tier for the hottest data. GPU HBM remains that tier. CXL is a candidate for warm, large, memory-like capacity, especially for KV cache blocks that do not need to occupy scarce HBM at every moment.

The goal is to reduce pressure on HBM by moving some memory-heavy data, such as less immediately needed KV cache blocks, into a larger external memory pool that is still much faster and more memory-like than SSD storage.
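To make "much faster than SSD" concrete, the sketch below estimates how long one KV block takes to fetch from each tier using a simple latency-plus-transfer model. Every number is an assumption: the tier figures are illustrative values picked from within the approximate ranges in this essay, the 16-token block size is hypothetical, and the model ignores queuing, contention, and the extra hop from host-attached memory to the GPU.

```python
# Illustrative values only (assumptions, not measurements).
TIERS = {
    "GPU HBM":    {"latency_s": 100e-9, "bandwidth_Bps": 4e12},
    "CXL memory": {"latency_s": 400e-9, "bandwidth_Bps": 64e9},
    "NVMe SSD":   {"latency_s": 20e-6,  "bandwidth_Bps": 10e9},
}


def fetch_time_s(size_bytes: int, latency_s: float, bandwidth_Bps: float) -> float:
    """First-order estimate: fixed access latency plus size / bandwidth."""
    return latency_s + size_bytes / bandwidth_Bps


BLOCK_TOKENS = 16                        # hypothetical KV block size in tokens
BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2   # 70B-class GQA model, FP16 KV: 320 KiB
block_bytes = BLOCK_TOKENS * BYTES_PER_TOKEN

estimates = {name: fetch_time_s(block_bytes, **tier) for name, tier in TIERS.items()}
for name, seconds in estimates.items():
    print(f"{name:<11} {seconds * 1e6:8.1f} microseconds per block")
```

Under these assumptions a CXL fetch lands in the tens of microseconds and an NVMe fetch in the hundreds. That gap is the argument for a warm tier between HBM and storage.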

KV Cache Tiering Path

Figure: KV cache movement across memory tiers. Tokens (prompt and decode) flow into the GPU. Model weights and the hottest KV blocks stay in GPU HBM for lowest latency, warm blocks spill to a CXL Type 3 (CXL.mem) expander for capacity relief, and cold blocks fall back to NVMe SSD. The runtime pulls reused KV blocks back toward HBM.

Alternatives to CXL

CXL is only one answer to the memory bottleneck. It competes with several other approaches, each with a different balance of speed, cost, capacity, maturity, and operational complexity.

1. Keeping Everything in GPU HBM

The simplest and highest-performance approach is to keep the model and KV cache inside GPU HBM.

model weights + active KV cache = GPU HBM

HBM offers extremely high bandwidth and low latency. For latency-sensitive workloads, this is ideal. There is no need to fetch KV blocks from slower external memory tiers, and system complexity is reduced.

The drawback is cost and capacity. GPU HBM is expensive and limited. If a workload mainly needs more memory capacity rather than more compute, buying more GPUs just to get more HBM can be inefficient. A cluster may end up with underutilized compute simply because memory is the bottleneck.
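A rough count shows how capacity alone can drive GPU purchases. The sketch below assumes FP16 weights for a 70B-parameter model (about 140 GB), 32 full-length 128K-token sequences with the GQA layout used earlier, and no headroom for buffers or fragmentation. The helper name `gpus_for_capacity` is hypothetical.

```python
import math

GB = 10 ** 9

weights_bytes = 70 * 10 ** 9 * 2                    # assumption: 70B params in FP16
kv_per_sequence = 2 * 80 * 8 * 128 * 131_072 * 2    # GQA, FP16 KV, 131,072 tokens
total_bytes = weights_bytes + 32 * kv_per_sequence


def gpus_for_capacity(required_bytes: int, hbm_bytes_per_gpu: int) -> int:
    """GPUs needed just to hold the bytes, ignoring compute requirements."""
    return math.ceil(required_bytes / hbm_bytes_per_gpu)


for hbm_gb in (80, 192):
    count = gpus_for_capacity(total_bytes, hbm_gb * GB)
    print(f"{hbm_gb} GB HBM per GPU -> at least {count} GPUs for memory alone")
```

Whether that many GPUs are also needed for compute depends on the workload. When they are not, the extra devices are bought mainly as memory.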

2. CPU DRAM Offload

A common alternative is to offload KV cache from GPU memory into regular server DRAM.

hot KV cache  = GPU HBM
warm KV cache = CPU DRAM

CPU DRAM is far cheaper and more abundant than GPU HBM. It is also already present in conventional servers, which makes this approach practical and accessible.

The problem is bandwidth and latency. Moving data between GPU and CPU memory usually involves PCIe or a similar interconnect. A PCIe Gen5 x16 link delivers on the order of 64 GB/s per direction, against several TB/s for local HBM. If the inference runtime frequently needs KV blocks that have been offloaded to CPU memory, decode stalls on those transfers and both throughput and latency degrade.

3. NVMe and SSD Offload

Another alternative is to spill KV cache to fast SSDs. This offers much greater capacity and lower cost per gigabyte.

hot KV cache  = GPU HBM
warm KV cache = CPU DRAM or CXL memory
cold KV cache = NVMe SSD

NVMe storage is dramatically cheaper per gigabyte than HBM or DRAM. It can support very large context windows and large numbers of stored sessions. But latency is the key limitation. SSDs are fast compared with older storage, but they are slow compared with memory.

Therefore, NVMe is better thought of as a cold tier. It may be useful for very long-running sessions, archival context, checkpointing, or workloads where latency is less strict. It is usually not the preferred place for active KV cache in high-performance inference.

4. CXL Memory Expansion and Pooling

CXL attempts to occupy the middle ground. It is not as fast as GPU HBM, but it can be much larger and cheaper. It is not as cheap as SSD storage, but it is far faster and more memory-like.

GPU HBM    = hottest KV blocks and model-critical data
CXL memory = large shared pool for warm KV cache
NVMe       = cold spillover and persistence

The main promise is better utilization. Instead of buying more GPUs just to obtain more HBM, an operator can attach or pool additional memory. This can help serve longer contexts, more concurrent users, or agentic workflows that maintain larger working histories.

CXL also matters because it is a standard. Proprietary memory fabrics can be powerful, but standards tend to encourage broader ecosystems: memory modules, switches, CPUs, accelerators, servers, and software stacks that interoperate.

However, CXL is not magic. It still has latency, bandwidth, software, topology, and adoption constraints. Performance depends heavily on how the inference runtime decides what data should live in HBM versus CXL memory. Poor placement can erase the benefits. Good placement can make CXL a useful pressure valve for memory-bound inference.
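One simple placement policy is least-recently-used demotion: a block that is touched moves to HBM, and overflow cascades from HBM to CXL and from CXL to NVMe. The toy class below is only a sketch of that idea. `TieredKVCache` and its methods are hypothetical names, capacities are counted in blocks, and no data is actually moved. Production runtimes would also weigh reuse probability, prefix sharing, and transfer cost.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy LRU placement: HBM overflows into CXL, CXL overflows into NVMe."""

    def __init__(self, hbm_blocks: int, cxl_blocks: int):
        self.capacity = {"hbm": hbm_blocks, "cxl": cxl_blocks}
        # Each OrderedDict keeps the least recently used block id first.
        self.tiers = {"hbm": OrderedDict(), "cxl": OrderedDict(), "nvme": OrderedDict()}

    def tier_of(self, block_id):
        """Return the tier name holding block_id, or None if unknown."""
        for name, blocks in self.tiers.items():
            if block_id in blocks:
                return name
        return None

    def touch(self, block_id):
        """Record an access: place or promote the block into HBM."""
        for blocks in self.tiers.values():
            blocks.pop(block_id, None)
        self.tiers["hbm"][block_id] = True
        self._demote("hbm", "cxl")
        self._demote("cxl", "nvme")

    def _demote(self, src: str, dst: str):
        """Move the coldest blocks down one tier until src fits its capacity."""
        while len(self.tiers[src]) > self.capacity[src]:
            victim, _ = self.tiers[src].popitem(last=False)
            self.tiers[dst][victim] = True


cache = TieredKVCache(hbm_blocks=2, cxl_blocks=2)
for block in range(5):
    cache.touch(block)
cache.touch(0)  # block 0 had fallen to NVMe; reuse pulls it back into HBM
print({block: cache.tier_of(block) for block in range(5)})
```

Promoting on every touch is the crudest possible rule. It illustrates the risk the text describes: a policy that promotes blocks without regard to reuse can thrash HBM and erase the benefit of the larger tier.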

5. Distributed KV Cache Across GPUs or Nodes

Another strategy is to distribute KV cache across multiple GPUs or servers.

GPU 1 stores part of the KV cache
GPU 2 stores part of the KV cache
GPU 3 stores part of the KV cache

This can work well in large-scale systems, especially when paired with high-speed interconnects such as NVLink, NVSwitch, InfiniBand, or high-performance Ethernet with RDMA.

The upside is that the system uses memory already present across the cluster. The downside is complexity. Distributed cache management, network traffic, synchronization, and failure handling all become harder. Performance also depends on the quality of the interconnect.
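There are several ways to split a KV cache across devices, for example by layer, by attention head, or by sequence position. The sketch below shows only the by-layer split, as an assumption chosen for simplicity. The function name `partition_layers` is hypothetical, and the per-layer size reuses the 70B-class GQA figures from earlier.

```python
def partition_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Assign contiguous layer ranges to GPUs as evenly as possible."""
    base, extra = divmod(num_layers, num_gpus)
    ranges, start = [], 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)  # first `extra` GPUs take one more
        ranges.append(range(start, start + count))
        start += count
    return ranges


GIB = 1024 ** 3
# KV bytes per layer for one 131,072-token sequence: 8 KV heads, dim 128, FP16.
per_layer_bytes = 2 * 8 * 128 * 131_072 * 2

assignment = partition_layers(num_layers=80, num_gpus=3)
for gpu, layers in enumerate(assignment, start=1):
    share = len(layers) * per_layer_bytes / GIB
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}, {share:.1f} GiB per sequence")
```

The split evens out capacity, but every decoded token still has to pass through all three devices. That is where the interconnect and synchronization costs mentioned above come from.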

6. KV Cache Compression

Instead of expanding memory, another path is to reduce how much memory the KV cache consumes.

FP16 KV cache  ->  INT8 KV cache  ->  INT4 KV cache

Lower precision reduces the number of bytes per cached key/value entry. Other methods selectively evict less important tokens, keep only recent context, compress older context, or use approximate attention techniques.

The advantage is obvious: if the KV cache becomes smaller, the system needs less HBM, less DRAM, less CXL memory, and less storage. The tradeoff is quality and complexity. More aggressive compression may reduce model accuracy, degrade recall over long contexts, or introduce edge-case behavior.
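The quality side of that tradeoff comes from rounding error. The sketch below applies symmetric INT8 quantization to a single short vector, one basic scheme among many. It is written in pure Python for readability, the function names and sample values are made up, and real KV quantizers typically work per channel or per group and store the scales alongside the codes.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: map the largest magnitude to +/-127."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


keys = [0.12, -0.80, 0.33, 0.05, -0.41]   # made-up sample values
codes, scale = quantize_int8(keys)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(keys, restored))

print("codes:    ", codes)
print("max error:", max_err, "(bounded by scale / 2 =", scale / 2, ")")
```

Each stored value drops from two bytes to one, at the price of an error of up to half a quantization step per element. INT4 halves the storage again, but with 16 levels instead of 256 the step size is far larger.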

7. Model Architecture Changes

Another way to reduce KV pressure is to change the model itself. Standard transformer attention creates large KV caches. Newer or modified architectures can reduce this burden.

Multi-Query Attention
Grouped-Query Attention
sliding-window attention
sparse attention
state-space models
linear attention variants
hybrid architectures

These architectural changes are powerful because they attack the root cause. But they also require model-level choices. Infrastructure providers cannot always assume customers will use a specific architecture. Many high-value workloads will continue to run transformer models with large KV caches, especially when long-context accuracy matters.
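A quick comparison shows how strongly two of these choices move the KV footprint. The sketch below keeps the 80-layer, head-dimension-128, FP16 setup used earlier and varies only the number of KV heads and the number of cached tokens. The 4,096-token window, and the assumption that it applies to every layer, are illustrative choices rather than properties of any specific model.

```python
def kv_gib(kv_heads: int, cached_tokens: int, layers: int = 80,
           head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for one sequence."""
    return 2 * layers * kv_heads * head_dim * cached_tokens * bytes_per_value / 1024 ** 3


CONTEXT = 128 * 1024   # 131,072 tokens
WINDOW = 4096          # hypothetical sliding-window length

variants = {
    "multi-head attention (64 KV heads)": kv_gib(64, CONTEXT),
    "grouped-query attention (8 KV heads)": kv_gib(8, CONTEXT),
    "multi-query attention (1 KV head)": kv_gib(1, CONTEXT),
    "grouped-query + sliding window": kv_gib(8, min(CONTEXT, WINDOW)),
}
for name, size in variants.items():
    print(f"{name:<40} {size:7.2f} GiB per sequence")
```

The spread is more than two orders of magnitude, so the model in use matters as much as the memory attached to the server.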

8. Photonic and Optical Memory Interconnects

A more future-facing alternative is photonic memory connectivity. Photonics uses light rather than electrical signaling for data movement. In AI infrastructure, the most relevant near-term use is not necessarily all-optical memory, but optical interconnects that move data between compute and memory with higher bandwidth and potentially better power efficiency.

The motivation is clear: as AI clusters scale, moving data becomes one of the dominant constraints. Electrical links face limits in bandwidth, reach, heat, and energy efficiency. Optical links may help connect memory pools, accelerators, and racks more efficiently.

For now, photonic memory systems are less mature than GPU HBM, CPU DRAM offload, NVMe tiering, or CXL memory expansion. But the direction is important. If AI workloads continue to become more memory-bound, optical interconnects may become central to next-generation memory architectures.

The Software Bottleneck: NUMA, Pages, and Jitter

CXL memory expansion is not only a hardware story. It becomes useful for inference only when the software stack can place memory intentionally. If a CXL pool is exposed to Linux as a slower NUMA node and the inference runtime treats it as ordinary anonymous memory, the kernel may migrate pages reactively, compact huge pages, or fault data into the wrong tier at exactly the wrong time.

That matters because token generation is latency-sensitive. A background page migration, a reactive page fault, or a poorly timed huge-page operation can create tail-latency spikes that are much more damaging than a modest average-latency penalty. For high-concurrency serving, jitter can be as important as raw bandwidth.
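The difference between an average penalty and a tail spike is easy to see with synthetic numbers. In the sketch below, both latency profiles are invented for illustration: one pays a uniform 20% penalty on every token, the other is faster on most tokens but stalls on 2% of them.

```python
import math
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic per-token latencies in milliseconds (assumed, not measured).
steady = [24.0] * 1000                   # uniform 20% penalty over a 20 ms baseline
spiky = [20.0] * 980 + [200.0] * 20      # 2% of tokens hit a migration stall

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(f"{name:<7} mean={statistics.mean(samples):6.1f} ms  "
          f"p50={percentile(samples, 50):6.1f} ms  "
          f"p99={percentile(samples, 99):6.1f} ms")
```

The spiky profile has the better mean and median, yet its 99th percentile is several times worse. For interactive serving, that tail is what users notice.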

Runtime implication: Systems such as vLLM, SGLang, TensorRT-LLM, or custom serving stacks need explicit awareness of memory tiers. The practical pattern is to pin hot KV blocks in HBM, allocate warm blocks in CXL memory through a deliberate block manager, and avoid letting generic OS policies make token-path placement decisions implicitly.

bad path:
anonymous allocation -> page fault -> kernel migrates into slow tier -> decode jitter

better path:
runtime block manager -> explicit HBM/CXL placement -> predictable token latency
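Explicit placement starts with knowing which NUMA node is the slow tier. On Linux, CXL memory that has been onlined as system RAM usually appears as a NUMA node with memory but no CPUs, so a first step is to list CPU-less nodes from sysfs. The sketch below does only that. The function name is hypothetical, and a CPU-less node is not guaranteed to be CXL, so the result should be confirmed against the platform's own inventory.

```python
from pathlib import Path


def cpuless_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> list[int]:
    """Return ids of NUMA nodes whose cpulist is empty (memory-only nodes)."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs is not mounted
    nodes = []
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        # A memory-only node exposes a cpulist file with no CPU ranges in it.
        if cpulist.is_file() and cpulist.read_text().strip() == "":
            nodes.append(int(node_dir.name[len("node"):]))
    return sorted(nodes)


if __name__ == "__main__":
    print("candidate memory-only NUMA nodes:", cpuless_numa_nodes())
```

With a node id in hand, a serving process can bind its warm-block pool to that node deliberately, for example with `numactl --membind` or libnuma's `numa_alloc_onnode`, rather than waiting for the kernel to migrate pages there.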

Vendor Reality Check

The CXL ecosystem is no longer purely theoretical, but it is also not yet as mature as PCIe SSDs or conventional DDR memory. The market is in the early deployment phase: modules, controllers, platform validation, and software stacks are converging, with adoption led by hyperscale, cloud, HPC, in-memory database, and AI inference use cases.

Status is summarized from public product pages and announcements available as of May 2026. Exact availability can vary by region, OEM platform, and customer qualification cycle.

| Company | Representative CXL Offering | Status / Signal | Relevant Use Case |
| --- | --- | --- | --- |
| Samsung | CMM-D and CMM-H memory modules | CMM-D sampling; CMM-H positioned as a DRAM + NAND CXL Type 3 device for tiered memory and persistence. | Memory expansion, tiered memory, AI inference, analytics, in-memory databases. |
| Micron | CZ120 CXL memory expansion module | Publicly positioned for Supermicro Petascale platforms and CXL 2.0 adoption. | Server memory expansion for AI, HPC, and memory-intensive platforms. |
| Astera Labs | Leo CXL Smart Memory Controllers | Controller and add-in-card platform for memory expansion, sharing, and pooling; public cloud validation signals emerging platform adoption. | CXL controller layer, pooling, sharing, diagnostics, cloud server memory scaling. |
| Marvell | Structera CXL controllers | Interoperability-focused memory expansion controller platform aimed at high-capacity server memory. | CXL controller silicon, memory expansion, near-memory compute direction. |

The Tradeoff Map

The memory hierarchy for AI inference can be understood as a tradeoff between speed, capacity, cost, and complexity.

Approximate order-of-magnitude figures. Actual latency and bandwidth depend on GPU, CPU, memory media, PCIe/CXL generation, topology, switch hops, NUMA placement, firmware, and software scheduling.

| Memory Tier | Typical Bandwidth | Approximate Latency Class | Engineering Interpretation |
| --- | --- | --- | --- |
| GPU HBM3e | Several TB/s per GPU, roughly 1 TB/s+ per HBM stack | Lowest practical tier for GPU-resident data | Best for model-critical state and hottest KV blocks. |
| Local CPU DDR5 | ~40-60 GB/s per channel, platform aggregate in the hundreds of GB/s | Tens to low hundreds of ns, CPU-local access class | Cheap and mature, but crossing into GPU use adds interconnect cost. |
| CXL 2.0 / 3.0 Type 3 | ~32-64 GB/s class per direction per PCIe Gen5 x8 to x16 link (CXL 2.0); higher with aggregation or with CXL 3.x on PCIe Gen6 | Often modeled as a slower NUMA memory tier, roughly hundreds of ns depending on topology | A warm tier: much slower than HBM, much more memory-like than SSD. |
| NVMe Gen5 SSD | ~10-14 GB/s sequential class per high-end drive | Microseconds to tens of microseconds through the block/storage path | Excellent cold capacity tier, dangerous for active token-path KV reuse. |

GPU HBM: Fastest and most expensive. Best for the hottest data and strict latency paths.
CXL Memory: Large and memory-like. A potential warm tier for KV cache and memory expansion.
CPU DRAM: Widely available and cheaper than HBM, but slower across the CPU-GPU boundary.
NVMe / SSD: Huge and cheap, but much slower. Better for cold spillover and persistence.
Photonic Links: Future bandwidth and efficiency layer for scaling memory movement across systems.

When HBM Wins

Shorter contexts, strict latency targets, premium workloads, and deployments where cost is secondary to speed.

When CXL Becomes Interesting

Long context, high concurrency, large active sessions, and cases where buying more GPUs just for memory is inefficient.

When NVMe Helps

Cold cache, persistence, archival context, checkpointing, and workloads that can tolerate higher access latency.

When Software Wins

Compression, better attention patterns, and new model architectures can reduce the memory burden before hardware is added.

Conclusion

There is no single universal winner. The right architecture depends on workload. For a low-latency chatbot with short prompts, GPU HBM may be enough. For enterprise agents reading long documents while serving many users, HBM alone may become too expensive. For long-context workloads with moderate latency tolerance, DRAM or NVMe offload may be acceptable.

For high-concurrency enterprise inference, CXL is compelling because it offers a larger memory tier without falling all the way down to storage-like latency. It represents a serious attempt to introduce a new layer into the AI memory hierarchy: larger than HBM, faster than storage, more flexible than conventional server DRAM, and more standardized than proprietary memory fabrics.

The deeper point: AI inference is evolving from a pure compute problem into a memory architecture problem. Long-context, high-concurrency AI needs more than faster GPUs. It needs smarter memory.

The sharper stance is this: if context windows keep growing, if agent loops retain longer working state, and if enterprises expect many concurrent users with low latency, a middle memory tier becomes very hard to avoid. CXL is the most credible open-standard candidate for that tier in the near term.

The counter-case is equally clear. If model architectures reduce KV cache by an order of magnitude, if compression preserves quality at very low precision, or if GPU HBM capacity expands faster than inference demand, CXL remains important but more specialized. The market will be decided by the gap between model memory appetite and affordable HBM supply.

Sources and Notes

Performance figures are approximate engineering ranges, not guaranteed product specifications. They are included to establish order-of-magnitude boundaries between HBM, DRAM, CXL memory, and NVMe storage.