M MAN\SH AI
◆ Deep Dive · Infrastructure

NIXL, NCCL, and UCX

What each one actually does, where they overlap, where they don't, and why a system can exhaust host memory long before GPU HBM looks unhealthy.


The Short Version

NIXL is an inference-oriented data movement layer. NCCL is a GPU collective communication library. UCX is a communication framework and transport abstraction that can sit beneath higher-level systems. They solve related but different problems — and mixing them up leads to bad debugging decisions.

Questions Answered

  • How does NIXL differ from NCCL?
  • Where does UCX fit in the stack?
  • Why can host memory grow while GPU memory looks fine?
  • What should operators watch first?

01 Definitions and the right mental model

The cleanest way to think about these three technologies is by asking what decision each one is responsible for. They are not peers at the same abstraction level — and that asymmetry is the first thing to internalize.

NIXL
Decides what inference data should move, between which memory or storage tiers, and with which backend. Oriented around point-to-point transfer in distributed inference workflows — especially KV cache movement across GPU, DRAM, and storage domains.
NCCL
Decides how groups of GPUs exchange and reduce data collectively. Its center of gravity is collectives like all-reduce, all-gather, reduce-scatter, broadcast, and related group communication patterns for multi-GPU or multi-node execution.
UCX
Decides which transport and protocol path to use for efficient messaging and remote-memory movement across RDMA, RoCE, TCP, shared memory, and GPU-aware paths. It is a communication substrate, not an inference policy layer.
Figure 1 — Abstraction layers & ownership
[Diagram: the application or serving framework (PyTorch, vLLM, TensorRT-LLM, SGLang…) sits above NIXL (inference data movement API) and NCCL (collective communication library). NIXL can use UCX (Unified Communication X, a transport framework) as a backend; NCCL reaches UCX only in some environments or through plugins, and otherwise uses its internal transports (own net stack, NVLink, IB, NVLS…). UCX drives InfiniBand (RDMA/Verbs), RoCE (RDMA over Ethernet), shared memory (intra-host), and GPU-aware paths (GDR / NVLink).]

Key insights:

  • NIXL and NCCL operate at different abstraction levels
  • UCX is a transport substrate, not a peer of NIXL/NCCL
  • NCCL can bypass UCX entirely in many deployments
  • NIXL can use UCX as one of multiple pluggable backends

The biggest mistake people make is treating all three as interchangeable "GPU networking libraries." They are not peers at the same abstraction level, and conflating them makes both architecture decisions and debugging harder than they need to be.
NIXL thinks in inference objects: KV cache blocks, tensors, tiers, staging, asynchronous movement, backend abstraction, disaggregated serving.
NCCL thinks in communicator groups: ranks, collectives, topology-aware bandwidth, GPU-to-GPU synchronization, ring/tree algorithms.
UCX thinks in transports: tag matching, active messages, RMA, GPU-aware transfers, registration caches, lanes, protocol selection.

02 How NIXL differs from NCCL

NIXL and NCCL both move data in GPU systems, but they are optimized for fundamentally different workloads and communication shapes. The table below lays out the key dimensions.

Dimension | NIXL | NCCL
Primary job | Inference-oriented point-to-point data movement across memory and storage tiers | Topology-aware multi-GPU collective communication
Typical data | KV cache blocks, tensors, staged inference data | Gradients, activations, parameters, sharded tensors, collective payloads
Typical pattern | Peer-to-peer transfer, tiered movement, asynchronous reads and writes | All-reduce, broadcast, all-gather, reduce-scatter, send/recv
Optimization target | Flexible movement across heterogeneous memory and storage paths | Maximum bandwidth and low latency for GPU group communication
Storage awareness | Yes, explicitly part of the design | No, not its core problem
Backend model | Pluggable backends (UCX, GDS, NVMe, etc.) | Own net stack; plugins for specific environments
Inference workflow fit | Very strong for disaggregated serving and KV-cache transfer | Useful where inference uses tensor/expert/pipeline-parallel collectives
Figure 2 — Communication patterns: NIXL (point-to-point) vs NCCL (collectives)
[Diagram, left panel "NIXL: Point-to-Point / Tiered": NIXL coordinates KV cache movement from a prefill GPU node directly to a decode GPU node (GPU→GPU), with host DRAM as an offload tier and NVMe / SSD as a long-term tier. Right panel "NCCL: All-Reduce / Collective": GPU 0 through GPU 3 (ranks 0 through 3) synchronize through an all-reduce in which every rank sends and receives, giving global gradient synchronization.]

NIXL is not "NCCL for inference"

That shorthand is tempting, but it misleads. NCCL is fundamentally organized around collectives and communicators: in an all-reduce, every participant contributes and everyone receives the reduced result. NIXL is organized around a more general inference transfer problem: moving data rapidly across GPUs, host memory, and storage tiers using a unified interface and pluggable backends. The two libraries have little semantic overlap; NCCL does offer point-to-point send/recv, but only between ranks of a communicator.
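The difference in communication shape can be made concrete with a toy model. The sketch below is plain Python with no GPU or network involved; `all_reduce_sum` and `point_to_point` are hypothetical names chosen for illustration and are not NCCL or NIXL APIs.

```python
def all_reduce_sum(rank_buffers):
    """Collective shape: every rank contributes, every rank gets the reduced result."""
    if not rank_buffers:
        return []
    # Element-wise sum across all ranks.
    reduced = [sum(values) for values in zip(*rank_buffers)]
    # Each rank ends up with its own copy of the same reduced buffer.
    return [list(reduced) for _ in rank_buffers]


def point_to_point(stores, src, dst, block_id):
    """Transfer shape: one named block moves from one store to another."""
    stores[dst][block_id] = stores[src][block_id]


# Four ranks, each holding a two-element buffer.
ranks = [[1, 2], [3, 4], [5, 6], [7, 8]]
after = all_reduce_sum(ranks)

# Two memory domains; only the source and destination take part.
stores = {"prefill_gpu": {"kv_block_7": b"\x00" * 16}, "decode_gpu": {}}
point_to_point(stores, "prefill_gpu", "decode_gpu", "kv_block_7")
```

In the collective, all four ranks participate and all four end with the same buffer. In the transfer, nothing outside the two named stores is involved, which is why the two models call for different debugging lenses.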

In practice: if you are debugging expert-parallel all-to-all behavior inside a training job or a tightly synchronized multi-rank execution, NCCL is usually at the center of the problem. If you are debugging disaggregated serving, KV cache handoff, prefill-to-decode transfer, or movement between GPU and non-GPU tiers, NIXL is much more likely to be in the middle of the story.

NCCL is not "just a transport" either

NCCL is much higher-level than UCX. It has its own communicator model, launch semantics, collective algorithms, topology logic, and GPU-centric performance strategy. Depending on environment and plugins it may use different underlying network paths — but the programming model you interact with is nothing like UCX's.

Debugging heuristic: If your workload is mostly "move this KV cache block from here to there," NIXL is the natural conceptual lens. If it is "synchronize or redistribute state across many ranks efficiently," NCCL is the natural lens. Applying the wrong lens leads to misread logs and wasted hours.

03 Where UCX fits

UCX is easiest to understand as a communication framework that exposes abstract messaging and memory-access primitives while automatically choosing efficient underlying transports. It supports fabrics like InfiniBand, RoCE, TCP, shared memory, and GPU-aware paths. It handles many of the details that make modern communication stacks hard: transport selection, protocol choice, registration behavior, memory-type awareness, and zero-copy fast paths.

Figure 3 — UCX internal architecture: APIs, layers, and transport selection
[Diagram: a caller (NIXL, MPI, a custom app…) calls ucp_tag_send / ucp_put / ucp_am_send… Inside the UCX framework, the UCP API (Tag | Stream | AM | RMA | Atomics) provides unified high-level communication primitives. Below it, a transport selection engine weighs device type, memory type, bandwidth, latency, and capabilities, then chooses among ib_rc / ib_dc (InfiniBand RDMA), roce / mlx5 (RDMA over Ethernet), tcp / ud (fallback / WAN), posix / cma (shared memory), and cuda_ipc / gdr (GPU-aware / GDRCopy). The registration cache sits alongside the transports and is flagged ⚠ as a source of host memory growth.]
What UCX is: A framework with APIs for tag matching, stream send/receive, active messages, remote memory access (RMA/RDMA), and atomics, plus sophisticated transport selection logic underneath. It exposes a unified programming model over wildly different hardware.
What UCX is not: Not an inference policy layer. Not a KV-cache manager. Not a collective-communication replacement for NCCL's main role. It does not know what a "KV cache block" is; it knows what a memory region, a transport lane, and a registered buffer are.

How UCX relates to NIXL

NIXL can use UCX as one backend among others. That means NIXL may request a transfer at the inference layer, while UCX handles the lower-level communication path used to realize that transfer. In that arrangement, NIXL owns the semantic decision ("move this KV cache slab to that decode node") while UCX owns the transport decision ("use the RoCE path on HCA 3 with GPU-aware zero-copy").
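The split between semantic and transport decisions can be sketched as two small pieces of code that know different things. This is only an illustration under stated assumptions: `TransferRequest` and `pick_transport` are hypothetical names, not NIXL or UCX APIs, and real UCX transport selection weighs far more than the two conditions shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRequest:
    """Semantic intent (the NIXL-like layer): what to move and where."""
    block_id: str
    src_node: str
    dst_node: str
    nbytes: int


def pick_transport(req, rdma_nodes):
    """Transport choice (the UCX-like layer): sees endpoints and sizes, not KV caches."""
    if req.src_node == req.dst_node:
        return "shared_memory"
    if req.src_node in rdma_nodes and req.dst_node in rdma_nodes:
        return "rdma"
    return "tcp"


req = TransferRequest("kv_slab_42", "prefill-0", "decode-3", nbytes=64 * 1024 * 1024)
path = pick_transport(req, rdma_nodes={"prefill-0", "decode-3"})
```

The request carries inference meaning (a KV slab bound for a decode node), while the transport function never looks at `block_id`. That separation is the reason host-side symptoms tend to surface in the lower layer even when the application only names the upper one.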

How UCX relates to NCCL

NCCL is not simply "built on UCX" in every setup. The relationship is looser and deployment-dependent. Some environments integrate NCCL with UCX through plugins or surrounding communication stacks, but NCCL's core identity remains a collective communication library with its own network layer. The safe statement is:

NIXL: can use UCX as a pluggable backend
NCCL: can be deployed alongside UCX-related paths in some environments
UCX: transport framework; not the semantic owner of either library

The most useful operator framing

If you are looking at host symptoms like registration growth, pinned memory, multi-rail fan-out, or GPU-aware RDMA path selection, UCX is often the place where the transport mechanics become visible. NIXL or NCCL may be the top-level library your application names, but UCX may be where the memory registration and lane-selection behavior actually materializes — and where the host memory bill arrives.

04 Memory registration mechanics

Before we can understand why host memory blows up, we need to understand what memory registration actually is and why it is unavoidably expensive on modern hardware.

What "registering" a memory region means

When a NIC (HCA) wants to directly read from or write to a memory buffer — bypassing the CPU — it needs to know that the buffer will stay at a fixed physical address, that it has permission to access it, and how to translate virtual addresses to physical ones. Memory registration is the act of establishing all of that:

Pin the pages: The OS is told the pages must not be swapped out or moved. They are "pinned" in physical memory. This consumes kernel VM state, not GPU HBM.
Build translation tables: The HCA firmware and the IOMMU each receive a mapping from virtual to physical address ranges. Each entry lives in host DRAM and device firmware memory.
Create a memory region object (MR): The RDMA driver creates a kernel-side ib_mr object. UCX wraps this in its own memory handle (uct_mem_h) with reference counts, cache keys, and metadata.
Issue hardware credentials: The NIC receives an L-Key (local key) and R-Key (remote key), capability tokens that authorize future DMA. These live in NIC firmware tables with bounded capacity.
Figure 4 — What happens when a single GPU buffer is registered across 4 HCAs
[Diagram: one logical tensor buffer in GPU HBM (e.g. a 4 GB KV cache slab) is registered with HCA 0 through HCA 3. Each HCA holds its own NIC firmware tables, MR, pin, and IOMMU entry. Registration overhead accumulates in host DRAM: 4× MR kernel objects, 4× IOMMU mappings, 4× pinned page lists, plus UCX reg-cache entries, so 4× cost for 1 logical buffer; adding a second MR mode makes it 8×. GPU HBM stays stable at 4 GB used and looks "fine".]
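Pinning and translation are per-page work, so the bookkeeping behind a single registration scales with the number of pages the buffer spans. The sketch below only counts pages. It assumes the 4 GB slab from Figure 4 is 4 GiB, and the page sizes are illustrative, since the effective page size depends on the platform and the memory type.

```python
def pages_spanned(buffer_bytes, page_bytes):
    """Number of pages needed to cover a buffer (ceiling division)."""
    if buffer_bytes < 0 or page_bytes <= 0:
        raise ValueError("buffer_bytes must be >= 0 and page_bytes must be > 0")
    return -(-buffer_bytes // page_bytes)


GIB = 1024 ** 3
slab_bytes = 4 * GIB  # assumption: the article's "4 GB" slab read as 4 GiB

# Illustrative page sizes; not a claim about any particular driver.
page_sizes = {"4 KiB": 4 * 1024, "64 KiB": 64 * 1024, "2 MiB": 2 * 1024 ** 2}

page_counts = {label: pages_spanned(slab_bytes, size) for label, size in page_sizes.items()}
for label, count in page_counts.items():
    print(f"{label:>6} pages: {count:>9,} page entries per registration")
```

With 4 KiB pages the slab spans 1,048,576 pages; with 2 MiB pages it spans 2,048. Whatever the per-entry cost is on a given system, it is paid once per registration, and therefore once per HCA that registers the buffer.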

Why the registration cache makes it worse

UCX keeps a registration cache — a table mapping memory address ranges to existing MR handles — so it can reuse registrations rather than paying the ibv_reg_mr cost every time. This is a performance optimization, and it works. But it means memory regions stay registered even after a transfer completes, until the cache evicts them. The cost is not "pay once per transfer," it is "pay once and hold until eviction pressure." In a long-running inference server with growing KV cache churn, this cache grows.

The registration cache is both the solution and the problem. Without it, every transfer pays ibv_reg_mr latency (roughly microseconds to milliseconds, depending on buffer size). With it, host memory grows with the number of distinct address ranges that have been registered and not yet evicted, so under steady KV cache churn it can keep climbing for hours.
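A toy model makes the tradeoff visible. The class below is a minimal sketch of a registration cache keyed by address range with optional least-recently-used eviction; it is not UCX's implementation, and `ToyRegCache` and its `max_entries` parameter are hypothetical names used only for illustration.

```python
from collections import OrderedDict
from itertools import count


class ToyRegCache:
    """Maps (address, length) to a fake MR handle; optionally evicts in LRU order."""

    def __init__(self, max_entries=None):
        self.max_entries = max_entries  # None means "never evict"
        self._entries = OrderedDict()
        self._next_handle = count(1)
        self.hits = 0
        self.misses = 0

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        handle = next(self._next_handle)  # stand-in for an expensive registration
        self._entries[key] = handle
        if self.max_entries is not None and len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used entry
        return handle

    def __len__(self):
        return len(self._entries)


unbounded = ToyRegCache()
bounded = ToyRegCache(max_entries=100)

# Simulate churn: 1000 distinct 1 MiB ranges, each transferred once.
for i in range(1000):
    addr = 0x1000_0000 + i * 0x10_0000
    unbounded.get(addr, 0x10_0000)
    bounded.get(addr, 0x10_0000)

# Reusing the very first range hits only in the cache that never evicted it.
unbounded.get(0x1000_0000, 0x10_0000)
bounded.get(0x1000_0000, 0x10_0000)
```

The unbounded cache ends with 1000 live entries and serves the reuse as a hit. The bounded cache stays at 100 entries but pays for the reuse with a fresh registration. Neither setting is free, so the useful question is which cost the workload can better afford.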

05 Why host memory can blow up first

This is the part that surprises people most. A system can have apparently healthy GPU HBM usage and still be marching toward a host OOM. That is not contradictory. GPU memory accounting and host memory accounting are completely separate, and data movement libraries can create substantial host-side state even when the main payload lives entirely on the GPU.

Figure 5 — Typical memory growth pattern: GPU HBM stable, host DRAM climbing
[Chart: memory used (%) from 0% to 100% on the y-axis, time with the serving workload running from 0h to 10h on the x-axis. GPU HBM stays flat at about 65%, while host DRAM (registration + control state) climbs steadily toward the OOM zone.]

The key idea: moving GPU data requires host-side control structures

When GPU buffers are used with RDMA-capable communication, the stack needs to register or pin memory and create mapping metadata so the NIC, kernel driver, IOMMU, and communication library can access those buffers safely and efficiently. That work is not "free," and the overhead lives largely on the host side.

GPU HBM: Stores the payload: tensors, KV cache, activations, model state. This is what your GPU memory monitor shows. It can be stable while the host is under pressure.
Host DRAM: Stores registration metadata, driver state, firmware page tables, pinned-page lists, UCX cache entries, orchestration objects, and per-rail duplicates.
NIC firmware: Tracks memory region credentials (L-Keys, R-Keys), address translation structures, and transfer context. Capacity is limited and bounded by hardware.

Why multi-NIC systems are especially vulnerable

Suppose a host has one GPU buffer but eight reachable HCAs. If the software stack registers that buffer across all eight NICs for maximum bandwidth (a normal multi-rail configuration), you do not have one registration cost — you have eight. If an additional policy or mode causes a second memory-region representation per HCA, you now have sixteen units of registration overhead for what the application still thinks of as one logical buffer.
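The multiplication above is simple enough to write down directly. The helper below is a sketch with a hypothetical name, `registration_units`; it counts units of registration overhead and deliberately does not try to convert them into bytes, since that depends on the driver, firmware, and page size.

```python
def registration_units(logical_buffers, hcas, mr_modes_per_hca=1):
    """Units of registration overhead: one per (buffer, HCA, MR mode) combination."""
    for name, value in (("logical_buffers", logical_buffers),
                        ("hcas", hcas),
                        ("mr_modes_per_hca", mr_modes_per_hca)):
        if value < 0:
            raise ValueError(f"{name} must be non-negative")
    return logical_buffers * hcas * mr_modes_per_hca


one_buffer_eight_rails = registration_units(1, hcas=8)
with_second_mr_mode = registration_units(1, hcas=8, mr_modes_per_hca=2)
many_buffers = registration_units(1000, hcas=8, mr_modes_per_hca=2)
```

One buffer on eight rails gives 8 units, a second MR representation per HCA gives 16, and a thousand such buffers give 16,000. The application still sees a thousand buffers; the host sees sixteen times that.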

The punchline: GPU memory usage tells you how much payload lives on the GPU. It does not tell you how much host-side machinery was created to move that payload efficiently. These two numbers are almost completely unrelated in a multi-rail RDMA deployment.

What actually consumes host memory

  • Memory registration metadata (ib_mr kernel objects)
  • Pinned-page tracking structures
  • NIC firmware page tables and related accounting
  • Driver-side memory region objects
  • IOMMU and DMA mapping state
  • Transport-layer caches, including UCX registration cache
  • Per-rail or per-HCA duplication of otherwise similar state
  • NIXL/UCX orchestration objects per active transfer

What operators typically see (illustrative snapshot)

[Illustrative snapshot: GPU HBM at 65%, host memory with 22% left, and two further indicators reading high ⚠.]

A monitoring dashboard may show GPU HBM "healthy" at 65% while host memory is in crisis. The GPU metric is not wrong — it just does not capture registration overhead.

What not to conclude too early

Avoid these reasoning shortcuts:

“GPU memory looks fine, so this cannot be a communication problem.”
“The model is not near HBM limits, so the host OOM must be unrelated.”
“UCX or RDMA improves performance, so its overhead must be negligible.”
“One logical buffer means one host-side registration cost.”

06 A practical debugging model

When you suspect NIXL, NCCL, or UCX is part of a memory-pressure issue, the most useful question is not "which library is bad?" The better question is:

# Layer ownership framework
Semantic layer: Who decided what data to move? (NIXL, NCCL, your code)
Transport layer: Who picked the path and realized it? (UCX, NCCL net, kernel RDMA)
Accounting layer: Whose host-side state is growing? (UCX reg-cache, kernel MRs, IOMMU)
If you observe… | Look first at | Why
KV cache handoff problems or inference-tier movement issues | NIXL | Closest to inference data movement semantics.
Collective slowdowns across GPUs or ranks | NCCL | That is NCCL's home territory.
Registration pressure, lane selection, transport duplication, GPU-aware RDMA behavior | UCX and the RDMA stack | Low-level communication mechanics typically become visible here.
Host OOM with normal GPU memory | Host memory: MemAvailable, cgroups, PSI, reclaim, mlx5 MR counters | Failure mode is often host-side accounting, not HBM exhaustion.
Slow transfer ramp-up on first use but fast thereafter | UCX registration cache cold-start | First-time ibv_reg_mr is expensive; cache warms after initial passes.
Degraded throughput that correlates with NIC count | Multi-rail registration duplication | More HCAs means more registration overhead per logical buffer.
Figure 6 — Debugging decision tree for GPU cluster memory issues
[Decision tree: Memory pressure observed → Is GPU HBM near its limit? If yes: the model or KV cache is too large (reduce batch size, quantize, or offload). If no: is host MemAvailable falling? If yes: is transfer concurrency growing? If yes: check UCX reg-cache size, mlx5 MR count, and pinned pages, then tune the UCX/NIXL transport. If no: check for cgroup leaks and other processes. A separate box asks "NCCL issues?" and points to checking NCCL collective performance.]
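The tree in Figure 6 can be sketched as a small function. The name `triage` and its boolean inputs are hypothetical, and one branch is an assumption: the figure does not make clear where the "NCCL issues?" box attaches, so this sketch places it on the path where host MemAvailable is not falling.

```python
def triage(hbm_near_limit, host_mem_available_falling, transfer_concurrency_growing):
    """Return the first place to look, following the Figure 6 decision tree."""
    if hbm_near_limit:
        return "Model or KV cache too large: reduce batch size, quantize, or offload."
    if not host_mem_available_falling:
        # Assumption: the figure's NCCL box is read as the fallback on this path.
        return "Host memory is steady: check NCCL collective performance."
    if transfer_concurrency_growing:
        return ("Check UCX reg-cache size, mlx5 MR count, and pinned pages; "
                "then tune the UCX/NIXL transport.")
    return "Check for cgroup leaks and other processes."


# The scenario this article is about: HBM looks fine, host memory is draining.
advice = triage(hbm_near_limit=False,
                host_mem_available_falling=True,
                transfer_concurrency_growing=True)
```

The order of the checks matters more than the exact wording of the advice: HBM is ruled in or out first, and only then does the host-side question decide whether the transport layer is the suspect.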

Operator checklist

GPU-side signals
  • GPU HBM usage (nvidia-smi)
  • BAR1 memory usage
  • NVLink bandwidth saturation
Host-side signals (often overlooked)
  • MemAvailable from /proc/meminfo
  • Memory PSI (/proc/pressure/memory)
  • Reclaim counters (vmstat, sar -B)
  • Per-cgroup memory growth
RDMA / transport signals
  • mlx5 firmware MR page counters
  • UCX_STATS registration cache hit/miss
  • Per-HCA registration counts
Correlation signals
  • Rising host pressure vs. KV cache movement volume
  • Multi-rail fan-out degree vs. MR count
  • Time-since-start vs. host MemAvailable slope
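Several of the host-side signals in this checklist are plain text files that are easy to sample. The sketch below parses the MemAvailable line of /proc/meminfo and the contents of /proc/pressure/memory, and fits a least-squares slope to MemAvailable over time. The function names are hypothetical, and the sample strings are made-up values in the format those files use, not measurements.

```python
def parse_mem_available_kib(meminfo_text):
    """Return MemAvailable in KiB from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def parse_psi(psi_text):
    """Parse /proc/pressure/memory into {'some': {...}, 'full': {...}}."""
    parsed = {}
    for line in psi_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        parsed[fields[0]] = {
            key: float(value)
            for key, value in (item.split("=", 1) for item in fields[1:])
        }
    return parsed


def slope_per_hour(samples):
    """Least-squares slope of (seconds, value) samples, in value units per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t * 3600.0


# Made-up sample text in the on-disk format (illustrative values only).
meminfo_sample = "MemTotal:       527953604 kB\nMemAvailable:   116149792 kB\n"
psi_sample = ("some avg10=1.25 avg60=0.80 avg300=0.30 total=123456\n"
              "full avg10=0.40 avg60=0.20 avg300=0.05 total=45678\n")

available_kib = parse_mem_available_kib(meminfo_sample)
pressure = parse_psi(psi_sample)
trend = slope_per_hour([(0, 100.0), (3600, 90.0), (7200, 80.0)])
```

On a live Linux host the two parsers would be fed the text of /proc/meminfo and /proc/pressure/memory, sampled on a timer. A persistently negative MemAvailable slope alongside flat GPU HBM is the pattern from Figure 5, and the slope gives a rough time-to-exhaustion to compare against KV cache movement volume.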

The simplest accurate summary

NIXL can ask for inference-driven data movement. UCX can realize that movement across GPU-aware transports. NCCL solves a different communication problem centered on collectives. And none of these facts prevent the host from becoming the first place your system runs out of memory.

07 Bottom-line takeaways

NIXL vs NCCL: NIXL is about inference data movement across memory and storage tiers. NCCL is about collective communication across GPUs and ranks. They are not peers and they are not interchangeable debugging targets.
Where UCX fits: UCX is the transport and communication framework that higher-level systems (like NIXL) can use to implement fast GPU-aware data movement. It is a substrate, not a sibling.
Why host memory can fail first: Memory registration, page pinning, IOMMU mappings, and firmware accounting all live on the host side, even when the payload lives in GPU HBM. Multi-rail configurations multiply these costs per logical buffer.
What operators should monitor: Healthy HBM graphs do not rule out a transport-driven host OOM. Always watch MemAvailable, PSI, reclaim, and RDMA MR counters alongside GPU metrics.
The registration cache tradeoff: UCX's registration cache speeds up transfers by avoiding repeated ibv_reg_mr calls, but host memory then grows with the number of distinct address ranges that are registered and not yet evicted, which in a long-lived process can keep climbing.
The debugging mantra: Ask who owns the semantics, who owns the transport behavior, and who owns the host-side accounting that is growing. Answer those three questions before reaching for config knobs.

08 Sources

This article is based on official project documentation and primary sources: