Standalone DRAM Rack

Disaggregated Memory for Long-Context AI

A pooled DRAM tier for GPU and TPU inference clusters — large enough for KV-cache, fast enough for block-level prefetch, programmable enough to avoid remote-memory stalls.

Reference Architecture · ~12 min read · Updated 2026

Rack diagram: 42U / 48U memory appliance, 64–160 TB, built from DDR5 RDIMM ×16 blades (01, 02, 05, 06), MRDIMM ×16 blades (03, 04), an LPDDR5X ×12 blade (07), and a dual-fabric controller ASIC.

Usable rack DRAM capacity: 64–160 TB
Target block access / DMA path: 1–5 µs
Aggregate external fabric class: 8–50 TB/s
Rack power envelope: 20–60 kW

01 — System Specification

One-page system spec

Numbers are design targets, not product claims. Validate against your actual fabric, controller, workload, and SLO requirements.

Layer	Reference target	Why it matters
Rack capacity	32–40 memory blades × 2–4 TB each = 64–160 TB usable	Large enough to absorb KV-cache, embeddings, long prompts, and multi-tenant working sets without buying more GPU nodes just for memory.
Blade memory	DDR5 RDIMM / MRDIMM / LPDDR5X-class packages; 8–16 memory channels per blade	Optimized for capacity and cost per GB, not HBM-class bandwidth.
External fabric	CXL 3.x fabric, PCIe 6-class switching, or ultra-low-latency RDMA / InfiniBand / Ethernet	CXL 3.0 adds advanced switching, fabric capabilities, peer-to-peer communication, and resource sharing; RDMA designs can expose explicit object transfer semantics.
Per-node bandwidth	128–512 GB/s target into an accelerator server	Enough for scheduled KV block movement, still far below local HBM — software placement matters.
Access granularity	256 KB–1 MB object pages for KV blocks; 2–16 MB for bulk prompt segments	Amortizes fabric latency and avoids cache-line-sized remote misses.
Cost goal	Materially lower $/GB than HBM; higher $/GB than SSD/HBF	Designed as a warm memory tier between accelerator HBM and storage.
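The capacity and granularity rows can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope calculation, not a performance model: it assumes the block sizes are binary units, that bandwidth is decimal GB/s, and that only serialization time counts (no fabric latency, queueing, or protocol overhead). The helper names are illustrative.

```python
# Back-of-envelope check of the spec table's ranges (design targets, not measurements).
GB = 10**9
KB = 1024            # assumption: block sizes in the table are binary units
MB = 1024 * KB


def rack_capacity_tb(blades: int, tb_per_blade: int) -> int:
    """Usable capacity as blades x per-blade capacity, ignoring any reserve."""
    return blades * tb_per_blade


def wire_time_us(block_bytes: int, gb_per_s: float) -> float:
    """Serialization time only; ignores fabric latency, queueing, and overhead."""
    return block_bytes / (gb_per_s * GB) * 1e6


low = rack_capacity_tb(32, 2)     # smallest configuration in the table
high = rack_capacity_tb(40, 4)    # largest configuration in the table
print(f"rack capacity: {low}-{high} TB")

for size, label in [(256 * KB, "256 KB"), (1 * MB, "1 MB"), (16 * MB, "16 MB")]:
    for bw in (128, 512):
        print(f"{label:>6} @ {bw} GB/s: {wire_time_us(size, bw):8.2f} us")
```

Under these assumptions a 1 MB block at 128 GB/s needs roughly 8 µs on the wire and a 16 MB segment roughly 131 µs, which is one reason larger objects have to be issued well ahead of the moment they are needed.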

02 — Topology

Memory rack beside accelerator racks

The rack should not behave like a transparent, slow extension of HBM. It should behave like a programmable warm-memory appliance. GPUs and TPUs keep hot tensors in local HBM, while the runtime issues explicit block transfers for warm KV-cache, embeddings, and prompt segments.

The winning abstraction is not "remote memory" — it is memory intent: the runtime tells the infrastructure what an object is, when it will be reused, how urgent it is, and where it should live next.

03 — KV-Cache Offload

What changes for ML engineers

For large-model inference, KV-cache can become the capacity limiter before raw compute saturates. A DRAM rack gives the runtime an explicit warm tier.

Long context

More tokens before HBM pressure dominates

For 70B-class and larger models, older prompt segments can move to the rack's warm tier while active decode windows stay in HBM, so context length can grow before HBM capacity becomes the limiter.
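To see why KV-cache capacity bites at long context, it helps to work the size out. The sketch below uses the standard per-token KV formula with a hypothetical 70B-class shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16). Those shape numbers are assumptions for illustration, not values from this document, and the session count ignores metadata and replication overhead.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical 70B-class shape with grouped-query attention (assumed values).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

context_tokens = 128 * 1024                      # one 128K-token session
per_session_bytes = per_token * context_tokens
per_session_gib = per_session_bytes / 2**30

# How many such sessions the smallest rack (64 TB, decimal) could hold.
sessions_in_64tb = (64 * 10**12) // per_session_bytes

print(f"{per_token // 1024} KiB per token, {per_session_gib:.0f} GiB per session")
print(f"about {sessions_in_64tb} sessions in a 64 TB rack")
```

With these assumed numbers a single 128K-token session carries about 40 GiB of KV state, a large share of one accelerator's HBM, while a 64 TB rack could hold on the order of a thousand such sessions.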

Batching

Higher serving density

Schedulers can place sessions near their existing KV blocks, batch requests with compatible memory locality, and reduce memory stranding across accelerator nodes.
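A minimal sketch of locality-aware placement follows, assuming the scheduler can see which KV blocks are resident near each node and how many free serving slots each node has. The function name, the data shapes, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, Optional, Set


def pick_node(
    session_blocks: Set[str],
    resident: Dict[str, Set[str]],
    free_slots: Dict[str, int],
) -> Optional[str]:
    """Choose the node already holding the most of this session's KV blocks.

    Ties are broken by free slots. Returns None when no node has capacity.
    """
    candidates = [node for node, free in free_slots.items() if free > 0]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda node: (
            len(session_blocks & resident.get(node, set())),
            free_slots[node],
        ),
    )


resident = {"node-a": {"b1", "b2"}, "node-b": {"b1", "b2", "b3"}, "node-c": set()}
free_slots = {"node-a": 4, "node-b": 0, "node-c": 8}
choice = pick_node({"b1", "b2", "b3"}, resident, free_slots)
print(choice)
```

Here node-b has the best locality but no free slot, so the session lands on node-a, which already holds two of its three blocks.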

MoE / Agents

Better memory for sparse workflows

Mixture-of-experts, retrieval-heavy agents, and rerank pipelines often have bursty memory demand. Pooled DRAM absorbs working-set spikes without dedicating HBM to idle state.

KV object descriptor

KV_OBJECT_DESCRIPTOR · YAML-like reference schema

object_type:       KV_CACHE
model_id:          llama-class-70b-or-larger
session_id:        tenantA/request-39281
layer_range:       0..79
token_range:       8192..16384
precision:         fp8 | int8 | fp16
placement:         HBM_HOT | LOCAL_DRAM | RACK_DRAM | SSD_COLD
reuse_probability: 0.74
deadline_us:       150
replication:       metadata=2x, data=optional
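One way a serving framework might carry this descriptor in code is a small validated record. The sketch below mirrors the schema's field names; the class, the enum, and the validation rules are assumptions for illustration. The ranges are stored as (first, last) pairs without deciding whether the schema's `..` is inclusive, and the replication field is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Tier(Enum):
    HBM_HOT = "HBM_HOT"
    LOCAL_DRAM = "LOCAL_DRAM"
    RACK_DRAM = "RACK_DRAM"
    SSD_COLD = "SSD_COLD"


@dataclass(frozen=True)
class KVObjectDescriptor:
    model_id: str
    session_id: str
    layer_range: Tuple[int, int]    # (first, last) as written in the schema
    token_range: Tuple[int, int]    # (first, last) as written in the schema
    precision: str
    placement: Tier
    reuse_probability: float
    deadline_us: int

    def __post_init__(self):
        # Reject descriptors the placement manager could not act on.
        if self.precision not in {"fp8", "int8", "fp16"}:
            raise ValueError(f"unknown precision: {self.precision}")
        if not 0.0 <= self.reuse_probability <= 1.0:
            raise ValueError("reuse_probability must be in [0, 1]")
        if self.deadline_us <= 0:
            raise ValueError("deadline_us must be positive")
        for first, last in (self.layer_range, self.token_range):
            if first > last:
                raise ValueError("range start must not exceed range end")


desc = KVObjectDescriptor(
    model_id="llama-class-70b-or-larger",
    session_id="tenantA/request-39281",
    layer_range=(0, 79),
    token_range=(8192, 16384),
    precision="fp8",
    placement=Tier.RACK_DRAM,
    reuse_probability=0.74,
    deadline_us=150,
)
```

Validating at construction keeps malformed intent out of the placement path, where a bad deadline or probability would otherwise surface as a scheduling bug.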

Core rule: Do not make the accelerator randomly page-fault into remote memory. Move semantic blocks before the decode loop needs them.

04 — Software Stack

Control plane & runtime API

The software stack has two layers: a control plane that makes placement decisions, and a runtime API that exposes explicit object lifecycle operations to the serving framework.

Control plane

Inference scheduler → blade controller

Inference Scheduler
  ↓
KV / Object Placement Manager
  ↓
Memory Rack Allocator
  ↓
Fabric Driver / CXL / RDMA Path
  ↓
Blade Controller ASIC

Runtime API

Object lifecycle · C-style pseudocode

memrack_alloc(type, bytes, policy)
memrack_put(object_id, src, bytes)
memrack_get(object_id, dst, deadline_us)
memrack_prefetch(object_id, target_accel)
memrack_pin(object_id, duration_ms)
memrack_evict(object_id, target_tier)
memrack_release(object_id)
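To make the lifecycle concrete, here is a toy in-process mock of those calls in Python. It is a sketch for reasoning about call order, not a binding for any real library. The class name, the `unpin` and `tier_of` helpers, and the choice to return bytes from `get` instead of filling a `dst` buffer are assumptions, and deadlines and pin durations are accepted but not enforced.

```python
import itertools


class MemRackMock:
    """Toy in-process stand-in for the rack: no fabric, no timing, no real tiers."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._objects = {}

    def alloc(self, obj_type, nbytes, policy="RACK_DRAM"):
        object_id = next(self._ids)
        self._objects[object_id] = {
            "type": obj_type, "capacity": nbytes, "tier": policy,
            "data": b"", "pinned": False,
        }
        return object_id

    def put(self, object_id, src):
        obj = self._objects[object_id]
        if len(src) > obj["capacity"]:
            raise ValueError("payload exceeds allocation")
        obj["data"] = bytes(src)

    def get(self, object_id, deadline_us):
        # deadline_us is accepted for shape only; the mock does not enforce it.
        return self._objects[object_id]["data"]

    def prefetch(self, object_id, target_accel):
        # Records intent only; a real rack would start a DMA toward target_accel.
        self._objects[object_id]["prefetch_target"] = target_accel

    def pin(self, object_id, duration_ms):
        # duration_ms is ignored here; the pin lasts until unpin() is called.
        self._objects[object_id]["pinned"] = True

    def unpin(self, object_id):  # hypothetical helper, not in the API list above
        self._objects[object_id]["pinned"] = False

    def evict(self, object_id, target_tier):
        obj = self._objects[object_id]
        if obj["pinned"]:
            raise RuntimeError("cannot evict a pinned object")
        obj["tier"] = target_tier

    def tier_of(self, object_id):  # hypothetical helper for inspection
        return self._objects[object_id]["tier"]

    def release(self, object_id):
        del self._objects[object_id]


rack = MemRackMock()
oid = rack.alloc("KV_CACHE", nbytes=1024)
rack.put(oid, b"kv-block-bytes")
rack.pin(oid, duration_ms=5)
data = rack.get(oid, deadline_us=150)
rack.unpin(oid)
rack.evict(oid, "SSD_COLD")
```

The usage at the bottom walks one object through allocate, write, pin, read, and demote, which is the order a serving framework would most likely follow for a warm KV block.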

Controller ASIC responsibilities

Function	Description
Object lookup	Maps semantic object IDs to physical blade / channel / page locations.
DMA scheduling	Prioritizes transfers by decode deadline, QoS class, and fabric congestion.
Compression / quantized storage	Optional KV compression for cold or warm blocks where quality permits.
Encryption / isolation	Per-tenant keys, link encryption, secure erase, and object-level quotas.
RAS	ECC, poison forwarding, patrol scrub, page retirement, blade failover.
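The DMA scheduling row can be sketched as an earliest-deadline-first queue. This is a simplified illustration under stated assumptions: transfers are ordered by deadline, then by a numeric QoS class (lower is more urgent), then by arrival order, and fabric congestion is not modeled. The class and parameter names are hypothetical.

```python
import heapq
import itertools


class DmaQueue:
    """Earliest-deadline-first transfer queue with QoS as a tie-breaker."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # keeps ordering stable for equal keys

    def submit(self, object_id, deadline_us, qos_class=1):
        entry = (deadline_us, qos_class, next(self._arrival), object_id)
        heapq.heappush(self._heap, entry)

    def next_transfer(self):
        """Pop the most urgent transfer, or return None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]


queue = DmaQueue()
queue.submit("bulk-prompt-segment", deadline_us=150)
queue.submit("decode-kv-block", deadline_us=50)
queue.submit("premium-kv-block", deadline_us=50, qos_class=0)

order = [queue.next_transfer() for _ in range(3)]
print(order)
```

A production controller would also need to fold in congestion and starvation protection, which a pure deadline ordering does not provide.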

05 — Failure Modes

Gotchas and failure modes

Real deployments surface four categories of failure worth designing for explicitly.

Tail latency: Average remote-DRAM latency is not a sufficient metric. Decode performance is dominated by p99 behavior: fabric congestion, queueing, and scheduler misses.
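A small synthetic example shows how far the mean can sit from the tail. The latency samples below are made up for illustration (97 fast transfers and 3 congested ones), and the 150 µs deadline is borrowed from the descriptor example. The percentile uses the nearest-rank method.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100    # ceiling of p * n / 100
    return ordered[rank - 1]


# Synthetic transfer latencies in microseconds: mostly fast, a few congested.
latencies_us = [3.0] * 97 + [400.0] * 3

mean_us = sum(latencies_us) / len(latencies_us)
p50_us = percentile(latencies_us, 50)
p99_us = percentile(latencies_us, 99)

deadline_us = 150
miss_rate = sum(1 for x in latencies_us if x > deadline_us) / len(latencies_us)

print(f"mean={mean_us:.2f} us  p50={p50_us} us  p99={p99_us} us  misses={miss_rate:.0%}")
```

In this made-up data the mean is under 15 µs and the median is 3 µs, yet 3% of transfers miss the deadline and the p99 is 400 µs, so a decode step that waits on many blocks is likely to hit the slow path.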

Thundering herd: Many GPU racks may pull the same prompt/KV segments after prefill. The controller needs admission control, multicast for shared blocks, and per-tenant shaping.
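One common mitigation is to coalesce concurrent requests for the same block so that only the first one reaches the fabric. The sketch below is a single-threaded illustration of that idea. It does not implement multicast, admission control, or per-tenant shaping, and the class and method names are hypothetical.

```python
from collections import defaultdict


class FetchCoalescer:
    """Collapse concurrent requests for one block into a single fabric read."""

    def __init__(self):
        self._waiters = defaultdict(list)   # block_id -> requesters awaiting one fetch
        self.fabric_reads = 0

    def request(self, block_id, requester):
        """Register interest; return True only if this call triggers a fabric read."""
        first = block_id not in self._waiters
        self._waiters[block_id].append(requester)
        if first:
            self.fabric_reads += 1
        return first

    def complete(self, block_id):
        """Fetch finished: return every requester that should receive the block."""
        return self._waiters.pop(block_id, [])


coalescer = FetchCoalescer()
for gpu in ("gpu-0", "gpu-1", "gpu-2"):
    coalescer.request("prompt-segment-7", gpu)

receivers = coalescer.complete("prompt-segment-7")
print(coalescer.fabric_reads, receivers)
```

Three requesters produce one fabric read here. A request that arrives after `complete` starts a new fetch, so a real controller would usually pair this with a short-lived cache of shared blocks.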

Memory stranding: Pooling helps only if allocation spans enough hosts without creating hot blades. Placement must consider access temperature, bandwidth, and tenant locality.

Failover: A blade failure should degrade latency, not crash serving. Metadata should be replicated; data replication should be policy-based because KV can sometimes be recomputed.
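A policy-based replication decision could be as simple as comparing recompute cost against the stall a session can tolerate. The function below is a hypothetical sketch: the parameter names, the 0.5 reuse threshold, and the rule itself are assumptions, not a recommended policy.

```python
def should_replicate_data(recompute_ms, stall_budget_ms, reuse_probability, min_reuse=0.5):
    """Replicate only blocks that are slow to rebuild and likely to be reused."""
    too_slow_to_recompute = recompute_ms > stall_budget_ms
    worth_keeping = reuse_probability >= min_reuse
    return too_slow_to_recompute and worth_keeping


# A long prefill that would stall the session if lost: keep a second copy.
long_context = should_replicate_data(recompute_ms=900, stall_budget_ms=200, reuse_probability=0.74)

# A short prompt that can be re-prefilled within budget: skip data replication.
short_prompt = should_replicate_data(recompute_ms=40, stall_budget_ms=200, reuse_probability=0.74)

print(long_context, short_prompt)
```

Metadata is always replicated in the design above, so this rule only governs the much larger data payload.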

06 — Decision Tree

DRAM rack vs HBM-only

Both architectures are valid. The choice turns on working-set size, reuse patterns, and whether your serving stack can predict and schedule memory movement.

Use a DRAM rack when…

Long-context inference is HBM-capacity-bound.
KV-cache or embeddings are warm, reusable, and block-transfer-friendly.
Serving has many tenants and stranded memory across nodes.
The runtime can explicitly prefetch and demote objects.
Cost per GB matters more than nanosecond latency.

Stay HBM-only when…

The working set is small enough to remain hot in HBM.
The workload is dense training dominated by collectives and matmul.
The software stack cannot predict reuse or prefetch deadlines.
Remote-memory p99 latency would violate token latency SLOs.
Operational complexity is not justified by utilization gains.
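The two lists above can be read as a chain of gating questions. The sketch below encodes one possible ordering of them. The field names are hypothetical, and the cost and operational-complexity criteria are left out because they need judgment that a boolean cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    hbm_capacity_bound: bool       # is the working set too large to stay hot in HBM?
    reuse_is_predictable: bool     # can the stack predict reuse and prefetch deadlines?
    runtime_can_prefetch: bool     # can the runtime explicitly prefetch and demote?
    p99_fits_token_slo: bool       # does remote p99 latency fit the token latency SLO?


def recommend_tier(w: Workload) -> str:
    """Return "DRAM_RACK" only when every gating condition holds."""
    if not w.hbm_capacity_bound:
        return "HBM_ONLY"
    if not (w.reuse_is_predictable and w.runtime_can_prefetch):
        return "HBM_ONLY"
    if not w.p99_fits_token_slo:
        return "HBM_ONLY"
    return "DRAM_RACK"


long_context_serving = Workload(True, True, True, True)
dense_training = Workload(False, False, False, True)

print(recommend_tier(long_context_serving), recommend_tier(dense_training))
```

Any single failed condition falls back to HBM-only, which matches the document's point that the rack only pays off when the serving stack can schedule memory movement.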

07 — Research Anchors

Prior work and alignment

This architecture is aligned with several public trends: CXL fabric and resource sharing, memory pooling research, and LLM runtimes that treat KV-cache management as a first-class serving problem.

CXL 3.x

Adds advanced switching, fabric capabilities, peer-to-peer communication, and fine-grained resource sharing across compute domains. The standard's evolution directly enables disaggregated memory at rack scale.

Pond (Microsoft Research)

Demonstrates the cloud economics and performance challenges of CXL-based memory pooling in production-scale data centers, including TCO analysis for the warm tier.

vLLM / PagedAttention

Shows why KV-cache memory efficiency is central to high-throughput LLM serving. PagedAttention's block-based KV management is a natural fit for an object-aware DRAM rack.

08 — Final Architecture

The strongest design

The strongest design is not transparent remote paging. It is an object-aware DRAM rack with explicit runtime- and compiler-managed movement. HBM remains the active compute tier; local DRAM handles staging and metadata; rack DRAM becomes warm semantic memory; SSD/HBF becomes the cold durable tier.

The winning abstraction is not "remote memory." It is memory intent: the runtime tells the infrastructure what an object is, when it will be reused, how urgent it is, and where it should live next.