MAN\SH AI / Writings

AI Memory Infrastructure · 12 min read

Standalone DRAM Rack

Disaggregated Memory for Long-Context AI

A pooled DRAM tier for GPU and TPU inference clusters — large enough for KV-cache, fast enough for block-level prefetch, programmable enough to avoid remote-memory stalls.

Reference Architecture · Updated 2026
[Figure: 42U / 48U memory appliance, 64–160 TB, built from DDR5 RDIMM ×16, MRDIMM ×16, and LPDDR5X ×12 memory blades with a dual-fabric controller ASIC.]

Usable rack DRAM capacity: 64–160 TB
Target block access / DMA path: 1–5 µs
Aggregate external fabric class: 8–50 TB/s
Rack power envelope: 20–60 kW

One-page system spec

Numbers are design targets, not product claims. Validate against your actual fabric, controller, workload, and SLO requirements.

| Layer | Reference target | Why it matters |
|---|---|---|
| Rack capacity | 32–40 memory blades × 2–4 TB each = 64–160 TB usable | Large enough to absorb KV-cache, embeddings, long prompts, and multi-tenant working sets without buying more GPU nodes just for memory. |
| Blade memory | DDR5 RDIMM / MRDIMM / LPDDR5X-class packages; 8–16 memory channels per blade | Optimized for capacity and cost per GB, not HBM-class bandwidth. |
| External fabric | CXL 3.x fabric, PCIe 6-class switching, or ultra-low-latency RDMA / InfiniBand / Ethernet | CXL 3.0 adds advanced switching, fabric capabilities, peer-to-peer communication, and resource sharing; RDMA designs can expose explicit object transfer semantics. |
| Per-node bandwidth | 128–512 GB/s target into an accelerator server | Enough for scheduled KV block movement but still far below local HBM, so software placement matters. |
| Access granularity | 256 KB–1 MB object pages for KV blocks; 2–16 MB for bulk prompt segments | Amortizes fabric latency and avoids cache-line-sized remote misses. |
| Cost goal | Materially lower $/GB than HBM; higher $/GB than SSD/HBF | Designed as a warm memory tier between accelerator HBM and storage. |
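The access-granularity row rests on an amortization argument that is easy to check numerically. The sketch below assumes a deliberately simple cost model (total time = fixed fabric latency + bytes / bandwidth) and uses the pessimistic ends of the targets above (5 µs, 128 GB/s). The function name `payload_fraction` is illustrative and not part of any API.

```python
def payload_fraction(block_bytes: int, bandwidth_gb_s: float, latency_us: float) -> float:
    """Fraction of a transfer's total time spent moving payload.

    Assumed model: total = fixed latency + bytes / bandwidth.
    Queueing and protocol overhead are ignored.
    """
    transfer_us = block_bytes / (bandwidth_gb_s * 1e9) * 1e6
    return transfer_us / (latency_us + transfer_us)

# Rack capacity range from the spec: blades x TB per blade.
capacity_tb = (32 * 2, 40 * 4)

LATENCY_US = 5.0        # upper end of the 1-5 us target
BANDWIDTH_GB_S = 128.0  # lower end of the per-node bandwidth target

sizes = [
    ("64 B cache line", 64),
    ("256 KB block", 256 * 1024),
    ("1 MB block", 1024 * 1024),
    ("16 MB segment", 16 * 1024 * 1024),
]
for label, size in sizes:
    frac = payload_fraction(size, BANDWIDTH_GB_S, LATENCY_US)
    print(f"{label:>16}: {frac:7.2%} of transfer time is payload")
```

Under this model a cache-line-sized access spends almost all of its time waiting on latency, while multi-megabyte segments spend most of it moving data. Measured numbers on a real fabric will differ.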

Memory rack beside accelerator racks

The rack should not behave like a transparent, slow extension of HBM. It should behave like a programmable warm-memory appliance. GPUs and TPUs keep hot tensors in local HBM, while the runtime issues explicit block transfers for warm KV-cache, embeddings, and prompt segments.

The winning abstraction is not "remote memory" — it is memory intent: the runtime tells the infrastructure what an object is, when it will be reused, how urgent it is, and where it should live next.

[Figure: Rack-scale disaggregated memory topology. Hot compute stays local; warm memory is explicitly scheduled across the fabric. Components: GPU / TPU Rack A (HBM hot set, GPU DMA engine, runtime scheduler, prefetch manager); Accelerator Rack B plus Racks C/D/N; Fabric Plane 0 (CXL 3.x / RDMA / optical, primary path); Fabric Plane 1 (redundant path and failover, standby / multipath); Standalone DRAM Rack (64–160 TB warm memory tier, DDR5 / MRDIMM / LPDDR5X memory blades, controller ASIC with ECC); SSD / HBF cold tier (durable, slow, cheap). Edge labels: prefetch / recall, object-aware DMA, evict / cold-tier.]

What changes for ML engineers

For large-model inference, KV-cache capacity can become the limiter before compute saturates. An explicit warm tier in a DRAM rack changes three things for the serving stack.

Long context

More tokens before HBM pressure dominates

For 70B-class and larger models, KV-cache can become the limiter before raw compute. A DRAM rack gives the runtime a warm tier for older prompt segments while keeping active decode windows in HBM.
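A back-of-the-envelope calculation shows how quickly KV-cache grows with context length. This sketch assumes a hypothetical 70B-class shape with grouped-query attention (80 layers, 8 KV heads, head dimension 128). Those three values are assumptions chosen for illustration and should be replaced with the real model configuration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV-cache bytes stored per token across all layers.

    The factor of 2 covers one key vector and one value vector
    per layer per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed model shape (illustrative, not taken from the article).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128

per_token_fp16 = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
per_token_fp8 = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)

context_tokens = 128 * 1024
session_bytes = per_token_fp16 * context_tokens

print(f"fp16 KV per token: {per_token_fp16 / 1024:.0f} KiB")
print(f"fp16 KV for {context_tokens} tokens: {session_bytes / 2**30:.0f} GiB")
```

With these assumed values a single 128K-token session needs about 40 GiB of fp16 KV-cache, and half that at fp8, which is why older prompt segments are natural candidates for the warm tier.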

Batching

Higher serving density

Schedulers can place sessions near their existing KV blocks, batch requests with compatible memory locality, and reduce memory stranding across accelerator nodes.
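One way to picture locality-aware placement is a scoring rule over candidate nodes. The sketch below is a toy policy, not a production scheduler: it assumes the scheduler already knows how many KV bytes each node holds for a session and how much HBM each node has free. The function and dictionary names are hypothetical.

```python
from typing import Dict, Optional

def pick_node(resident_kv_bytes: Dict[str, int],
              free_hbm_bytes: Dict[str, int],
              needed_hbm_bytes: int) -> Optional[str]:
    """Pick the node that already holds the most KV bytes for a session.

    Only nodes with enough free HBM are considered. Ties are broken
    by the larger amount of free HBM. Returns None if nothing fits.
    """
    candidates = [node for node, free in free_hbm_bytes.items()
                  if free >= needed_hbm_bytes]
    if not candidates:
        return None
    return max(candidates,
               key=lambda n: (resident_kv_bytes.get(n, 0), free_hbm_bytes[n]))

GIB = 2**30
resident = {"node-a": 6 * GIB, "node-b": 1 * GIB}
free = {"node-a": 2 * GIB, "node-b": 30 * GIB, "node-c": 40 * GIB}

print(pick_node(resident, free, needed_hbm_bytes=1 * GIB))
print(pick_node(resident, free, needed_hbm_bytes=4 * GIB))
```

When the node with the best locality lacks headroom, the policy falls back to the next-best node, and the remaining KV blocks would be pulled from the rack rather than recomputed.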

MoE / Agents

Better memory for sparse workflows

Mixture-of-experts, retrieval-heavy agents, and rerank pipelines often have bursty memory demand. Pooled DRAM absorbs working-set spikes without dedicating HBM to idle state.

KV object descriptor

KV_OBJECT_DESCRIPTOR · YAML-like reference schema
object_type:       KV_CACHE
model_id:          llama-class-70b-or-larger
session_id:        tenantA/request-39281                          # tenant / request
layer_range:       0..79
token_range:       8192..16384
precision:         fp8 | int8 | fp16                              # one of
placement:         HBM_HOT | LOCAL_DRAM | RACK_DRAM | SSD_COLD    # one of
reuse_probability: 0.74                                           # 0.0 to 1.0
deadline_us:       150                                            # microseconds
replication:       metadata=2x, data=optional
Core rule: Do not make the accelerator randomly page-fault into remote memory. Move semantic blocks before the decode loop needs them.
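The core rule implies a feasibility check: given an object's size and `deadline_us`, how long can the runtime wait before issuing the transfer? The sketch below uses the same simple cost model as the spec targets (fixed fabric latency plus bytes / bandwidth, with an optional queueing term). The class and function names are hypothetical, and the bandwidth and latency values are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVObject:
    object_id: str
    size_bytes: int
    deadline_us: float  # time until the decode loop needs the block

def prefetch_slack_us(obj: KVObject, bandwidth_gb_s: float,
                      fabric_latency_us: float,
                      queue_delay_us: float = 0.0) -> float:
    """Time the runtime can still wait before issuing the transfer.

    A negative result means the block cannot arrive before its
    deadline under the assumed model.
    """
    transfer_us = obj.size_bytes / (bandwidth_gb_s * 1e9) * 1e6
    return obj.deadline_us - (fabric_latency_us + queue_delay_us + transfer_us)

small = KVObject("kv/block-512", 1024 * 1024, deadline_us=150.0)
large = KVObject("prompt/segment-3", 16 * 1024 * 1024, deadline_us=150.0)

for obj in (small, large):
    slack = prefetch_slack_us(obj, bandwidth_gb_s=128.0, fabric_latency_us=5.0)
    status = "ok" if slack >= 0 else "too late"
    print(f"{obj.object_id}: slack {slack:.1f} us ({status})")
```

A scheduler could issue prefetches in order of increasing slack, and treat negative slack as a signal to keep the block in a closer tier or recompute it.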

Control plane & runtime API

Two layers: a control plane that manages placement decisions, and a runtime API that exposes explicit object lifecycle operations to the serving framework.

Control plane

Inference scheduler → blade controller
Inference Scheduler → KV / Object Placement Manager → Memory Rack Allocator → Fabric Driver / CXL / RDMA Path → Blade Controller ASIC

Runtime API

Object lifecycle · C-style pseudocode
memrack_alloc(type, bytes, policy)
memrack_put(object_id, src, bytes)
memrack_get(object_id, dst, deadline_us)
memrack_prefetch(object_id, target_accel)
memrack_pin(object_id, duration_ms)
memrack_evict(object_id, target_tier)
memrack_release(object_id)
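To make the lifecycle concrete, here is a toy in-process stand-in for part of the API above. It is only a sketch for unit-testing serving code against the call sequence: there is no fabric or DMA, `MemRackMock` is a hypothetical name, and prefetch, pinning, and deadlines are left out.

```python
import itertools
from typing import Dict

class MemRackMock:
    """In-memory mock of alloc / put / get / evict / release."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._ids = itertools.count(1)
        self._objects: Dict[int, dict] = {}

    def alloc(self, obj_type: str, nbytes: int, policy: str = "RACK_DRAM") -> int:
        """Reserve capacity and return a new object ID."""
        if self.used_bytes + nbytes > self.capacity_bytes:
            raise MemoryError("rack capacity exceeded")
        object_id = next(self._ids)
        self._objects[object_id] = {
            "type": obj_type, "size": nbytes, "tier": policy, "data": b"",
        }
        self.used_bytes += nbytes
        return object_id

    def put(self, object_id: int, src: bytes) -> None:
        """Copy a payload into a previously allocated object."""
        obj = self._objects[object_id]
        if len(src) > obj["size"]:
            raise ValueError("payload larger than allocation")
        obj["data"] = bytes(src)

    def get(self, object_id: int) -> bytes:
        return self._objects[object_id]["data"]

    def tier(self, object_id: int) -> str:
        return self._objects[object_id]["tier"]

    def evict(self, object_id: int, target_tier: str) -> None:
        """Relabel the object's tier; the mock does not move any bytes."""
        self._objects[object_id]["tier"] = target_tier

    def release(self, object_id: int) -> None:
        """Drop the object and return its capacity to the pool."""
        obj = self._objects.pop(object_id)
        self.used_bytes -= obj["size"]

rack = MemRackMock(capacity_bytes=8192)
oid = rack.alloc("KV_CACHE", 4096)
rack.put(oid, b"k" * 4096)
rack.evict(oid, "SSD_COLD")
print(rack.tier(oid), rack.used_bytes)
rack.release(oid)
```

The point of the explicit calls is that the serving framework, not a page-fault handler, decides when each transition happens.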

Controller ASIC responsibilities

| Function | Description |
|---|---|
| Object lookup | Maps semantic object IDs to physical blade / channel / page locations. |
| DMA scheduling | Prioritizes transfers by decode deadline, QoS class, and fabric congestion. |
| Compression / quantized storage | Optional KV compression for cold or warm blocks where quality permits. |
| Encryption / isolation | Per-tenant keys, link encryption, secure erase, and object-level quotas. |
| RAS | ECC, poison forwarding, patrol scrub, page retirement, blade failover. |
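The DMA scheduling responsibility can be pictured as an earliest-deadline-first queue. The sketch below is a software analogy, not a description of any controller: it orders transfers by deadline and then by QoS class, and it leaves out the fabric-congestion input entirely. The class name and the convention that a lower QoS number means higher priority are assumptions.

```python
import heapq
from typing import List, Optional, Tuple

class DeadlineQueue:
    """Earliest-deadline-first transfer queue.

    Ties on deadline are broken by QoS class (lower number first),
    then by submission order.
    """

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, int, str]] = []
        self._seq = 0

    def submit(self, object_id: str, deadline_us: float, qos_class: int = 1) -> None:
        heapq.heappush(self._heap, (deadline_us, qos_class, self._seq, object_id))
        self._seq += 1

    def next_transfer(self) -> Optional[str]:
        """Pop the most urgent object ID, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]

queue = DeadlineQueue()
queue.submit("kv-b", deadline_us=300.0, qos_class=1)
queue.submit("kv-a", deadline_us=150.0, qos_class=2)
queue.submit("kv-c", deadline_us=150.0, qos_class=0)

order = []
while (oid := queue.next_transfer()) is not None:
    order.append(oid)
print(order)
```

A real controller would also need to bound starvation of long-deadline transfers and react to congestion, which this ordering alone does not address.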

Gotchas and failure modes

Real deployments surface four categories of failure worth designing for explicitly.

Tail latency: Average remote DRAM latency is not a sufficient metric. Decode performance is dominated by p99 behavior: fabric congestion, queueing, and scheduler misses.
Thundering herd: Many GPU racks may pull the same prompt/KV segments after prefill. The controller needs admission control, multicast for shared blocks, and per-tenant shaping.
Memory stranding: Pooling helps only if allocation spans enough hosts without creating hot blades. Placement must consider temperature, bandwidth, and tenant locality.
Failover: A blade failure should degrade latency, not crash serving. Metadata should be replicated; data replication should be policy-based because KV can sometimes be recomputed.
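For the thundering-herd case, the simplest mitigation to illustrate is read coalescing: fetch each shared block once per batch of requests and fan the result out. The sketch below is a software analogue of the multicast idea, with hypothetical names throughout; `fetch_block` stands in for whatever actually reads from the rack.

```python
from typing import Callable, Dict, List, Tuple

class CoalescingFetcher:
    """Serve a batch of block requests with one fabric read per distinct block."""

    def __init__(self, fetch_block: Callable[[str], bytes]) -> None:
        self._fetch_block = fetch_block
        self.fabric_reads = 0

    def serve(self, requests: List[Tuple[str, str]]) -> Dict[str, Dict[str, bytes]]:
        """requests holds (requester, block_id) pairs.

        Returns a mapping of requester -> {block_id: data}.
        """
        fetched: Dict[str, bytes] = {}
        results: Dict[str, Dict[str, bytes]] = {}
        for requester, block_id in requests:
            if block_id not in fetched:
                fetched[block_id] = self._fetch_block(block_id)
                self.fabric_reads += 1
            results.setdefault(requester, {})[block_id] = fetched[block_id]
        return results

def fake_rack_read(block_id: str) -> bytes:
    # Stand-in for a real rack read; returns a payload derived from the ID.
    return block_id.encode()

fetcher = CoalescingFetcher(fake_rack_read)
batch = [("rack-a", "prompt-7"), ("rack-b", "prompt-7"),
         ("rack-c", "prompt-7"), ("rack-a", "prompt-8")]
results = fetcher.serve(batch)
print(fetcher.fabric_reads, sorted(results))
```

Four requests turn into two fabric reads here. Admission control and per-tenant shaping would sit in front of a component like this.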

DRAM rack vs HBM-only

Both architectures are valid. The choice turns on working-set size, reuse patterns, and whether your serving stack can predict and schedule memory movement.

Use a DRAM rack when…

  • Long-context inference is HBM-capacity-bound.
  • KV-cache or embeddings are warm, reusable, and block-transfer-friendly.
  • Serving has many tenants and stranded memory across nodes.
  • The runtime can explicitly prefetch and demote objects.
  • Cost per GB matters more than nanosecond latency.

Stay HBM-only when…

  • The working set is small enough to remain hot in HBM.
  • The workload is dense training dominated by collectives and matmul.
  • The software stack cannot predict reuse or prefetch deadlines.
  • Remote-memory p99 latency would violate token latency SLOs.
  • Operational complexity is not justified by utilization gains.
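The two checklists can be condensed into a first-pass screening function. This is a rough sketch that encodes only three of the criteria (working-set fit, ability to prefetch, and p99 latency against the SLO). The field names are hypothetical, and cost and operational complexity still need human judgment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkloadProfile:
    working_set_gb: float
    hbm_capacity_gb: float
    can_prefetch: bool            # can the runtime predict reuse and deadlines?
    p99_remote_latency_us: float
    latency_budget_us: float      # remote-access budget implied by the token SLO

def recommend_architecture(p: WorkloadProfile) -> str:
    """Return "HBM_ONLY" or "DRAM_RACK" from three screening checks."""
    if p.working_set_gb <= p.hbm_capacity_gb:
        return "HBM_ONLY"   # working set stays hot in HBM
    if not p.can_prefetch:
        return "HBM_ONLY"   # stack cannot schedule memory movement
    if p.p99_remote_latency_us > p.latency_budget_us:
        return "HBM_ONLY"   # tail latency would violate the SLO
    return "DRAM_RACK"

long_context = WorkloadProfile(working_set_gb=900.0, hbm_capacity_gb=640.0,
                               can_prefetch=True, p99_remote_latency_us=40.0,
                               latency_budget_us=150.0)
print(recommend_architecture(long_context))
```

The numbers in the example profile are made up for illustration.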

Prior work and alignment

This architecture is aligned with several public trends: CXL fabric and resource sharing, memory pooling research, and LLM runtimes that treat KV-cache management as a first-class serving problem.

01
CXL 3.x
Adds advanced switching, fabric capabilities, peer-to-peer communication, and fine-grained resource sharing across compute domains. The standard's evolution directly enables disaggregated memory at rack scale.
02
Pond (Microsoft Research)
Demonstrates the cloud economics and performance challenges of CXL-based memory pooling in production-scale data centers, including TCO analysis for the warm tier.
03
vLLM / PagedAttention
Shows why KV-cache memory efficiency is central to high-throughput LLM serving. PagedAttention's block-based KV management is a natural fit for an object-aware DRAM rack.
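The fit between block-based KV management and an object-aware rack can be sketched as a mapping from token ranges to fixed-size block objects. The code below is illustrative only and does not use vLLM's API. The 16-token block size, the half-open range convention, and the object ID format are all assumptions.

```python
from typing import List

def blocks_for_token_range(start: int, end: int, block_tokens: int) -> range:
    """Block indices covering the half-open token range [start, end)."""
    if block_tokens <= 0:
        raise ValueError("block_tokens must be positive")
    if end <= start:
        return range(0)
    return range(start // block_tokens, (end - 1) // block_tokens + 1)

def object_ids(session_id: str, layer: int, start: int, end: int,
               block_tokens: int = 16) -> List[str]:
    """Hypothetical object IDs, one per (layer, block) pair."""
    return [f"{session_id}/layer{layer}/block{b}"
            for b in blocks_for_token_range(start, end, block_tokens)]

ids = object_ids("tenantA/request-39281", layer=0, start=8192, end=16384)
print(len(ids), ids[0], ids[-1])
```

Because each block is addressable on its own, the runtime can demote or prefetch part of a session's history without touching the rest.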

The strongest design

The strongest design is not transparent remote paging. It is an object-aware DRAM rack with explicit runtime- and compiler-managed movement. HBM remains the active compute tier; local DRAM handles staging and metadata; rack DRAM becomes warm semantic memory; SSD/HBF becomes the cold durable tier.
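The tier roles above can be expressed as a small placement rule driven by two of the descriptor fields, `reuse_probability` and `deadline_us`. The sketch below is one possible policy, not a recommended one: both thresholds are assumed values, and LOCAL_DRAM is omitted because the text assigns it to staging and metadata rather than long-lived placement.

```python
HOT_DEADLINE_US = 20.0       # assumed: needed sooner than a rack fetch is safe
WARM_REUSE_THRESHOLD = 0.3   # assumed: below this, reuse is too unlikely

def choose_tier(reuse_probability: float, deadline_us: float) -> str:
    """Map an object's reuse estimate and urgency to a placement tier."""
    if not 0.0 <= reuse_probability <= 1.0:
        raise ValueError("reuse_probability must be within [0, 1]")
    if deadline_us <= HOT_DEADLINE_US:
        return "HBM_HOT"      # about to be used by the decode loop
    if reuse_probability >= WARM_REUSE_THRESHOLD:
        return "RACK_DRAM"    # likely to be reused, not yet urgent
    return "SSD_COLD"         # unlikely to be reused soon

print(choose_tier(reuse_probability=0.74, deadline_us=150.0))
```

With the values from the descriptor example (0.74 and 150 µs), this rule places the object in rack DRAM.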

The winning abstraction is not "remote memory." It is memory intent: the runtime tells the infrastructure what an object is, when it will be reused, how urgent it is, and where it should live next.