A pooled DRAM tier for GPU and TPU inference clusters — large enough for KV-cache, fast enough for block-level prefetch, programmable enough to avoid remote-memory stalls.
Numbers are design targets, not product claims. Validate against your actual fabric, controller, workload, and SLO requirements.
| Layer | Reference target | Why it matters |
|---|---|---|
| Rack capacity | 32–40 memory blades × 2–4 TB each = 64–160 TB usable | Large enough to absorb KV-cache, embeddings, long prompts, and multi-tenant working sets without buying more GPU nodes just for memory. |
| Blade memory | DDR5 RDIMM / MRDIMM / LPDDR5X-class packages; 8–16 memory channels per blade | Optimized for capacity and cost per GB, not HBM-class bandwidth. |
| External fabric | CXL 3.x fabric, PCIe 6-class switching, or ultra-low-latency RDMA / InfiniBand / Ethernet | CXL 3.0 adds advanced switching, fabric capabilities, peer-to-peer communication, and resource sharing; RDMA designs can expose explicit object transfer semantics. |
| Per-node bandwidth | 128–512 GB/s target into an accelerator server | Enough for scheduled KV block movement but still far below local HBM bandwidth, so software placement matters. |
| Access granularity | 256 KB–1 MB object pages for KV blocks; 2–16 MB for bulk prompt segments | Amortizes fabric latency and avoids cache-line-sized remote misses. |
| Cost goal | Materially lower $/GB than HBM; higher $/GB than SSD/HBF | Designed as a warm memory tier between accelerator HBM and storage. |
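The capacity and granularity rows can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 2 µs fabric round-trip latency is an assumed placeholder rather than a figure from the table. It shows why object-page transfers amortize fabric latency while cache-line-sized accesses do not.

```python
# Back-of-envelope model. All helper names are hypothetical, and
# ASSUMED_FABRIC_LATENCY_US is a placeholder; measure your own fabric.
ASSUMED_FABRIC_LATENCY_US = 2.0


def rack_capacity_tb(blades: int, tb_per_blade: float) -> float:
    """Pooled capacity of one rack in TB (blades x capacity per blade)."""
    return blades * tb_per_blade


def transfer_time_us(block_bytes: int, bandwidth_gb_s: float,
                     fabric_latency_us: float) -> float:
    """Fixed fabric latency plus serialization time for one block."""
    serialization_us = block_bytes / (bandwidth_gb_s * 1e9) * 1e6
    return fabric_latency_us + serialization_us


def latency_share(block_bytes: int, bandwidth_gb_s: float,
                  fabric_latency_us: float) -> float:
    """Fraction of the transfer time spent waiting on fixed latency."""
    total = transfer_time_us(block_bytes, bandwidth_gb_s, fabric_latency_us)
    return fabric_latency_us / total


# Low and high ends of the rack capacity range from the table.
print(rack_capacity_tb(32, 2), rack_capacity_tb(40, 4))

# Compare a 64-byte cache line with 256 KB and 1 MB object pages at 128 GB/s.
for size in (64, 256 * 1024, 1024 * 1024):
    share = latency_share(size, 128, ASSUMED_FABRIC_LATENCY_US)
    print(f"{size:>8} bytes: {share:.1%} of transfer time is fixed latency")
```

Under these assumptions a cache-line access is almost entirely latency, while a 1 MB page spends most of its time moving data. The crossover point shifts with the real fabric latency, so treat the output as a shape, not a prediction.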
The rack should not behave like a transparent, slow extension of HBM. It should behave like a programmable warm-memory appliance. GPUs and TPUs keep hot tensors in local HBM, while the runtime issues explicit block transfers for warm KV-cache, embeddings, and prompt segments.
The winning abstraction is not "remote memory" — it is memory intent: the runtime tells the infrastructure what an object is, when it will be reused, how urgent it is, and where it should live next.
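One way to picture memory intent is as a small placement function that turns the runtime's hints into a target tier. The sketch below is a hedged illustration, not a recommended policy: `choose_tier` is a hypothetical name, and both thresholds are assumptions chosen only to make the example concrete. The tier names follow the placement values used in the sample object descriptor.

```python
# Hypothetical policy sketch. Both thresholds are illustrative assumptions.
ASSUMED_COLD_REUSE_THRESHOLD = 0.10   # below this, treat the object as cold
ASSUMED_TIGHT_DEADLINE_US = 50        # below this, stage in node-local DRAM


def choose_tier(in_decode_window: bool, reuse_probability: float,
                deadline_us: int) -> str:
    """Map runtime hints to the tier where an object should live next."""
    if in_decode_window:
        # Actively decoded state stays in accelerator HBM.
        return "HBM_HOT"
    if reuse_probability < ASSUMED_COLD_REUSE_THRESHOLD:
        # Unlikely to be reused soon: push to the cold tier.
        return "SSD_COLD"
    if deadline_us < ASSUMED_TIGHT_DEADLINE_US:
        # Likely reuse with a tight deadline: keep it one hop away.
        return "LOCAL_DRAM"
    # Likely reuse with a relaxed deadline: the pooled warm tier.
    return "RACK_DRAM"


print(choose_tier(False, 0.74, 150))
```

A production policy would also weigh capacity pressure, tenant quotas, and fabric congestion. The point of the sketch is only that the decision is made from declared intent rather than from observed page faults.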
For 70B-class and larger models, KV-cache can become the limiter before raw compute. A DRAM rack gives the runtime a warm tier for older prompt segments while keeping active decode windows in HBM.
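To make the capacity pressure concrete, the following sketch estimates KV-cache size from model shape. The 80-layer depth matches the `layer_range: 0..79` in the sample object descriptor; the 8 KV heads and head dimension of 128 are assumptions typical of grouped-query-attention models in this class, so substitute your own model's values. The function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV-cache bytes per token: one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_bytes_for_tokens(tokens: int, per_token_bytes: int) -> int:
    """KV-cache bytes for a contiguous run of tokens."""
    return tokens * per_token_bytes


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_elem=2)
print(per_token / 1024, "KiB per token")

# A 16,384-token session, and the older 8,192-token half that could
# move to the warm tier while the active window stays in HBM.
full = kv_bytes_for_tokens(16384, per_token)
warm = kv_bytes_for_tokens(8192, per_token)
print(full / 2**30, "GiB total;", warm / 2**30, "GiB eligible for rack DRAM")
```

With these assumed values a single 16,384-token session holds 5 GiB of fp16 KV state, half of which belongs to the older prompt segment. Storing that segment in fp8 would halve its footprint again.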
Schedulers can place sessions near their existing KV blocks, batch requests with compatible memory locality, and reduce memory stranding across accelerator nodes.
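A minimal sketch of locality-aware placement, assuming the scheduler can see how many bytes of a session's KV blocks are already resident near each accelerator node and how many decode slots each node has free. Both inputs and the `pick_node` name are hypothetical.

```python
def pick_node(resident_kv_bytes: dict[str, int],
              free_slots: dict[str, int]) -> str:
    """Choose the node with the most resident KV bytes for this session.

    Nodes without a free decode slot are skipped. Ties are broken in
    favor of the node with more free slots.
    """
    candidates = [node for node, slots in free_slots.items() if slots > 0]
    if not candidates:
        raise RuntimeError("no accelerator node has a free decode slot")
    return max(
        candidates,
        key=lambda node: (resident_kv_bytes.get(node, 0), free_slots[node]),
    )


# Node "b" holds the most KV state but is full, so the session lands on "a".
print(pick_node({"a": 100, "b": 500}, {"a": 2, "b": 0, "c": 4}))
```

A real scheduler would also batch sessions with compatible locality and weigh the cost of migrating blocks against the cost of waiting for a slot. This sketch covers only the single-session placement step.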
Mixture-of-experts, retrieval-heavy agents, and rerank pipelines often have bursty memory demand. Pooled DRAM absorbs working-set spikes without dedicating HBM to idle state.
    object_type: KV_CACHE
    model_id: llama-class-70b-or-larger
    session_id: tenantA/request-39281
    layer_range: 0..79
    token_range: 8192..16384
    precision: fp8 | int8 | fp16
    placement: HBM_HOT | LOCAL_DRAM | RACK_DRAM | SSD_COLD
    reuse_probability: 0.74
    deadline_us: 150
    replication: metadata=2x, data=optional
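The descriptor above maps naturally onto a typed record. The sketch below is one possible encoding, not a defined schema: the class names, the tuple representation of ranges, and the validation rules are all assumptions, and the source does not say whether range bounds are inclusive.

```python
from dataclasses import dataclass
from enum import Enum


class Precision(Enum):
    FP8 = "fp8"
    INT8 = "int8"
    FP16 = "fp16"


class Placement(Enum):
    HBM_HOT = "HBM_HOT"
    LOCAL_DRAM = "LOCAL_DRAM"
    RACK_DRAM = "RACK_DRAM"
    SSD_COLD = "SSD_COLD"


@dataclass(frozen=True)
class ObjectDescriptor:
    """Hypothetical typed form of the memory-intent descriptor."""
    object_type: str
    model_id: str
    session_id: str
    layer_range: tuple[int, int]   # (start, end); inclusivity unspecified
    token_range: tuple[int, int]   # (start, end); inclusivity unspecified
    precision: Precision
    placement: Placement
    reuse_probability: float
    deadline_us: int
    metadata_replicas: int = 2     # "metadata=2x" in the descriptor
    replicate_data: bool = False   # "data=optional" in the descriptor

    def __post_init__(self) -> None:
        # Reject values that cannot describe a real object.
        if not 0.0 <= self.reuse_probability <= 1.0:
            raise ValueError("reuse_probability must be within [0, 1]")
        if self.deadline_us <= 0:
            raise ValueError("deadline_us must be positive")
        for name in ("layer_range", "token_range"):
            start, end = getattr(self, name)
            if start < 0 or end < start:
                raise ValueError(f"{name} must satisfy 0 <= start <= end")


descriptor = ObjectDescriptor(
    object_type="KV_CACHE",
    model_id="llama-class-70b-or-larger",
    session_id="tenantA/request-39281",
    layer_range=(0, 79),
    token_range=(8192, 16384),
    precision=Precision.FP8,
    placement=Placement.RACK_DRAM,
    reuse_probability=0.74,
    deadline_us=150,
)
print(descriptor.placement.value, descriptor.deadline_us)
```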
Two layers: a control plane that manages placement decisions, and a runtime API that exposes explicit object lifecycle operations to the serving framework.
    Inference Scheduler
      ↓
    KV / Object Placement Manager
      ↓
    Memory Rack Allocator
      ↓
    Fabric Driver / CXL / RDMA Path
      ↓
    Blade Controller ASIC
    memrack_alloc(type, bytes, policy)
    memrack_put(object_id, src, bytes)
    memrack_get(object_id, dst, deadline_us)
    memrack_prefetch(object_id, target_accel)
    memrack_pin(object_id, duration_ms)
    memrack_evict(object_id, target_tier)
    memrack_release(object_id)
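The listing above gives call names and parameters only. To show how a serving framework might drive the object lifecycle, here is an in-memory mock in Python. Everything beyond the call names and parameter lists is an assumption: the `MockMemRack` class, the return values, the rule that a pinned object refuses eviction, and the fact that deadlines and prefetch targets are merely recorded rather than acted on.

```python
import itertools
import time


class MockMemRack:
    """In-process stand-in for the memrack_* calls; no fabric involved."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._objects: dict[int, dict] = {}

    def alloc(self, obj_type: str, nbytes: int, policy: str) -> int:
        object_id = next(self._ids)
        self._objects[object_id] = {
            "type": obj_type, "policy": policy, "data": bytearray(nbytes),
            "tier": "RACK_DRAM", "pinned_until": 0.0, "prefetch_target": None,
        }
        return object_id

    def put(self, object_id: int, src: bytes, nbytes: int) -> None:
        data = self._objects[object_id]["data"]
        if nbytes > len(src) or nbytes > len(data):
            raise ValueError("nbytes exceeds source or allocation size")
        data[:nbytes] = src[:nbytes]

    def get(self, object_id: int, dst: bytearray, deadline_us: int) -> int:
        # The mock copies synchronously and does not enforce deadline_us.
        data = self._objects[object_id]["data"]
        n = min(len(dst), len(data))
        dst[:n] = data[:n]
        return n

    def prefetch(self, object_id: int, target_accel: str) -> None:
        # Recorded as a hint only; a real rack would start a transfer.
        self._objects[object_id]["prefetch_target"] = target_accel

    def pin(self, object_id: int, duration_ms: int) -> None:
        expiry = time.monotonic() + duration_ms / 1000
        self._objects[object_id]["pinned_until"] = expiry

    def evict(self, object_id: int, target_tier: str) -> bool:
        obj = self._objects[object_id]
        if time.monotonic() < obj["pinned_until"]:
            return False  # assumed semantics: pinned objects stay put
        obj["tier"] = target_tier
        return True

    def release(self, object_id: int) -> None:
        del self._objects[object_id]


rack = MockMemRack()
oid = rack.alloc("KV_CACHE", 8, "warm")
rack.put(oid, b"kvblock!", 8)
buf = bytearray(8)
rack.get(oid, buf, deadline_us=150)
print(bytes(buf))
rack.release(oid)
```

A mock like this is mainly useful for testing scheduler and placement logic before a real fabric path exists. It says nothing about the latency or failure behavior of an actual blade.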
| Function | Description |
|---|---|
| Object lookup | Maps semantic object IDs to physical blade / channel / page locations. |
| DMA scheduling | Prioritizes transfers by decode deadline, QoS class, and fabric congestion. |
| Compression / quantized storage | Optional KV compression for cold or warm blocks where quality permits. |
| Encryption / isolation | Per-tenant keys, link encryption, secure erase, and object-level quotas. |
| RAS | ECC, poison forwarding, patrol scrub, page retirement, blade failover. |
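The DMA scheduling row can be sketched as a priority queue ordered by QoS class and then by earliest deadline, with a per-round byte budget standing in for fabric congestion. This is a simplified illustration rather than controller firmware: the `TransferQueue` name, the convention that a lower QoS number is more urgent, and the byte-budget model of congestion are all assumptions.

```python
import heapq
import itertools


class TransferQueue:
    """Orders pending transfers by (QoS class, deadline, arrival order)."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, int, str, int]] = []
        self._seq = itertools.count()  # tie-breaker keeps ordering stable

    def submit(self, object_id: str, nbytes: int, deadline_us: int,
               qos_class: int) -> None:
        # Assumed convention: lower qos_class means higher priority.
        entry = (qos_class, deadline_us, next(self._seq), object_id, nbytes)
        heapq.heappush(self._heap, entry)

    def dispatch(self, byte_budget: int) -> list[str]:
        """Issue transfers in priority order until the budget runs out.

        Dispatch stops at the first transfer that does not fit, so a
        lower-priority transfer never overtakes a blocked urgent one.
        """
        issued = []
        while self._heap and self._heap[0][4] <= byte_budget:
            _, _, _, object_id, nbytes = heapq.heappop(self._heap)
            byte_budget -= nbytes
            issued.append(object_id)
        return issued


queue = TransferQueue()
queue.submit("a", 100, deadline_us=500, qos_class=1)
queue.submit("b", 100, deadline_us=150, qos_class=1)
queue.submit("c", 100, deadline_us=900, qos_class=0)
print(queue.dispatch(byte_budget=250))
print(queue.dispatch(byte_budget=100))
```

Strict head-of-line ordering keeps the example short but can leave budget unused. A real controller would more likely shrink the budget dynamically from observed congestion and allow small transfers to backfill.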
Real deployments surface four categories of failure worth designing for explicitly.
Both architectures are valid. The choice turns on working-set size, reuse patterns, and whether your serving stack can predict and schedule memory movement.
This architecture is aligned with several public trends: CXL fabric and resource sharing, memory pooling research, and LLM runtimes that treat KV-cache management as a first-class serving problem.
The strongest design is not transparent remote paging. It is an object-aware DRAM rack with explicit, runtime- and compiler-managed data movement. HBM remains the active compute tier; local DRAM handles staging and metadata; rack DRAM becomes warm semantic memory; SSD/HBF becomes the cold durable tier.