Memory Systems Engineering
Part 2

Part 2: How Memory Timings Are Actually Configured in Modern Systems

22 min read · DRAM · Firmware · Linux · NUMA · CXL · Systems engineering deep-dive

You paid for DDR5-6000 CL30. Your system booted at DDR5-4800 CL40. Most engineers stop at "enable XMP" and move on. This piece explains what actually happens — across firmware, memory controllers, DDR training, Linux, and runtime orchestration — and why the gap matters more than ever in the CXL and HBM era.

01 — Foundation

The Key Mental Model

Firmware asks: "Can the memory operate safely?"
Linux asks: "How should memory be used efficiently?"

The distinction that unlocks everything else

One of the most persistent misconceptions in systems engineering is that the Linux kernel directly controls low-level DRAM timings — things like CAS latency, row activation delays, or precharge timing. These timings are configured much earlier during firmware bring-up and DRAM training, often before the kernel even has an address space.

By the time Linux boots, the DRAM controller is already fully trained and its timing registers are locked. Linux's job is entirely different: it orchestrates how that trained memory gets allocated, reclaimed, migrated, and tiered. Confusing these two roles leads to wasted tuning effort and missed optimization opportunities at both layers.

02 — Architecture

The Modern Memory Stack

Every modern system delegates memory concerns across six distinct layers. Understanding who owns what determines where to look when performance is wrong — or when you want to make it better.

DRAM Chips: Store bits · obey timing constraints
DDR PHY: Signal integrity · training
Integrated Memory Controller (IMC): Schedules reads, writes, refreshes, banks
BIOS / UEFI / AGESA / TF-A: Trains memory · programs timing registers
Linux Kernel: Allocation · NUMA · reclaim · tiering
Applications: Generate the actual memory access patterns

Key insight

Firmware and Linux don't compete over memory — they operate on completely different planes. Firmware works in nanoseconds and clock cycles; Linux works in pages, nodes, and allocation decisions.

03 — Reference

The Critical Timing Parameters

Every DRAM timing parameter encodes a physical constraint: how long the silicon needs to safely complete an operation. Violating them risks data corruption; padding them wastes performance. Here's what each one governs.

Parameter | What it governs | Tighter is better? | Notes
tCL (CAS Latency) | READ command → data availability delay | Yes | The headline timing. Most quoted in marketing.
tRCD | Row activation → column access delay | Yes | Critical for random access workloads with many row opens.
tRP | Precharge delay before a new row can be opened in the same bank | Yes | Bottlenecks random access that causes row conflicts within a bank.
tRAS | Minimum row active duration | Careful | Too low destabilizes DRAM. Usually not worth aggressive tuning.
tRFC | Refresh cycle completion time | Yes | Grows with DRAM die density, so it explodes on large DDR5 DIMMs. 128GB+ DIMMs can see 400–800ns tRFC.
tREFI | Interval between refresh cycles | Risk | Higher boosts bandwidth but raises the risk of retention errors, especially at high temperature, and leaves more time for rowhammer disturbance between refreshes.
tWR | Write recovery time | Yes | Controls the delay after a write before the row can be precharged.
Command Rate (1T/2T) | Clock cycles a command is held on the bus before it is latched | Careful | 1T improves latency but reduces stability at high frequency.

Real-world note

Absolute latency in nanoseconds is what matters for workloads, not clock cycles alone. CAS latency in nanoseconds is CL × 2000 ÷ data rate (MT/s), because the memory clock runs at half the transfer rate. DDR5-6000 CL30 works out to ~10ns; DDR5-4800 CL40 works out to ~16.7ns. That 67% gap in CAS latency is far larger than the 25% bandwidth difference, and it weighs heavily on single-threaded workloads, gaming, and latency-sensitive databases.
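
The conversion from cycles to nanoseconds is easy to script when comparing kits. The sketch below is a minimal helper, assuming a POSIX shell with awk; the function name `cas_ns` is made up for illustration. It covers only first-word CAS latency, not total load-to-use latency.

```shell
# cas_ns <data-rate-MT/s> <CL>: first-word CAS latency in nanoseconds.
# DDR transfers twice per clock, so one clock cycle lasts 2000 / (MT/s) ns.
cas_ns() {
  awk -v mts="$1" -v cl="$2" 'BEGIN { printf "%.2f\n", cl * 2000 / mts }'
}

cas_ns 6000 30   # DDR5-6000 CL30 -> 10.00
cas_ns 4800 40   # DDR5-4800 CL40 -> 16.67
```

A higher CL number is not automatically slower: what counts is the product of cycle count and cycle time.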

04 — Profiles

JEDEC vs XMP / EXPO

Every DDR5 DIMM ships with an SPD EEPROM containing JEDEC-standardized timings — conservative defaults guaranteed to work in any compliant system. XMP (Intel) and EXPO (AMD) are separately stored overclocking profiles baked in by the memory manufacturer, operating outside the JEDEC standard.

Default profile: JEDEC DDR5
Frequency: 4800 MT/s
CAS Latency: CL40
Voltage: 1.1V
Absolute latency: ~16.7ns
Stability risk: None

Against a DDR5-6000 CL30 XMP/EXPO profile (~10ns), that is a ~6.7ns difference on every DRAM access, and it compounds heavily. Redis, PostgreSQL, game engines, and physics simulations whose working sets spill out of CPU L3 into DRAM can see 15–25% throughput improvements on tight XMP configs. The main risk is stability: always run MemTest86 or TestMem5 for several passes after enabling XMP/EXPO, particularly on high-density kits.

05 — Firmware

What Firmware Actually Does

DRAM training is the most underappreciated part of the memory stack. Before Linux boots, firmware runs a calibration sequence that can take 30–90 seconds on servers with many DIMMs. Here's every major stage.

1. Power-on & SPD Read

Firmware powers the DIMM and reads the Serial Presence Detect EEPROM to understand density, speed grade, ranks, width, and which XMP/EXPO profiles are present.

2. IMC Initialization

The Integrated Memory Controller is configured with the target frequency and base timing parameters. The DDR PHY is brought up and clocks are established.

3. Write Leveling

Because the clock and command/address signals are routed in a fly-by topology, each DRAM device along the DIMM sees the clock at a slightly different phase. Write leveling finds the correct delay per rank and byte lane so that write strobes (DQS) align with the clock edge at each device.

4. Read Training & DQS Centering

Firmware sweeps the read DQS (data strobe) window, finding the optimal sample point. The goal is maximum setup/hold margin — maximizing the eye opening — for reliable reads at the target frequency.

5. Vref Calibration

Both the CA (command/address) Vref and DQ Vref are swept to find the center of the valid voltage window for reliable signaling.

6. ODT & ZQ Calibration

On-die termination is tuned to match the board's transmission line impedance and minimize reflections. ZQ calibration sets the output driver impedance, which drifts with temperature.

7. Timing Register Programming

All trained parameters are committed to the IMC's hardware registers. From this point on, the timing registers are effectively locked — Linux will inherit this configuration and cannot change the electrical parameters.

8. Handoff to Linux

Firmware maps usable memory regions into the system address space, passes the memory map (UEFI memory map or E820) and the NUMA topology (ACPI SRAT/SLIT tables) to the kernel, and transfers control. Linux never retrains the electrical link.
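
Read training and Vref calibration (stages 4 and 5) both come down to the same idea: sweep a setting, record which values pass, and park in the middle of the widest passing window. The sketch below illustrates only that centering step, assuming a POSIX shell with awk. The `eye_center` name and the pass/fail string encoding are invented for illustration and do not reflect any vendor's training code.

```shell
# eye_center <sweep>: the sweep has one character per delay tap
# ('1' = pass, '0' = fail). Prints "start width center" (zero-based tap
# indices) for the widest run of passing taps.
eye_center() {
  printf '%s\n' "$1" | awk '{
    best_len = 0; run_len = 0
    for (i = 1; i <= length($0); i++) {
      if (substr($0, i, 1) == "1") {
        if (run_len == 0) run_start = i - 1
        run_len++
        if (run_len > best_len) { best_len = run_len; best_start = run_start }
      } else {
        run_len = 0
      }
    }
    if (best_len == 0) { print "no passing window"; exit 1 }
    print best_start, best_len, best_start + int((best_len - 1) / 2)
  }'
}

eye_center 0011111000   # -> 2 5 4
```

Real firmware repeats this per byte lane and per rank, and usually in two dimensions (delay and voltage), but the margin argument is the same: the center of the window leaves the most room on both sides.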

06 — Linux

Linux Memory Interfaces

Linux does not retrain DRAM electrically — but it exposes a rich set of interfaces that let you observe memory topology, monitor reliability, and control higher-level placement behavior. Knowing where to look is half the job.

/proc interfaces

/proc/meminfo
System-wide memory summary: total, free, available, buffered, cached, swap. The canonical first stop.
/proc/zoneinfo
Per-NUMA-zone memory accounting. Shows watermarks, free pages, and memory pressure per zone.
/proc/buddyinfo
Buddy allocator order histogram. Fragmentation shows up here as a shortage of high-order free blocks; compare the counts before and after triggering compaction via vm.compact_memory.
/proc/vmstat
VM event counters — page faults, swapped in/out, THP activity, direct reclaim events, NUMA migrations.
/proc/iomem
Physical memory map including reserved regions, PCI BARs, and ACPI tables.

/sys interfaces

/sys/devices/system/node/
NUMA topology — per-node distances, CPUs, and memory stats. Essential for multi-socket and CXL-attached memory analysis.
/sys/kernel/mm/
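Transparent hugepage controls, static hugepage pools, KSM, NUMA demotion (numa/demotion_enabled), and MGLRU (lru_gen) settings. The NUMA balancing and compaction knobs live under /proc/sys/kernel/ and /proc/sys/vm/ instead.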
/sys/devices/system/edac/
Error Detection And Correction counters. ce_count (correctable) and ue_count (uncorrectable) are your DIMM health indicators.

Inspect — what are your sticks actually running at?

bash — hardware inspection
# Full DIMM details: manufacturer, speed, configured speed, size, slots
sudo dmidecode --type memory

# Decode SPD EEPROM directly: exact tCL, tRCD, tRP, tRAS from the chip
# (needs i2c-tools and an SPD EEPROM driver loaded for your DIMM type)
sudo decode-dimms

# Quick grep: configured vs rated speed (spot XMP not applied)
sudo dmidecode -t memory | grep -E "Speed:|Type:" | sort | uniq -c

# NUMA topology: nodes, inter-node distances, CPUs per node
numactl --hardware

# Memory-related kernel messages during boot (memory map, NUMA setup, EDAC errors;
# DRAM training logs normally stay in the firmware console)
sudo dmesg | grep -iE 'memory|ddr|numa|imc|edac'
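
To turn the dmidecode output into a direct answer to "is XMP/EXPO actually applied?", a small filter can compare rated and configured speed per slot. This is a sketch, assuming a POSIX shell with awk and a dmidecode version that labels the field `Configured Memory Speed` (older releases print `Configured Clock Speed`, which this filter would miss). The function name `downclocked_dimms` is made up.

```shell
# downclocked_dimms: read `dmidecode --type memory` text on stdin and print
# each slot whose configured speed is below its rated speed.
downclocked_dimms() {
  awk -F': *' '
    /^[[:space:]]*Locator:/                 { slot = $2 }
    /^[[:space:]]*Speed:/                   { rated = $2 + 0 }
    /^[[:space:]]*Configured Memory Speed:/ {
      conf = $2 + 0
      if (rated > 0 && conf > 0 && conf < rated)
        printf "%s: rated %d MT/s, running %d MT/s\n", slot, rated, conf
      rated = 0
    }
  '
}
```

Usage: `sudo dmidecode --type memory | downclocked_dimms`. No output means every populated slot is running at its rated speed; empty slots report "Unknown" and are skipped.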

ECC health monitoring

Server operators — read this

Silently elevated correctable error counts are an early warning of a degrading DIMM. An uncorrectable error is a data integrity event — investigate immediately.

bash — ECC / EDAC
# Correctable errors (single-bit, ECC fixed) — rising count = degrading DIMM
cat /sys/devices/system/edac/mc/mc0/ce_count

# Uncorrectable errors — data may be corrupt. Investigate NOW.
cat /sys/devices/system/edac/mc/mc0/ue_count

# Per-DIMM slot breakdown (mc0/dimm0, dimm1, etc.)
grep -r "" /sys/devices/system/edac/mc/mc0/ 2>/dev/null | grep "ce_count\|ue_count"

# EDAC events in the kernel log
dmesg | grep -i edac
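
For routine monitoring, the same counters can be checked in a loop across all memory controllers. The sketch below assumes the standard EDAC sysfs layout shown above; the function name `edac_report` and the optional base-directory argument (handy for dry runs against a copy of the tree) are illustrative choices, not part of any tool.

```shell
# edac_report [base-dir]: print every memory controller with a non-zero
# correctable (ce) or uncorrectable (ue) counter.
# Returns 1 if any controller reports an uncorrectable error.
edac_report() {
  base=${1:-/sys/devices/system/edac/mc}
  status=0
  for mc in "$base"/mc*; do
    [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
    ce=$(cat "$mc/ce_count")
    ue=$(cat "$mc/ue_count")
    if [ "$ce" -gt 0 ] || [ "$ue" -gt 0 ]; then
      echo "${mc##*/}: ce=$ce ue=$ue"
    fi
    if [ "$ue" -gt 0 ]; then status=1; fi
  done
  return "$status"
}
```

Run from cron or a monitoring agent, the non-zero exit status on uncorrectable errors gives a simple alert condition. Trend the correctable count over time rather than alerting on a single value.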

Live performance monitoring

bash — runtime monitoring
# Live: watch direct reclaim, major faults, NUMA migrations, THP events, swap I/O
watch -n1 'grep -E "pgscan_direct|pgmajfault|numa_pages_migrated|thp_fault_alloc|pswpin|pswpout" /proc/vmstat'

# NUMA hit/miss counters per node: a growing numa_miss means allocations are landing off-node
numastat

# Per-node memory usage in /proc/meminfo style
numastat -m

# Cache miss rate (system-wide, 5s sample). DRAM bandwidth itself needs uncore/IMC counters.
sudo perf stat -e cache-misses,cache-references -a sleep 5

# System-wide memory access sampling (requires perf with MEM support);
# drop -a and name a command instead to profile a single process
sudo perf mem record -a sleep 10
sudo perf mem report

07 — Controller

The Integrated Memory Controller

Linux does not speak directly to DRAM banks. Every read and write from the CPU goes through the IMC, which translates physical addresses into DRAM row/bank/column coordinates, schedules accesses across channels, handles refresh timing, and arbitrates between competing reads and writes.

Firmware programs the IMC's timing registers once during training. Linux then controls the higher-level decisions:

NUMA Placement

Which NUMA node receives allocations. Local-node placement minimizes access latency for the CPUs on that node.

Huge Pages

2MB and 1GB pages reduce TLB pressure dramatically for large working-set workloads. Transparent Huge Pages automate this.

Page Migration

NUMA balancing migrates pages to the node where they're most accessed. Critical on multi-socket servers.

MGLRU Reclaim

Multi-Gen LRU tracks page access generations rather than strict access order, leading to much smarter reclaim decisions.

CXL Tiering

Promotes hot pages to local DRAM; demotes cold pages to slower CXL-attached memory, expanding effective capacity.

Memory Pressure

Watermark-based reclaim, kswapd, and direct reclaim cooperate to free memory under pressure without stalling processes.

08 — Modern Linux

MGLRU and CXL Tiering

The most impactful changes in Linux memory management over the last three kernel generations are not about timing registers — they're about intelligent orchestration of a multi-tier memory hierarchy.

MGLRU (Multi-Generational LRU, mainlined in Linux 6.1) replaces the old active/inactive LRU split with a generational tracking model. Rather than a binary "recently accessed or not," MGLRU maintains a sliding window of access generations that more accurately reflects true workload working sets. The result is significantly reduced refault rates and better behavior under memory pressure — particularly for mixed workloads like databases with large buffer caches running alongside application servers.

CXL memory tiering changes the game for large-memory infrastructure. Compute Express Link lets systems attach terabytes of DRAM-like memory over the PCIe physical layer at lower cost-per-bit than local DIMMs, but with higher latency (typically 2–4× vs local DDR5). Linux exposes CXL memory as CPU-less NUMA nodes and tiers across them inside the kernel: reclaim demotes cold pages to CXL memory, and NUMA balancing in memory-tiering mode promotes frequently accessed pages back to local DRAM. For AI inference, vector databases, and in-memory caches, this can expand usable working memory by 4–8× without proportional cost.

Hot page promotion

Frequently accessed pages on the slow tier are migrated to local DRAM. The kernel detects them through NUMA hinting faults when NUMA balancing runs in memory-tiering mode (kernel.numa_balancing=2).

Cold page demotion

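Pages that reclaim identifies as cold are demoted from local DRAM to a slower memory tier such as CXL-attached memory instead of being discarded or swapped out. Demotion runs as part of reclaim when /sys/kernel/mm/numa/demotion_enabled is set, with targets chosen by the memory_tier infrastructure.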
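
Whether promotion and demotion are actually active on a given box can be read from a handful of files. The sketch below collects them in one place. The paths are the ones documented for recent kernels (demotion switch, NUMA balancing mode, memory tier node lists), but availability depends on kernel version and config. The function name `tiering_status` and its optional root-prefix argument are illustrative.

```shell
# tiering_status [root-prefix]: summarize the kernel's memory tiering state.
# The prefix is empty on a live system; it exists so the function can be
# pointed at a copied /sys and /proc tree.
tiering_status() {
  root=${1:-}
  f="$root/sys/kernel/mm/numa/demotion_enabled"
  if [ -r "$f" ]; then echo "demotion_enabled=$(cat "$f")"; fi
  f="$root/proc/sys/kernel/numa_balancing"
  # 0 = off, 1 = classic NUMA balancing, 2 = memory-tiering (promotion) mode
  if [ -r "$f" ]; then echo "numa_balancing=$(cat "$f")"; fi
  for tier in "$root"/sys/devices/virtual/memory_tiering/memory_tier*; do
    if [ -r "$tier/nodelist" ]; then
      echo "${tier##*/}: nodes $(cat "$tier/nodelist")"
    fi
  done
  return 0
}
```

On a machine with CXL memory you would expect at least two memory_tier entries, with the CXL node in the higher-numbered (slower) tier. A single tier means the kernel sees all memory as equivalent and will not demote.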

Enabling and checking MGLRU

bash — MGLRU (Linux 6.1+)
# Check if MGLRU is active: 0x0007 means all components enabled, 0x0000 means off
cat /sys/kernel/mm/lru_gen/enabled

# Enable MGLRU if not already on
echo y | sudo tee /sys/kernel/mm/lru_gen/enabled

# Make it persistent across reboots. This is a sysfs file, not a sysctl,
# so use a systemd-tmpfiles rule (or a kernel built with CONFIG_LRU_GEN_ENABLED=y)
echo 'w /sys/kernel/mm/lru_gen/enabled - - - - y' | sudo tee /etc/tmpfiles.d/mglru.conf

# View current generation stats per memcg and NUMA node (debugfs, root only)
sudo cat /sys/kernel/debug/lru_gen

09 — Configuration

Performance Modes in BIOS

Beyond XMP/EXPO, server and workstation firmware expose additional memory configuration modes that significantly impact Linux memory behavior.

Mode | What it does | Best for
XMP / EXPO | Enables manufacturer-tuned overclocking profile: higher frequency, tighter sub-timings, higher voltage. | Gaming, workstations, latency-sensitive Linux workloads
Performance Mode | Aggressive IMC tuning: tighter command scheduling, 1T command rate where stable. | Single-socket high-frequency workloads
Memory Interleaving | Distributes physical addresses across channels/DIMMs to maximize bandwidth utilization. | Sequential bandwidth workloads: HPC, video encoding, ML training
Gear Down / Gear 2 | Gear Down Mode samples command/address signals at half the clock rate; Intel Gear 2 runs the memory controller at half the DRAM clock. Both trade some latency for stability at high data rates. | DDR5-6400+ where Gear 1 is unstable
ECC Mode | Enables error detection and correction using extra DRAM bits. Slight bandwidth cost. | Server workloads, embedded, any production environment
Patrol Scrub | Background ECC scrubbing proactively reads and rewrites every memory cell on a schedule. | Long-running server environments with large DIMMs

10 — Runtime Tuning

Runtime Performance Knobs

These are the Linux-side interventions that actually move the needle on memory performance — all tunable without a reboot, all reversible. Start here before touching BIOS settings.

Swappiness — how aggressively Linux reclaims RAM

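Swappiness sets how willing the kernel is to swap out anonymous memory rather than drop page cache when it needs to reclaim. At the default of 60, Linux is comfortable moving memory to swap even when RAM isn't critically full. For latency-sensitive workloads and in-memory databases, dropping it close to zero keeps application data in RAM longer.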

bash — swappiness
# Check current value (default: 60)
cat /proc/sys/vm/swappiness

# Set immediately (no reboot needed)
# General servers / workstations: 10
# Databases / latency-critical: 1
sysctl -w vm.swappiness=1

# Make it permanent
echo "vm.swappiness=1" >> /etc/sysctl.d/99-memory-perf.conf
sysctl -p /etc/sysctl.d/99-memory-perf.conf

Hugepages — eliminating TLB pressure

Standard 4KB pages create enormous TLB pressure for large working-set workloads. 2MB hugepages reduce TLB misses dramatically. There are two approaches: Transparent Huge Pages (automatic, kernel-managed) and static hugepages (pre-allocated, guaranteed to apps that request them via mmap with MAP_HUGETLB or a hugetlbfs mount). Databases typically prefer THP's madvise mode so they can opt in per-allocation; always mode can cause latency spikes during compaction.

bash — hugepages
# Check current hugepage state
grep -i huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# THP mode options: always | madvise | never
# 'always'  — kernel decides everywhere (good default for most workloads)
# 'madvise' — only where app calls madvise(MADV_HUGEPAGE) — best for DBs
# 'never'   — disabled (for real-time/latency-deterministic workloads)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# Pre-allocate static 2MB hugepages (512 × 2MB = 1GB reserved)
echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Persistent static hugepages via sysctl
echo "vm.nr_hugepages=512" >> /etc/sysctl.d/99-memory-perf.conf
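
The TLB argument is easiest to see as arithmetic: the number of translations needed to cover a working set shrinks by 512× going from 4KB to 2MB pages. The sketch below just does that division, assuming a shell with 64-bit integer arithmetic; the function name `pages_for` is made up. A CPU's TLB holds on the order of a few thousand entries, though the exact size varies by microarchitecture.

```shell
# pages_for <GiB>: pages needed to map a working set of that size
# at each x86-64 page size (4 KiB, 2 MiB, 1 GiB).
pages_for() {
  bytes=$(( $1 * 1024 * 1024 * 1024 ))
  echo "4K=$((bytes / 4096)) 2M=$((bytes / 2097152)) 1G=$((bytes / 1073741824))"
}

pages_for 64   # -> 4K=16777216 2M=32768 1G=64
```

A 64GB buffer pool needs about 16.8 million 4KB translations but only 64 at 1GB, which is why large static hugepages matter most for processes with huge, long-lived mappings.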

NUMA pinning — the highest-ROI tuning on multi-socket

Cross-NUMA-node memory accesses add 50–150ns of latency per access on typical dual-socket systems. Pinning a process to a node so its memory and CPUs are co-located is often the single biggest performance win available — and it costs nothing. Auto NUMA balancing helps but adds jitter; explicit pinning is better for latency-critical processes.

bash — NUMA pinning
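# Check NUMA hit/miss counters: a growing numa_miss or other_node means off-node allocations, so pin your process
numastat

# Check auto NUMA balancing state (1 = on)
cat /proc/sys/kernel/numa_balancing

# Launch a process pinned to NUMA node 0 (CPUs + memory local)
numactl --cpunodebind=0 --membind=0 ./myapp

# Pin an already-running process (PID) to node 0's CPUs.
# This changes CPU affinity only; pages already allocated stay where they are
# (move them with: migratepages <pid> <from-nodes> <to-nodes>)
taskset -pc "$(cat /sys/devices/system/node/node0/cpulist)" "$(pidof -s myapp)"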

# Interleave memory across all nodes (good for bandwidth-bound workloads)
numactl --interleave=all ./myapp
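
To track placement quality as a single number, the per-node counters can be folded into a miss percentage. This sketch assumes the `numa_hit` / `numa_miss` lines found in /sys/devices/system/node/node*/numastat; the function name `numa_miss_pct` is made up. These counters track where allocations landed, not individual memory accesses, so treat the result as a placement indicator, not a latency measurement.

```shell
# numa_miss_pct: read numastat-style "name value" lines on stdin and print
# the share of allocations that missed their preferred node, in percent.
numa_miss_pct() {
  awk '
    $1 == "numa_hit"  { hit  += $2 }
    $1 == "numa_miss" { miss += $2 }
    END {
      total = hit + miss
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * miss / total
    }
  '
}
```

Usage: `cat /sys/devices/system/node/node*/numastat | numa_miss_pct`. The counters are cumulative since boot, so sample twice and compare if you want the rate under the current workload.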
Priority order for performance gains

In the author's experience across production Linux systems: NUMA pinning wins most often on multi-socket → hugepages next on large working-set apps → swappiness for anything memory-pressure-sensitive → MGLRU → then BIOS XMP/EXPO. Most engineers check BIOS first and Linux last. It should be the other way around.

12 — Inference Deep Dive

Memory Settings for Faster Inference

LLM inference is almost entirely memory-bound, not compute-bound. During the decode phase (generating each token), the GPU must stream all of the model's weights from HBM on every single forward pass. At 70B parameters in BF16 (2 bytes each), that's 140GB of memory traffic per token at batch size 1. Every setting below targets one of three root causes: HBM bandwidth, CPU↔GPU transfer latency, or KV cache memory pressure.

During decode, bandwidth sets a hard upper bound. A single H100 SXM (3.35 TB/s HBM) can stream 140GB of BF16 weights only about 24 times per second, so a 70B model tops out at roughly 24 tokens/sec per sequence at batch size 1. Those 140GB also exceed one H100's 80GB of HBM, so in practice the model is sharded across GPUs or quantized. Batching and quantization raise aggregate throughput, but every bottleneck outside HBM (PCIe transfers, NUMA misses, CPU stalls, swap) directly subtracts from that ceiling.

Memory bandwidth is the hard ceiling. Everything else is friction.
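
The ceiling can be estimated by dividing memory bandwidth by the bytes of weights read per token. The sketch below does that for a single sequence at batch size 1, ignoring KV cache traffic and assuming the weights fit in the memory being measured. The function name `decode_ceiling` is made up. Batching amortizes each weight read across many sequences, so aggregate throughput can sit well above this per-sequence figure.

```shell
# decode_ceiling <bandwidth-TB/s> <params-in-billions> <bytes-per-param>:
# upper bound on decode tokens/sec for one sequence, weights-only traffic.
decode_ceiling() {
  awk -v bw="$1" -v p="$2" -v b="$3" \
    'BEGIN { printf "%.1f\n", (bw * 1000) / (p * b) }'
}

decode_ceiling 3.35 70 2     # BF16 -> 23.9
decode_ceiling 3.35 70 1     # FP8  -> 47.9
decode_ceiling 3.35 70 0.5   # INT4 -> 95.7
```

Halving the bytes per parameter doubles the ceiling, which is why quantization shows up as the biggest single lever for bandwidth-bound decode.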

Layer 1: BIOS / Firmware — set once, high impact

These settings are configured before Linux boots. Most are in BIOS under Advanced → CPU Configuration or Advanced → Memory Configuration. On managed servers (Dell iDRAC, HPE iLO, Supermicro IPMI), they're also accessible via IPMI/Redfish.

Setting | Recommended value | Why it matters for inference
CPU Power Mode | Maximum Performance | Keeps all cores at max frequency. Avoids the latency spike when a sleeping core wakes to serve a CUDA callback or host-side dispatch.
C-States | Limit to C1/C1E | Deep C-states add wakeup latency in the tens to hundreds of microseconds. A GPU kernel waiting on a CPU response stalls the entire decode step. For inference: disable C6/C7, keep C1E at most.
P-States / EIST | Disable or set fixed max | Frequency scaling adds jitter to PCIe DMA dispatch latency. Pin to max P-state for deterministic token latency (TTFT and inter-token latency).
NUMA (Node Interleaving) | Disable interleaving | Keep NUMA topology intact. Interleaving defeats NUMA-aware GPU placement. With interleaving off, the ACPI SRAT table lets Linux identify which CPU RAM is local to which GPU, which is critical for vLLM's --numa-bind.
Memory Frequency | Max rated / XMP | CPU DRAM bandwidth matters for weight and KV cache transfers when offloading to CPU (DeepSpeed ZeRO, FlexGen). Higher CPU DRAM BW = faster offload throughput.
Memory Interleaving (channel) | Enable | Interleave across DDR channels (not NUMA nodes) to maximize CPU DRAM bandwidth for prefill tokenization, host-side KV management, and weight staging.
PCIe Speed | Gen5 x16 (force max) | PCIe Gen5 x16 ≈ 64 GB/s per direction (128 GB/s bidirectional); Gen4 x16 is half that. On multi-GPU without NVLink, all tensor-parallel communication crosses PCIe. Halving it halves TP throughput.
PCIe ACS (Access Control Services) | Disable | With ACS redirect enabled, PCIe peer-to-peer traffic is routed through the root complex, adding latency. Disabling it allows GPU↔GPU direct DMA (GPUDirect P2P) without CPU involvement, at the cost of weaker device isolation.
SR-IOV | Disable unless needed | Adds IOMMU overhead to every DMA transaction. Inference on bare metal: disable. Only enable if sharing GPUs across VMs (MIG + SR-IOV).
IOMMU | Passthrough or Disable | Full IOMMU translation adds ~100–300ns to every GPU DMA. Use iommu=pt (passthrough) in GRUB: it keeps the IOMMU active for stability but bypasses translation for DMA-capable devices.
Hyperthreading | Workload-dependent | For GPU-centric inference: disable. HT causes L3 cache thrashing between the Python inference process and OS threads, adding CPU-side jitter. For CPU inference (llama.cpp): enable and benchmark.
Patrol Scrub | Reduce frequency | Scrubbing competes with normal memory traffic. On large DDR5 DIMMs during latency-sensitive inference, reduce patrol scrub frequency or schedule it during off-peak windows.
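
The PCIe figures follow from the per-lane signalling rate and the 128b/130b line encoding used by Gen3 through Gen5. The sketch below computes the theoretical one-direction bandwidth; real transfers lose a further share to packet overhead. The function name `pcie_gbs` is made up.

```shell
# pcie_gbs <generation 3|4|5> <lanes>: theoretical bandwidth in GB/s per
# direction. Gen3/4/5 run at 8/16/32 GT/s per lane with 128b/130b encoding.
pcie_gbs() {
  case $1 in
    3) gts=8 ;;
    4) gts=16 ;;
    5) gts=32 ;;
    *) echo "unsupported generation: $1" >&2; return 1 ;;
  esac
  awk -v gts="$gts" -v lanes="$2" \
    'BEGIN { printf "%.1f\n", gts * lanes * 128 / 130 / 8 }'
}

pcie_gbs 5 16   # -> 63.0
pcie_gbs 4 16   # -> 31.5
```

A GPU that trained at x8 instead of x16, or at Gen4 instead of Gen5, loses half of this before any software runs, which is why the link check belongs in every deployment checklist.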

Layer 2: Linux kernel — boot parameters

These go in /etc/default/grub under GRUB_CMDLINE_LINUX, then update-grub and reboot. Each parameter is targeted at eliminating a specific source of inference latency jitter.

/etc/default/grub — inference server kernel parameters
# Recommended GRUB_CMDLINE_LINUX for an inference server
# Apply: update-grub && reboot
# The value must end up as a single line; the trailing backslashes join the lines.
# Adjust the 8-63 core range to your own topology.

GRUB_CMDLINE_LINUX="\
intel_iommu=on iommu=pt \
hugepagesz=1G hugepages=64 default_hugepagesz=1G \
transparent_hugepage=madvise \
numa_balancing=disable \
processor.max_cstate=1 \
intel_idle.max_cstate=1 \
idle=poll \
nohz_full=8-63 \
rcu_nocbs=8-63 \
isolcpus=8-63 \
mitigations=off"

# Parameter breakdown:
# intel_iommu=on iommu=pt: IOMMU on but in passthrough mode, so GPUDirect works and translation is skipped
# hugepagesz=1G hugepages=64: 64GB of 1GB static hugepages reserved at boot; only software that maps hugetlb memory explicitly can use them
# transparent_hugepage=madvise: THP only where the application explicitly requests it
# numa_balancing=disable: disable auto-balancing; use explicit numactl instead
# processor.max_cstate=1 intel_idle.max_cstate=1: cap sleep depth; no C6/C7 wakeup latency
# idle=poll: CPUs spin instead of sleeping; eliminates wakeup latency but burns power and makes the C-state caps redundant
# nohz_full + rcu_nocbs: tickless cores 8-63; remove scheduler tick jitter from GPU workers
# isolcpus=8-63: remove cores 8-63 from the general scheduler; pin inference workers there explicitly or they sit idle
# mitigations=off: remove Spectre/Meltdown mitigations (only on trusted/isolated inference infra)

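
After the reboot it is worth confirming that the parameters actually reached the kernel, since a typo in /etc/default/grub fails silently. The sketch below checks a command line string for a list of expected parameters. The function name `missing_params` is made up, and it matches whole space-separated tokens only.

```shell
# missing_params <cmdline> <param>...: print each expected parameter that is
# absent from the given kernel command line. Returns 1 if any is missing.
missing_params() {
  cmdline=$1
  shift
  rc=0
  for p in "$@"; do
    case " $cmdline " in
      *" $p "*) ;;
      *) echo "missing: $p"; rc=1 ;;
    esac
  done
  return "$rc"
}
```

Usage: `missing_params "$(cat /proc/cmdline)" iommu=pt transparent_hugepage=madvise numa_balancing=disable isolcpus=8-63`. Silence and a zero exit status mean every listed parameter is present.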
On mitigations=off

This removes Spectre/Meltdown-class kernel mitigations and can improve throughput by 5–15% on syscall-heavy paths by reducing kernel entry and exit overhead; GPU-bound decode usually gains less. It is only appropriate on dedicated inference nodes that are not shared with untrusted workloads or other tenants. Never use it on a machine that runs user code you don't control.

Layer 3: Runtime sysctl — apply without reboot

bash — inference sysctl profile
# Memory: strongly prefer dropping page cache over swapping out the inference process
sudo sysctl -w vm.swappiness=1
sudo sysctl -w vm.vfs_cache_pressure=50

# NUMA: disable auto-balancing (use numactl explicitly instead)
sudo sysctl -w kernel.numa_balancing=0

# MGLRU: enable smarter reclaim (kernel 6.1+)
echo y | sudo tee /sys/kernel/mm/lru_gen/enabled

# CPU governor: force performance mode on all cores
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# PCIe: set the ASPM (Active State Power Management) policy to performance to avoid L0s/L1 link wake-up latency
echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy

# Make the sysctl settings persistent (overwrites the file, so reruns don't duplicate entries)
sudo tee /etc/sysctl.d/99-inference.conf > /dev/null << EOF
vm.swappiness=1
vm.vfs_cache_pressure=50
kernel.numa_balancing=0
EOF
sudo sysctl -p /etc/sysctl.d/99-inference.conf

Layer 4: vLLM / runtime — memory-specific flags

bash — vLLM launch with memory-optimal settings
# Verify GPU-to-NUMA mapping before launching (NV# = NVLink, good; SYS = path crosses the inter-socket link, bad)
nvidia-smi topo -m

# Pin each GPU worker to its local NUMA node (vLLM 0.4+)
# --numa-bind auto-detects GPU→NUMA and calls numactl per worker; check --help to confirm
# your vLLM version has it, otherwise wrap the launch in numactl as in the second example
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 32768 \
  --enable-chunked-prefill \
  --numa-bind \
  --dtype bfloat16

# For CPU-RAM offload scenarios (model > GPU VRAM): pin CPU memory too
numactl --cpunodebind=0 --membind=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

Verification — confirm settings are active

bash — post-config verification checklist
# 1. Confirm CPU governor is performance on all cores (expect one line: "<ncpus> performance")
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# 2. Confirm C-state depth (should print 1)
cat /sys/module/intel_idle/parameters/max_cstate

# 3. Confirm 1GB hugepages allocated
grep -i hugepage /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# 4. Confirm IOMMU passthrough mode
sudo dmesg | grep -i "iommu\|DMAR" | head -10

# 5. Confirm GPU PCIe link speed (should show gen 5, width 16; GPUs downshift the link when idle, so check under load)
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv

# 6. Confirm GPU→NUMA topology (NV# = NVLink; SYS = path crosses the inter-socket link, flag if unexpected)
nvidia-smi topo -m

# 7. Confirm no cross-NUMA memory allocation during inference (per-node usage of all matching processes)
numastat -p python 2>/dev/null || numastat

Scenario | Biggest win | Expected gain
Single GPU, GPU-bound decode | HBM bandwidth is the ceiling; no kernel tuning helps. Quantize to FP8/INT4. | 2–4× throughput from quantization alone
Multi-GPU, PCIe (no NVLink) | PCIe ACS disable + Gen5 verification + NUMA bind per GPU worker | 15–30% TP throughput improvement
Multi-GPU, NVLink | C-state disable + hugepages + NUMA bind. NVLink handles GPU↔GPU; CPU jitter is the bottleneck. | 5–15% TTFT improvement
CPU offload (DeepSpeed / FlexGen) | Max DDR5 frequency (XMP) + channel interleaving + 1GB hugepages + swappiness=1 | 1.5–2.5× offload throughput
CPU-only inference (llama.cpp) | XMP/EXPO + NUMA bind + hugepages + performance governor + HT enabled | 20–40% tokens/sec improvement

13 — Takeaways

What to Actually Do About It

The right intervention depends on where in the stack you're working. Here's the practical summary by audience.

For gamers & enthusiasts
Enable XMP or EXPO in BIOS, validate with MemTest86 (2+ passes) or TestMem5. Check that absolute latency in nanoseconds actually improved — not just MT/s. Consider manual subtiming tuning once stable.
For Linux performance engineers
Latency is now dominated by placement, topology, and NUMA affinity. Profile with perf mem and numastat before blaming DRAM timings. Hugepages and NUMA binding often win more than any BIOS toggle.
For AI infrastructure engineers
HBM locality, CXL tiering, hugepage pinning, DMA scheduling, and memory bandwidth partitioning dominate over raw CAS numbers. NUMA topology of GPU-to-CPU attachment matters more than JEDEC vs XMP.
For kernel engineers
MGLRU, memory tiering, memory_tier infrastructure, and CXL hot/cold demotion are the new frontier. Linux controls orchestration, not raw electrical timing — the abstraction is intentional.
For server/cloud operators
Enable ECC unconditionally. Monitor ce_count via EDAC. Enable patrol scrub on large DDR5 DIMMs. Watch tREFI settings — relaxed refresh on dense DIMMs can silently increase error rates.
For embedded developers
DRAM bring-up lives inside TF-A or U-Boot BSP packages. Training parameters are board-specific. Understand your DDR PHY's training output logs — they are the ground truth for signal integrity problems.

The future of memory optimization is not tighter CAS numbers on a BIOS screen.
It is coordination — across firmware, memory controllers, Linux, runtimes, accelerators, and workload orchestration.

Modern systems optimize topology, placement, tiering, bandwidth, DMA scheduling, and memory locality as a unified system — not a stack of independent layers. The engineers who understand all six layers have a durable advantage.