Memory Systems Engineering

How Memory Timings Are Actually Configured in Modern Systems

18 min read · DRAM · Firmware · Linux · NUMA · CXL · Systems engineering deep-dive

You paid for DDR5-6000 CL30. Your system booted at DDR5-4800 CL40. Most engineers stop at "enable XMP" and move on. This piece explains what actually happens — across firmware, memory controllers, DDR training, Linux, and runtime orchestration — and why the gap matters more than ever in the CXL and HBM era.

01 — Foundation

The Key Mental Model

Firmware asks: "Can the memory operate safely?"
Linux asks: "How should memory be used efficiently?"

The distinction that unlocks everything else

One of the most persistent misconceptions in systems engineering is that the Linux kernel directly controls low-level DRAM timings — things like CAS latency, row activation delays, or precharge timing. These timings are configured much earlier during firmware bring-up and DRAM training, often before the kernel even has an address space.

By the time Linux boots, the DRAM controller is already fully trained and its timing registers are locked. Linux's job is entirely different: it orchestrates how that trained memory gets allocated, reclaimed, migrated, and tiered. Confusing these two roles leads to wasted tuning effort and missed optimization opportunities at both layers.

02 — Architecture

The Modern Memory Stack

Every modern system delegates memory concerns across six distinct layers. Understanding who owns what determines where to look when performance is wrong — or when you want to make it better.

DRAM Chips: Store bits · obey timing constraints
DDR PHY: Signal integrity · training
Integrated Memory Controller (IMC): Schedules reads, writes, refreshes, banks
BIOS / UEFI / AGESA / TF-A: Trains memory · programs timing registers
Linux Kernel: Allocation · NUMA · reclaim · tiering
Applications: Generate the actual memory access patterns

Key insight

Firmware and Linux don't compete over memory — they operate on completely different planes. Firmware works in nanoseconds and clock cycles; Linux works in pages, nodes, and allocation decisions.

03 — Reference

The Critical Timing Parameters

Every DRAM timing parameter encodes a physical constraint: how long the silicon needs to safely complete an operation. Violating them risks data corruption; padding them wastes performance. Here's what each one governs.

Parameter | What it governs | Tighter is better? | Notes
tCL (CAS Latency) | READ command → data availability delay | Yes | The headline timing. Most quoted in marketing.
tRCD | Row activation → column access delay | Yes | Critical for random access workloads with many row opens.
tRP | Precharge delay before switching rows | Yes | Bottlenecks random access within a bank, where each new row needs a precharge first.
tRAS | Minimum row active duration | Careful | Too low destabilizes DRAM. Usually not worth aggressive tuning.
tRFC | Refresh cycle completion time | Yes | Grows with DRAM die density, so it is largest on high-capacity DDR5 DIMMs. 128GB+ DIMMs can see 400–800ns tRFC.
tREFI | Interval between refresh cycles | Risk | A longer interval frees bandwidth but raises the risk of retention errors and leaves more time for rowhammer-style disturbance between refreshes.
tWR | Write recovery time | Yes | Controls timing after a write before the row can be precharged.
Command Rate (1T/2T) | Clock cycles a command is held on the bus before it is latched | Careful | 1T improves latency but reduces stability at high frequency.
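
To make the tRFC / tREFI trade-off concrete, here is a minimal sketch that estimates what fraction of time a rank is unavailable because it is refreshing. It assumes the simple model overhead = tRFC / tREFI. The function name `refresh_overhead_pct` and the 3900ns tREFI used in the calls are illustrative assumptions, not values read from any particular DIMM.

```bash
# refresh_overhead_pct TRFC_NS TREFI_NS -> percent of time spent in refresh
# Simplified model (assumption): one refresh of length tRFC every tREFI.
refresh_overhead_pct() {
  awk -v trfc="$1" -v trefi="$2" 'BEGIN { printf "%.1f\n", 100 * trfc / trefi }'
}

# Illustrative inputs only: tRFC at both ends of the 400-800ns range above,
# with an assumed tREFI of 3900ns.
refresh_overhead_pct 400 3900
refresh_overhead_pct 800 3900
```

Under this model the two calls print 10.3 and 20.5, which is why a long tRFC on dense DIMMs is a real bandwidth cost and why raising tREFI is tempting despite the risk.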
Real-world note

Latency in nanoseconds is what matters for workloads, not clock cycles alone. CAS latency in nanoseconds is CL × 2000 / (MT/s): DDR5-6000 CL30 works out to ~10ns, DDR5-4800 CL40 to ~16.7ns. That 67% gap in CAS latency is far larger than the 25% bandwidth difference. CAS is only one component of total load-to-use latency, but the gap still shows up in single-threaded workloads, gaming, and latency-sensitive databases.
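
The arithmetic behind those two figures is easy to script. This is a small sketch, with `cas_ns` as a made-up helper name: DDR transfers data twice per clock, so the clock period in nanoseconds is 2000 divided by the MT/s rating, and CAS latency is CL clock periods.

```bash
# cas_ns MT_PER_S CL -> CAS latency in nanoseconds
# DDR moves data on both clock edges, so period (ns) = 2000 / (MT/s).
cas_ns() {
  awk -v mts="$1" -v cl="$2" 'BEGIN { printf "%.2f\n", cl * 2000 / mts }'
}

cas_ns 6000 30   # DDR5-6000 CL30 -> 10.00
cas_ns 4800 40   # DDR5-4800 CL40 -> 16.67
```

The same function is handy when comparing kits: a higher MT/s rating with a proportionally higher CL can land on the same nanosecond figure.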

04 — Profiles

JEDEC vs XMP / EXPO

Every DDR5 DIMM ships with an SPD EEPROM containing JEDEC-standardized timings — conservative defaults guaranteed to work in any compliant system. XMP (Intel) and EXPO (AMD) are separately stored overclocking profiles baked in by the memory manufacturer, operating outside the JEDEC standard.

Default profile: JEDEC DDR5
Frequency: 4800 MT/s
CAS Latency: CL40
Voltage: 1.1V
Absolute latency: ~16.7ns
Stability risk: None

The ~6.7ns gap between this default and a DDR5-6000 CL30 profile (~10ns) compounds over every DRAM access. Workloads that lean heavily on CPU L3 and DRAM, such as Redis, PostgreSQL, game engines, and physics simulations, can see throughput gains in the 15–25% range on tight XMP configs, though the actual figure depends on the workload. The main risk is stability: always run MemTest86 or TestMem5 for several passes after enabling XMP/EXPO, particularly on high-density kits.

05 — Firmware

What Firmware Actually Does

DRAM training is the most underappreciated part of the memory stack. Before Linux boots, firmware runs a calibration sequence that can take 30–90 seconds on servers with many DIMMs. Here's every major stage.

1. Power-on & SPD Read

Firmware powers the DIMM and reads the Serial Presence Detect EEPROM to understand density, speed grade, ranks, width, and which XMP/EXPO profiles are present.

2. IMC Initialization

The Integrated Memory Controller is configured with the target frequency and base timing parameters. The DDR PHY is brought up and clocks are established.

3. Write Leveling

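Because the clock and command/address signals are routed in a fly-by (daisy-chain) topology, each DRAM device on a DIMM sees the clock at a slightly different time, while the data strobes (DQS) are routed point-to-point. Write leveling finds the correct delay for each byte lane so that write strobes align with the clock edge at each device.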

4. Read Training & DQS Centering

Firmware sweeps the read DQS (data strobe) window, finding the optimal sample point. The goal is maximum setup/hold margin — maximizing the eye opening — for reliable reads at the target frequency.

5. Vref Calibration

Both the CA (command/address) Vref and DQ Vref are swept to find the center of the valid voltage window for reliable signaling.

6. ODT & ZQ Calibration

On-die termination is tuned to match the board's transmission line impedance and minimize reflections. ZQ calibration sets the output driver impedance, which drifts with temperature.

7. Timing Register Programming

All trained parameters are committed to the IMC's hardware registers. From this point on, the timing registers are effectively locked — Linux will inherit this configuration and cannot change the electrical parameters.

8. Handoff to Linux

Firmware publishes the usable memory map (the UEFI memory map or E820 on x86, device tree on many embedded boards), describes NUMA topology to the kernel through ACPI tables such as SRAT and SLIT, and transfers control. Linux never retrains the electrical link.
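
The sweep in the read-training stage can be pictured with a toy model. This is only a sketch of the idea, not how any real firmware implements it: assume the firmware has tried each delay tap and recorded pass (1) or fail (0), then picks the middle of the widest passing run as the sample point. The function name `eye_center` and the pass/fail string are hypothetical.

```bash
# eye_center RESULTS -> widest passing window and its center (0-based tap index)
# RESULTS is a string with one character per delay tap: 1 = read passed, 0 = failed.
eye_center() {
  awk -v s="$1" 'BEGIN {
    best_len = 0; best_start = -1; run = 0
    n = length(s)
    for (i = 1; i <= n; i++) {
      if (substr(s, i, 1) == "1") {
        run++
        # Track the longest run of passing taps seen so far
        if (run > best_len) { best_len = run; best_start = i - run }
      } else {
        run = 0
      }
    }
    if (best_len == 0) { print "no passing window"; exit 1 }
    printf "start=%d width=%d center=%d\n", best_start, best_len, best_start + int((best_len - 1) / 2)
  }'
}

# Hypothetical sweep: taps 3 through 9 pass, everything else fails
eye_center 0001111111000
```

Sampling at the center rather than at the first passing tap is what buys setup and hold margin on both sides. A narrow window at the target frequency is the usual sign that a profile will train but not stay stable.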

06 — Linux

Linux Memory Interfaces

Linux does not retrain DRAM electrically — but it exposes a rich set of interfaces that let you observe memory topology, monitor reliability, and control higher-level placement behavior. Knowing where to look is half the job.

/proc interfaces

/proc/meminfo
System-wide memory summary: total, free, available, buffered, cached, swap. The canonical first stop.
/proc/zoneinfo
Per-node, per-zone memory accounting. Shows watermarks, free pages, and memory pressure per zone.
/proc/buddyinfo
Buddy allocator free-block counts per order, per zone. Fragmentation shows up here as a shortage of high-order blocks; writing 1 to /proc/sys/vm/compact_memory triggers compaction.
/proc/vmstat
VM event counters — page faults, swapped in/out, THP activity, direct reclaim events, NUMA migrations.
/proc/iomem
Physical memory map including reserved regions, PCI BARs, and ACPI tables.
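
As an example of reading one of these files, the sketch below summarizes /proc/buddyinfo. It assumes the usual layout, where each line is "Node N, zone NAME" followed by free-block counts for order 0 upward, and it assumes 4KB base pages so that order 9 corresponds to a 2MB block. The function name `thp_capable_blocks` is made up for illustration.

```bash
# thp_capable_blocks: read /proc/buddyinfo-style text on stdin and report,
# per zone, how many free blocks are order 9 or larger (2MB+ with 4KB pages).
thp_capable_blocks() {
  awk '{
    node = $2; sub(/,/, "", node)
    n = 0
    # Field 5 is order 0, so order 9 starts at field 14
    for (i = 5 + 9; i <= NF; i++) n += $i
    printf "node %s zone %s: %d free blocks of order >= 9\n", node, $4, n
  }'
}

# Usage on a live system:
#   thp_capable_blocks < /proc/buddyinfo
```

A zone with plenty of free memory but zero blocks at order 9 or above is fragmented: hugepage allocations there will need compaction first.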

/sys interfaces

/sys/devices/system/node/
NUMA topology — per-node distances, CPUs, and memory stats. Essential for multi-socket and CXL-attached memory analysis.
/sys/kernel/mm/
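Transparent hugepage controls, static hugepage pools, KSM, MGLRU settings (lru_gen), and the NUMA demotion switch (numa/demotion_enabled). NUMA balancing and compaction knobs live under /proc/sys/ instead.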
/sys/devices/system/edac/
Error Detection And Correction counters. ce_count (correctable) and ue_count (uncorrectable) are your DIMM health indicators.

Inspect — what are your sticks actually running at?

bash — hardware inspection
# Full DIMM details: manufacturer, speed, configured speed, size, slots
sudo dmidecode --type memory

# Decode SPD EEPROM directly: exact tCL, tRCD, tRP, tRAS from the chip
# (part of i2c-tools; needs an SPD driver loaded, e.g. ee1004 for DDR4 or spd5118 for DDR5)
sudo decode-dimms

# Quick grep: configured vs rated speed (spot XMP not applied)
sudo dmidecode -t memory | grep -E "Speed:|Type:" | sort | uniq -c

# NUMA topology: nodes, inter-node distances, CPUs per node
numactl --hardware

# Memory-related kernel messages during boot (training results, errors)
dmesg | grep -i 'memory\|ddr\|numa\|imc\|edac'
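
If you would rather have the "XMP not applied" check spelled out than eyeball grep output, a small filter can do it. This sketch assumes dmidecode's usual field names ("Locator", "Speed", and "Configured Memory Speed", or "Configured Clock Speed" on older versions); the function name `below_rated` is hypothetical.

```bash
# below_rated: read `dmidecode -t memory` output on stdin and list DIMMs
# whose configured speed is lower than their rated speed.
below_rated() {
  awk -F': *' '
    /^[ \t]*Locator:/ { slot = $2 }
    /^[ \t]*Speed:/   { rated = $2 + 0 }   # "6000 MT/s" -> 6000, "Unknown" -> 0
    /^[ \t]*Configured (Memory|Clock) Speed:/ {
      conf = $2 + 0
      if (rated > 0 && conf > 0 && conf < rated)
        printf "%s: rated %d MT/s, running %d MT/s\n", slot, rated, conf
    }
  '
}

# Usage on a live system:
#   sudo dmidecode -t memory | below_rated
```

No output means every populated slot is running at its rated speed. Note that the rated speed reported here comes from firmware tables, so it is worth cross-checking against the kit's label.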

ECC health monitoring

Server operators — read this

Silently elevated correctable error counts are an early warning of a degrading DIMM. An uncorrectable error is a data integrity event — investigate immediately.

bash — ECC / EDAC
# Correctable errors (single-bit, ECC fixed) — rising count = degrading DIMM
cat /sys/devices/system/edac/mc/mc0/ce_count

# Uncorrectable errors — data may be corrupt. Investigate NOW.
cat /sys/devices/system/edac/mc/mc0/ue_count

# Per-DIMM slot breakdown (mc0/dimm0, dimm1, etc.)
grep -r "" /sys/devices/system/edac/mc/mc0/ 2>/dev/null | grep "ce_count\|ue_count"

# EDAC events in the kernel log
dmesg | grep -i edac
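
For routine monitoring it helps to roll the per-controller counters into one line that a cron job or health check can consume. The following is a minimal sketch under the assumption that each controller directory exposes ce_count and ue_count as shown above; `edac_totals` is an illustrative name, and the optional argument exists only so the function can be pointed at a different directory.

```bash
# edac_totals [ROOT] -> prints summed correctable/uncorrectable counts
# across all memory controllers; returns non-zero if any UE was recorded.
edac_totals() {
  root="${1:-/sys/devices/system/edac/mc}"
  ce=0
  ue=0
  for mc in "$root"/mc*; do
    # Skip if the glob matched nothing or the counters are unreadable
    [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
    ce=$((ce + $(cat "$mc/ce_count")))
    ue=$((ue + $(cat "$mc/ue_count")))
  done
  echo "ce=$ce ue=$ue"
  [ "$ue" -eq 0 ]
}

# Usage on a live system:
#   edac_totals || echo "uncorrectable memory errors recorded"
```

The absolute ce value matters less than its trend, so in practice you would log this output periodically and alert on growth.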

Live performance monitoring

bash — runtime monitoring
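# Live: watch direct reclaim, swap-ins, major faults, NUMA migrations, THP events
watch -n1 'grep -E "pgscan_direct|pswpin|pgmajfault|numa_pages_migrated|thp_fault_alloc" /proc/vmstat'

# NUMA hit/miss counters per node: rising numa_miss / other_node means cross-node traffic
numastat

# Per-node memory usage in /proc/meminfo style
numastat -m

# Cache miss rate plus load/store counts (system-wide, 5s sample)
# mem-loads / mem-stores are not available on every CPU; check `perf list`
perf stat -e cache-misses,cache-references,mem-loads,mem-stores -a sleep 5

# System-wide memory access sampling (requires perf with MEM support);
# drop -a and append a command to profile a single process
perf mem record -a sleep 10
perf mem report

07 — Controller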

The Integrated Memory Controller

Linux does not speak directly to DRAM banks. Every read and write from the CPU goes through the IMC, which translates physical addresses into DRAM row/bank/column coordinates, schedules accesses across channels, handles refresh timing, and arbitrates between competing reads and writes.
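
To give a feel for that address translation, here is a toy decoder. The bit layout is entirely hypothetical and chosen for readability: real controllers use vendor-specific, often hashed mappings that are not generally documented. The function name `decode_addr` is likewise made up.

```bash
# decode_addr PHYS_ADDR -> channel/column/bank/row under a HYPOTHETICAL layout:
#   bits [5:0]   offset within a 64-byte cache line
#   bit  [6]     channel
#   bits [13:7]  column
#   bits [17:14] bank
#   bits [..:18] row
decode_addr() {
  addr=$(( $1 ))
  channel=$(( (addr >> 6) & 0x1 ))
  column=$(( (addr >> 7) & 0x7f ))
  bank=$(( (addr >> 14) & 0xf ))
  row=$(( addr >> 18 ))
  echo "channel=$channel column=$column bank=$bank row=$row"
}

decode_addr 0x40         # the next cache line lands on the other channel
decode_addr 0x12345678
```

Even in this simplified layout, consecutive cache lines alternate between channels, which is the basic idea behind channel interleaving: sequential streams spread across both channels instead of saturating one.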

Firmware programs the IMC's timing registers once during training. Linux then controls the higher-level decisions:

NUMA Placement

Which NUMA node receives allocations. Local-node placement minimizes access latency for the CPUs on that node.

Huge Pages

2MB and 1GB pages reduce TLB pressure dramatically for large working-set workloads. Transparent Huge Pages automate this.

Page Migration

NUMA balancing migrates pages to the node where they're most accessed. Critical on multi-socket servers.

MGLRU Reclaim

Multi-Gen LRU tracks page access generations rather than strict access order, leading to much smarter reclaim decisions.

CXL Tiering

Promotes hot pages to local DRAM; demotes cold pages to slower CXL-attached memory, expanding effective capacity.

Memory Pressure

Watermark-based reclaim keeps free memory above per-zone thresholds. kswapd reclaims in the background; when it falls behind, allocating tasks enter direct reclaim and stall until pages are freed.

08 — Modern Linux

MGLRU and CXL Tiering

The most impactful changes in Linux memory management over the last three kernel generations are not about timing registers — they're about intelligent orchestration of a multi-tier memory hierarchy.

MGLRU (Multi-Generational LRU, mainlined in Linux 6.1) replaces the old active/inactive LRU split with a generational tracking model. Rather than a binary "recently accessed or not," MGLRU maintains a sliding window of access generations that more accurately reflects true workload working sets. The result is significantly reduced refault rates and better behavior under memory pressure — particularly for mixed workloads like databases with large buffer caches running alongside application servers.

CXL memory tiering changes the game for large-memory infrastructure. Compute Express Link lets systems attach terabytes of DRAM-like memory over the PCIe physical layer at lower cost-per-bit than local DIMMs, but with higher latency (typically 2–4× vs local DDR5). Linux has no separate tiering daemon for this: cold pages are demoted to CXL memory during reclaim, and hot pages are promoted back to local DRAM by NUMA balancing running in memory-tiering mode. For AI inference, vector databases, and in-memory caches, this can expand usable working memory by 4–8× without proportional cost.

Hot page promotion

Frequently accessed pages in the slow tier are migrated to local DRAM. The kernel detects them through NUMA hinting faults when NUMA balancing runs in memory-tiering mode (kernel.numa_balancing=2).

Cold page demotion

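Pages that reclaim finds cold are demoted from local DRAM to a slower memory tier such as CXL, instead of being discarded or swapped out. This happens asynchronously in the reclaim path once /sys/kernel/mm/numa/demotion_enabled is set, with tier ordering supplied by the memory_tier infrastructure.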
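
A quick way to see whether both halves of tiering are switched on is to read the two relevant knobs. This sketch assumes a kernel recent enough to expose /sys/kernel/mm/numa/demotion_enabled (which reads back as true or false) and treats kernel.numa_balancing=2 as memory-tiering mode. The function name `tiering_status` is illustrative, and its two optional arguments exist only to let you point it at other files.

```bash
# tiering_status [DEMOTE_FILE] [BALANCE_FILE] -> one-line summary of tiering knobs
tiering_status() {
  demote_file="${1:-/sys/kernel/mm/numa/demotion_enabled}"
  balance_file="${2:-/proc/sys/kernel/numa_balancing}"
  # Missing files usually mean the kernel lacks the feature
  demote=$(cat "$demote_file" 2>/dev/null || echo unsupported)
  mode=$(cat "$balance_file" 2>/dev/null || echo unsupported)
  case "$mode" in
    0) promo="off" ;;
    1) promo="balancing-only" ;;
    2) promo="tiering" ;;
    *) promo="unknown" ;;
  esac
  echo "demotion=$demote promotion=$promo"
}

tiering_status
```

On a machine without a slow memory tier both knobs can be on and nothing will move; the settings only matter once the kernel sees more than one tier.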

Enabling and checking MGLRU

bash — MGLRU (Linux 6.1+)
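# Check if MGLRU is active: 0x0007 means all components enabled, 0x0000 means off
cat /sys/kernel/mm/lru_gen/enabled

# Enable MGLRU if not already on (needs root; "y" turns on all components)
echo y | sudo tee /sys/kernel/mm/lru_gen/enabled

# Make it persistent across reboots. This is a sysfs file, not a sysctl, so
# sysctl.d cannot set it; one option on systemd-based systems is a tmpfiles rule
echo "w /sys/kernel/mm/lru_gen/enabled - - - - y" | sudo tee /etc/tmpfiles.d/mglru.conf

# View per-memcg, per-node generation stats (requires debugfs to be mounted)
sudo cat /sys/kernel/debug/lru_gen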
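
The value read back from /sys/kernel/mm/lru_gen/enabled is a bitmask. Per the kernel's MGLRU admin guide, bit 0x0001 is the main switch, 0x0002 is batched clearing of accessed bits in leaf page table entries, and 0x0004 is clearing of accessed bits in non-leaf entries, so a fully enabled system reads 0x0007. The helper below is a small sketch for decoding it; `lru_gen_decode` is a made-up name.

```bash
# lru_gen_decode MASK -> one line per MGLRU component, on or off
lru_gen_decode() {
  mask=$(( $1 ))
  for entry in "1:core MGLRU" \
               "2:batched leaf accessed-bit clearing" \
               "4:non-leaf accessed-bit clearing"; do
    bit=${entry%%:*}
    name=${entry#*:}
    if [ $((mask & bit)) -ne 0 ]; then state=on; else state=off; fi
    echo "$name: $state"
  done
}

# Usage on a live system:
#   lru_gen_decode "$(cat /sys/kernel/mm/lru_gen/enabled)"
lru_gen_decode 0x0007
```

A value such as 0x0003 is not an error: the non-leaf optimization depends on hardware support and may be left off on some platforms.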

09 — Configuration

Performance Modes in BIOS

Beyond XMP/EXPO, server and workstation firmware expose additional memory configuration modes that significantly impact Linux memory behavior.

Mode | What it does | Best for
XMP / EXPO | Enables manufacturer-tuned overclocking profile: higher frequency, tighter sub-timings, higher voltage. | Gaming, workstations, latency-sensitive Linux workloads
Performance Mode | Aggressive IMC tuning: tighter command scheduling, 1T command rate where stable. | Single-socket high-frequency workloads
Memory Interleaving | Distributes physical addresses across channels/DIMMs to maximize bandwidth utilization. | Sequential bandwidth workloads: HPC, video encoding, ML training
Gear Down / Gear 2 | Two related stability options. Gear Down Mode (AMD) latches command/address signals on every other clock; Gear 2 (Intel) runs the memory controller at half the DRAM clock. Both trade some latency for stability at high data rates. | DDR5-6400+ where Gear 1 is unstable
ECC Mode | Enables error detection and correction using extra DRAM bits. Slight bandwidth cost. | Server workloads, embedded, any production environment
Patrol Scrub | Background ECC scrubbing proactively reads every memory location on a schedule and corrects single-bit errors before they accumulate. | Long-running server environments with large DIMMs

10 — Runtime Tuning

Runtime Performance Knobs

These are the Linux-side interventions that actually move the needle on memory performance — all tunable without a reboot, all reversible. Start here before touching BIOS settings.

Swappiness — how aggressively Linux reclaims RAM

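Swappiness sets how willing the kernel is to swap out anonymous memory, relative to dropping page cache, when it reclaims. The default of 60 means Linux is comfortable moving some application memory to swap even when RAM isn't critically full. For latency-sensitive workloads and in-memory databases, dropping it close to zero keeps data in RAM longer.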

bash — swappiness
# Check current value (default: 60)
cat /proc/sys/vm/swappiness

# Set immediately (no reboot needed)
# General servers / workstations: 10
# Databases / latency-critical: 1
sudo sysctl -w vm.swappiness=1

# Make it permanent
echo "vm.swappiness=1" | sudo tee -a /etc/sysctl.d/99-memory-perf.conf
sudo sysctl -p /etc/sysctl.d/99-memory-perf.conf

Hugepages — eliminating TLB pressure

Standard 4KB pages create enormous TLB pressure for large working-set workloads. 2MB hugepages reduce TLB misses dramatically. There are two approaches: Transparent Huge Pages (automatic, kernel-managed) and static hugepages (pre-allocated, and guaranteed to apps that request them through hugetlbfs or mmap with MAP_HUGETLB). Databases typically prefer THP's madvise mode so they can opt in per allocation; the always mode can cause latency spikes during compaction.

bash — hugepages
# Check current hugepage state
grep -i huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# THP mode options: always | madvise | never
# 'always'  : kernel decides everywhere (good default for most workloads)
# 'madvise' : only where app calls madvise(MADV_HUGEPAGE), best for DBs
# 'never'   : disabled (for real-time/latency-deterministic workloads)
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Pre-allocate static 2MB hugepages (512 × 2MB = 1GB reserved)
echo 512 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Persistent static hugepages via sysctl
echo "vm.nr_hugepages=512" | sudo tee -a /etc/sysctl.d/99-memory-perf.conf
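
Sizing the static pool is a ceiling division: the working set in bytes divided by the hugepage size, rounded up. A minimal sketch, with `hugepages_needed` as an illustrative name and a 2048kB default page size as an assumption you should check against your system:

```bash
# hugepages_needed BYTES [PAGE_KB] -> number of hugepages to reserve
# PAGE_KB defaults to 2048 (2MB pages); pass 1048576 for 1GB pages.
hugepages_needed() {
  bytes=$1
  page_kb=${2:-2048}
  page_bytes=$((page_kb * 1024))
  # Round up so a partial page still gets a full hugepage
  echo $(( (bytes + page_bytes - 1) / page_bytes ))
}

hugepages_needed $((1024 * 1024 * 1024))         # 1GB working set -> 512
hugepages_needed $((1024 * 1024 * 1024 + 1))     # one byte more   -> 513
```

The page size in use can be read from the Hugepagesize line in /proc/meminfo. Remember that memory reserved this way is unavailable to everything else, so size the pool to the application rather than rounding up generously.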

NUMA pinning — the highest-ROI tuning on multi-socket

Cross-NUMA-node memory accesses add 50–150ns of latency per access on typical dual-socket systems. Pinning a process to a node so its memory and CPUs are co-located is often the single biggest performance win available — and it costs nothing. Auto NUMA balancing helps but adds jitter; explicit pinning is better for latency-critical processes.

bash — NUMA pinning
# Check NUMA hit/miss counters: rising numa_miss / other_node = pin your process
numastat

# Check auto NUMA balancing state (0 = off, 1 = on)
cat /proc/sys/kernel/numa_balancing

# Launch a process pinned to NUMA node 0 (CPUs + memory local)
numactl --cpunodebind=0 --membind=0 ./myapp

# Pin an already-running process (PID) to node 0's CPUs
taskset -pc "$(cat /sys/devices/system/node/node0/cpulist)" "$(pidof -s myapp)"

# taskset moves only the CPUs; on a dual-socket box, also migrate the
# pages already allocated on node 1 over to node 0
sudo migratepages "$(pidof -s myapp)" 1 0

# Interleave memory across all nodes (good for bandwidth-bound workloads)
numactl --interleave=all ./myapp
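
To turn the raw NUMA counters into a single number worth tracking, you can compute the share of allocations that missed their preferred node. The sketch below reads the per-node numastat format found under /sys/devices/system/node/nodeN/numastat (one "name value" pair per line); `numa_miss_pct` is an illustrative name.

```bash
# numa_miss_pct: read numastat-style "name value" lines on stdin and print
# numa_miss as a percentage of numa_hit + numa_miss, summed over all input.
numa_miss_pct() {
  awk '
    $1 == "numa_hit"  { hit  += $2 }
    $1 == "numa_miss" { miss += $2 }
    END {
      if (hit + miss == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * miss / (hit + miss)
    }
  '
}

# Usage on a live system (all nodes combined):
#   cat /sys/devices/system/node/node*/numastat | numa_miss_pct
```

These counters are cumulative since boot, so for a running workload it is more informative to sample twice and compare the deltas than to read the lifetime ratio.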

Priority order for performance gains

In the author's experience across production Linux systems: NUMA pinning wins most often on multi-socket → hugepages next on large working-set apps → swappiness for anything memory-pressure-sensitive → MGLRU → then BIOS XMP/EXPO. Most engineers check BIOS first and Linux last. It should be the other way around.

11 — Takeaways

What to Actually Do About It

The right intervention depends on where in the stack you're working. Here's the practical summary by audience.

For gamers & enthusiasts
Enable XMP or EXPO in BIOS, validate with MemTest86 (2+ passes) or TestMem5. Check that absolute latency in nanoseconds actually improved — not just MT/s. Consider manual subtiming tuning once stable.
For Linux performance engineers
Latency is now dominated by placement, topology, and NUMA affinity. Profile with perf mem and numastat before blaming DRAM timings. Hugepages and NUMA binding often win more than any BIOS toggle.
For AI infrastructure engineers
HBM locality, CXL tiering, hugepage pinning, DMA scheduling, and memory bandwidth partitioning dominate over raw CAS numbers. NUMA topology of GPU-to-CPU attachment matters more than JEDEC vs XMP.
For kernel engineers
MGLRU, the memory_tier infrastructure, and CXL hot-page promotion and cold-page demotion are the new frontier. Linux controls orchestration, not raw electrical timing, and that abstraction is intentional.
For server/cloud operators
Enable ECC unconditionally. Monitor ce_count via EDAC. Enable patrol scrub on large DDR5 DIMMs. Watch tREFI settings — relaxed refresh on dense DIMMs can silently increase error rates.
For embedded developers
DRAM bring-up lives inside TF-A or U-Boot BSP packages. Training parameters are board-specific. Understand your DDR PHY's training output logs — they are the ground truth for signal integrity problems.

The future of memory optimization is not tighter CAS numbers on a BIOS screen.
It is coordination — across firmware, memory controllers, Linux, runtimes, accelerators, and workload orchestration.

Modern systems optimize topology, placement, tiering, bandwidth, DMA scheduling, and memory locality as a unified system — not a stack of independent layers. The engineers who understand all six layers have a durable advantage.