The Key Mental Model
Firmware asks: "Can the memory operate safely?"
Linux asks: "How should memory be used efficiently?"
The distinction that unlocks everything else
One of the most persistent misconceptions in systems engineering is that the Linux kernel directly controls low-level DRAM timings — things like CAS latency, row activation delays, or precharge timing. These timings are configured much earlier during firmware bring-up and DRAM training, often before the kernel even has an address space.
By the time Linux boots, the DRAM controller is already fully trained and its timing registers are locked. Linux's job is entirely different: it orchestrates how that trained memory gets allocated, reclaimed, migrated, and tiered. Confusing these two roles leads to wasted tuning effort and missed optimization opportunities at both layers.
The Modern Memory Stack
Every modern system delegates memory concerns across six distinct layers. Understanding who owns what determines where to look when performance is wrong — or when you want to make it better.
Firmware and Linux don't compete over memory — they operate on completely different planes. Firmware works in nanoseconds and clock cycles; Linux works in pages, nodes, and allocation decisions.
The Critical Timing Parameters
Every DRAM timing parameter encodes a physical constraint: how long the silicon needs to safely complete an operation. Violating them risks data corruption; padding them wastes performance. Here's what each one governs.
| Parameter | What it governs | Tighter is better? | Notes |
|---|---|---|---|
| tCL (CAS Latency) | READ command → data availability delay | Yes | The headline timing. Most quoted in marketing. |
| tRCD | Row activation → column access delay | Yes | Critical for random access workloads with many row opens. |
| tRP | Precharge delay before switching rows | Yes | Limits random access when consecutive requests hit different rows in the same bank. |
| tRAS | Minimum row active duration | Careful | Too low destabilizes DRAM. Usually not worth aggressive tuning. |
| tRFC | Refresh cycle completion time | Yes | Explodes on large DDR5 DIMMs. 128GB+ DIMMs can see 400–800ns tRFC. |
| tREFI | Interval between refresh cycles | Risk | Longer intervals free up bandwidth but raise the risk of retention errors and rowhammer bit flips. |
| tWR | Write recovery time | Yes | Controls timing after a write before the row can be precharged. |
| Command Rate (1T/2T) | Clock cycles before command issuance | Careful | 1T improves latency but reduces stability at high frequency. |
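The tRFC and tREFI rows interact: a rank is unavailable for roughly tRFC out of every tREFI. The sketch below works through that arithmetic. It assumes a tREFI of 3900 ns, which is an illustrative figure rather than one taken from the table, and it ignores refinements such as same-bank or fine-granularity refresh. `refresh_overhead_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Approximate share of time a rank spends refreshing: tRFC / tREFI.
refresh_overhead_pct() {
    # $1 = tRFC in ns, $2 = tREFI in ns (assumed values, see lead-in)
    LC_ALL=C awk -v trfc="$1" -v trefi="$2" \
        'BEGIN { printf "%.1f\n", 100 * trfc / trefi }'
}

refresh_overhead_pct 400 3900   # prints 10.3
refresh_overhead_pct 800 3900   # prints 20.5
```

Under these assumptions, doubling tRFC doubles the time lost to refresh, which is why refresh cost becomes visible on very dense DIMMs.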
Absolute latency in nanoseconds is what matters for workloads, not clock cycles alone. DDR5-6000 CL30 achieves ~10ns. DDR5-4800 CL40 achieves ~16.7ns. That 67% gap in absolute latency is far larger than the 25% bandwidth difference, and it dominates single-threaded workloads, gaming, and latency-sensitive databases.
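The conversion behind those figures is simple enough to script. DDR transfers data twice per clock, so the clock in MHz is half the data rate in MT/s, and CAS latency in nanoseconds is CL × 2000 / data rate. A minimal sketch, with `cas_latency_ns` as a hypothetical helper name:

```shell
#!/bin/sh
# First-word CAS latency in nanoseconds.
# clock_MHz = data_rate / 2, so latency_ns = CL / clock_MHz * 1000
#                                          = CL * 2000 / data_rate
cas_latency_ns() {
    # $1 = data rate in MT/s (e.g. 6000), $2 = CAS latency in cycles (e.g. 30)
    LC_ALL=C awk -v rate="$1" -v cl="$2" \
        'BEGIN { printf "%.2f\n", cl * 2000 / rate }'
}

cas_latency_ns 6000 30   # DDR5-6000 CL30, prints 10.00
cas_latency_ns 4800 40   # DDR5-4800 CL40, prints 16.67
```

This covers only the CAS component; a full random access also pays tRCD and, on a row conflict, tRP.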
JEDEC vs XMP / EXPO
Every DDR5 DIMM ships with an SPD EEPROM containing JEDEC-standardized timings — conservative defaults guaranteed to work in any compliant system. XMP (Intel) and EXPO (AMD) are separately stored overclocking profiles baked in by the memory manufacturer, operating outside the JEDEC standard.
The ~6.7ns gap between DDR5-4800 CL40 and DDR5-6000 CL30 compounds heavily. Redis, PostgreSQL, game engines, and physics simulations that are bound by L3 and DRAM access latency can see 15–25% throughput improvements on tight XMP configs. The main risk is stability: always run MemTest86 or TestMem5 for several passes after enabling XMP/EXPO, particularly on high-density kits.
What Firmware Actually Does
DRAM training is the most underappreciated part of the memory stack. Before Linux boots, firmware runs a calibration sequence that can take 30–90 seconds on servers with many DIMMs. Here's every major stage.
1. Firmware powers the DIMM and reads the Serial Presence Detect (SPD) EEPROM to learn density, speed grade, ranks, width, and which XMP/EXPO profiles are present.
2. The Integrated Memory Controller (IMC) is configured with the target frequency and base timing parameters. The DDR PHY is brought up and clocks are established.
3. Because the clock and command/address signals are routed in a fly-by topology, each DRAM device sees the clock at a slightly different phase. Write leveling finds the delay for each device so that write strobes (DQS) align with the clock edge there.
4. Firmware sweeps the read DQS (data strobe) window to find the optimal sample point. The goal is maximum setup/hold margin, that is, the widest eye opening, for reliable reads at the target frequency.
5. Both the CA (command/address) Vref and DQ Vref are swept to find the center of the valid voltage window for reliable signaling.
6. On-die termination is tuned to match the board's transmission line impedance and minimize reflections. ZQ calibration sets the output driver impedance, which drifts with temperature.
7. All trained parameters are committed to the IMC's hardware registers. From this point on, the timing registers are effectively locked: Linux inherits this configuration and cannot change the electrical parameters.
8. Firmware maps usable memory regions into the system address space, passes the memory map (e820 or UEFI) and the NUMA topology (ACPI SRAT/SLIT tables) to the kernel, and transfers control. Linux never retrains the electrical link.
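The read-training sweep described above can be pictured as a pass/fail test at each delay tap, followed by picking the middle of the widest passing window. The following is a toy illustration only, not how any particular firmware implements it. The input format (one character per tap, `1` for pass and `0` for fail) and the name `eye_center` are assumptions made for the example.

```shell
#!/bin/sh
# Toy model of eye centering: find the longest run of passing taps
# and report its centre (0-based tap index) and width.
eye_center() {
    # $1 = pass/fail string, one character per delay tap: 1 = pass, 0 = fail
    LC_ALL=C awk -v s="$1" 'BEGIN {
        best_len = 0; best_start = 0; run = 0
        n = length(s)
        for (i = 1; i <= n; i++) {
            if (substr(s, i, 1) == "1") {
                run++
                if (run > best_len) { best_len = run; best_start = i - run + 1 }
            } else {
                run = 0
            }
        }
        if (best_len == 0) { print "no passing window"; exit 1 }
        # best_start is 1-based; convert to a 0-based centre tap.
        printf "center=%d width=%d\n", best_start - 1 + int((best_len - 1) / 2), best_len
    }'
}

eye_center 0001111111000   # prints center=6 width=7
```

Sampling at the centre leaves equal margin on both sides, which is the point of maximizing setup/hold margin.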
Linux Memory Interfaces
Linux does not retrain DRAM electrically — but it exposes a rich set of interfaces that let you observe memory topology, monitor reliability, and control higher-level placement behavior. Knowing where to look is half the job.
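One quick way to see what firmware handed over is the physical memory map in /proc/iomem. The sketch below totals the ranges marked `System RAM`. It assumes the usual `start-end : System RAM` line format with inclusive hexadecimal ranges, and `system_ram_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# Sum the size of all top-level "System RAM" ranges read from stdin.
system_ram_bytes() {
    total=0
    while IFS= read -r line; do
        case "$line" in
            [0-9a-f]*" : System RAM")
                range=${line%% :*}      # e.g. 00100000-001fffff
                start=${range%-*}
                end=${range#*-}
                # Ranges are inclusive, hence the +1.
                total=$(( total + 0x$end - 0x$start + 1 ))
                ;;
        esac
    done
    echo "$total"
}

# Usage (run as root; unprivileged reads show zeroed addresses):
#   sudo cat /proc/iomem | system_ram_bytes
```

The total is typically a little below the installed capacity because firmware reserves some regions for itself and for devices.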
/proc interfaces
vm.compact_memory has run.
/sys interfaces
ce_count (correctable) and ue_count (uncorrectable) are your DIMM health indicators.
Inspect: what are your sticks actually running at?
    # Full DIMM details: manufacturer, rated speed, configured speed, size, slot
    sudo dmidecode --type memory

    # Decode the SPD EEPROM directly: tCL, tRCD, tRP, tRAS as stored on the module
    sudo decode-dimms

    # Quick check: configured vs rated speed (spots XMP/EXPO not applied)
    sudo dmidecode -t memory | grep -E "Speed:|Type:" | sort | uniq -c

    # NUMA topology: nodes, inter-node distances, CPUs per node
    numactl --hardware

    # Memory-related kernel messages from boot (memory map, NUMA, EDAC, errors)
    dmesg | grep -i 'memory\|ddr\|numa\|imc\|edac'
ECC health monitoring
Silently elevated correctable error counts are an early warning of a degrading DIMM. An uncorrectable error is a data integrity event — investigate immediately.
    # Correctable errors (single-bit, fixed by ECC); a rising count suggests a degrading DIMM
    cat /sys/devices/system/edac/mc/mc0/ce_count

    # Uncorrectable errors: data may be corrupt. Investigate NOW.
    cat /sys/devices/system/edac/mc/mc0/ue_count

    # Per-DIMM slot breakdown (mc0/dimm0, dimm1, etc.)
    grep -r "" /sys/devices/system/edac/mc/mc0/ 2>/dev/null | grep "ce_count\|ue_count"

    # EDAC events in the kernel log
    dmesg | grep -i edac
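On machines with several memory controllers, checking mc0 alone can miss errors elsewhere. A minimal sketch that totals both counters across every controller follows. `edac_totals` is a hypothetical helper, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# Sum ce_count and ue_count across all EDAC memory controllers.
edac_totals() {
    root=${1:-/sys/devices/system/edac/mc}
    ce=0
    ue=0
    for mc in "$root"/mc[0-9]*; do
        # Skip controllers that do not expose both counters.
        if [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ]; then
            ce=$(( ce + $(cat "$mc/ce_count") ))
            ue=$(( ue + $(cat "$mc/ue_count") ))
        fi
    done
    echo "ce=$ce ue=$ue"
}

# Usage:
#   edac_totals
```

Logging this output periodically makes a rising correctable count easy to spot. A result of `ce=0 ue=0` can also mean that no EDAC driver is loaded, so check that the mc directories exist.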
Live performance monitoring
    # Live: watch swap-ins, major faults, NUMA migrations, THP events
    watch -n1 'grep -E "pgmajfault|numa_pages_migrated|thp_fault_alloc|pswpin" /proc/vmstat'

    # NUMA hit/miss counters per node; cross-node misses tank latency
    # (plain numastat shows numa_hit/numa_miss; -m prints per-node meminfo instead)
    numastat

    # Cache miss rate plus memory load/store event counts (system-wide, 5s sample)
    perf stat -e cache-misses,cache-references,mem-loads,mem-stores -a sleep 5

    # Memory access profiling, system-wide (requires perf with MEM support)
    perf mem record -a sleep 10
    perf mem report
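The raw NUMA counters are easier to judge as a ratio. The sketch below reads the `numa_hit` and `numa_miss` lines from a per-node numastat file and prints the hit percentage. `numa_hit_pct` is a hypothetical helper, and the counters are cumulative since boot, so two samples taken a few seconds apart describe current behavior better than a single reading.

```shell
#!/bin/sh
# Percentage of allocations satisfied on the intended node,
# computed from "numa_hit N" / "numa_miss N" lines on stdin.
numa_hit_pct() {
    LC_ALL=C awk '
        $1 == "numa_hit"  { hit = $2 }
        $1 == "numa_miss" { miss = $2 }
        END {
            total = hit + miss
            if (total == 0) { print "n/a"; exit }
            printf "%.1f\n", 100 * hit / total
        }'
}

# Usage:
#   numa_hit_pct < /sys/devices/system/node/node0/numastat
```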
The Integrated Memory Controller
Linux does not speak directly to DRAM banks. Every read and write from the CPU goes through the IMC, which translates physical addresses into DRAM row/bank/column coordinates, schedules accesses across channels, handles refresh timing, and arbitrates between competing reads and writes.
Firmware programs the IMC's timing registers once during training. Linux then controls the higher-level decisions:
- NUMA placement: which NUMA node receives allocations. Local-node placement minimizes access latency for the CPUs on that node.
- Huge pages: 2MB and 1GB pages reduce TLB pressure dramatically for large working-set workloads. Transparent Huge Pages automate the 2MB case.
- NUMA balancing: migrates pages to the node where they're most accessed. Critical on multi-socket servers.
- Multi-Gen LRU: tracks page access generations rather than strict access order, leading to much smarter reclaim decisions.
- Memory tiering: promotes hot pages to local DRAM and demotes cold pages to slower CXL-attached memory, expanding effective capacity.
- Reclaim: kswapd frees memory in the background once free pages fall below the zone watermarks. Direct reclaim, which stalls the allocating process, kicks in only when background reclaim cannot keep up.
MGLRU and CXL Tiering
The most impactful changes in Linux memory management over the last three kernel generations are not about timing registers — they're about intelligent orchestration of a multi-tier memory hierarchy.
MGLRU (Multi-Generational LRU, mainlined in Linux 6.1) replaces the old active/inactive LRU split with a generational tracking model. Rather than a binary "recently accessed or not," MGLRU maintains a sliding window of access generations that more accurately reflects true workload working sets. The result is significantly reduced refault rates and better behavior under memory pressure — particularly for mixed workloads like databases with large buffer caches running alongside application servers.
CXL memory tiering changes the game for large-memory infrastructure. Compute Express Link lets systems attach terabytes of DRAM-like memory over the PCIe physical layer at lower cost-per-bit than local DIMMs, but with higher latency (typically 2–4× vs local DDR5). The kernel's memory-tiering support, rather than a separate daemon, promotes frequently accessed pages to local DRAM and demotes cold pages to CXL memory automatically. For AI inference, vector databases, and in-memory caches, this can expand usable working memory by 4–8× without proportional cost.
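Promotion: frequently accessed pages in the slow tier are migrated to local DRAM. The kernel detects hot pages through NUMA-balancing hint faults when the memory-tiering mode of kernel.numa_balancing is enabled.
Demotion: under memory pressure, cold pages are moved from local DRAM to a slower memory tier such as CXL-attached memory instead of being swapped out or dropped. This happens asynchronously as part of reclaim, uses the memory_tier infrastructure to choose the target node, and is switched on through /sys/kernel/mm/numa/demotion_enabled.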
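Whether tiering is actually active on a given machine can be read from two kernel knobs: `/sys/kernel/mm/numa/demotion_enabled` for demotion, and bit value 2 (the memory-tiering mode) of `/proc/sys/kernel/numa_balancing` for promotion. The sketch below assumes a kernel recent enough to expose both. `tiering_status` is a hypothetical helper, and its optional root-prefix argument exists only for testing against a fake tree.

```shell
#!/bin/sh
# Report whether demotion and tiering-aware promotion are enabled.
tiering_status() {
    root=${1:-}
    demote_file="$root/sys/kernel/mm/numa/demotion_enabled"
    balance_file="$root/proc/sys/kernel/numa_balancing"
    demote=unknown
    promote=unknown
    if [ -r "$demote_file" ]; then
        demote=$(cat "$demote_file")          # "true" or "false"
    fi
    if [ -r "$balance_file" ]; then
        # numa_balancing is a bitmask; value 2 selects memory-tiering mode.
        if [ $(( $(cat "$balance_file") & 2 )) -ne 0 ]; then
            promote=on
        else
            promote=off
        fi
    fi
    echo "demotion=$demote promotion=$promote"
}

# Usage:
#   tiering_status
```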
Enabling and checking MGLRU
    # Check whether MGLRU is active; 0x0007 means all three components are enabled
    cat /sys/kernel/mm/lru_gen/enabled

    # Enable MGLRU if not already on (needs root)
    echo y | sudo tee /sys/kernel/mm/lru_gen/enabled

    # Make it persistent across reboots. lru_gen is a sysfs file, not a sysctl,
    # so use a systemd-tmpfiles rule (or a kernel built with CONFIG_LRU_GEN_ENABLED=y)
    echo 'w /sys/kernel/mm/lru_gen/enabled - - - - y' | sudo tee /etc/tmpfiles.d/mglru.conf

    # View per-memcg, per-node generation stats (debugfs, needs root)
    sudo cat /sys/kernel/debug/lru_gen
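The value read from `/sys/kernel/mm/lru_gen/enabled` is a bitmask. Per the kernel's multi-gen LRU admin guide, 0x0001 is the main switch, 0x0002 enables batched clearing of the accessed bit in leaf page-table entries, and 0x0004 does the same for non-leaf entries, so a fully enabled system reports 0x0007. A small decoder sketch follows; `lru_gen_decode` and the component labels it prints are names chosen for this example.

```shell
#!/bin/sh
# Decode the lru_gen "enabled" bitmask into its three components.
lru_gen_decode() {
    mask=$(( $1 ))    # accepts 0x0007 or 7
    for spec in "1:main_switch" "2:leaf_accessed_bit_batching" "4:nonleaf_accessed_bit"; do
        bit=${spec%%:*}
        name=${spec#*:}
        if [ $(( mask & bit )) -ne 0 ]; then state=on; else state=off; fi
        echo "$name=$state"
    done
}

# Usage:
#   lru_gen_decode "$(cat /sys/kernel/mm/lru_gen/enabled)"
```

A value of 0x0001 would mean MGLRU is on but without the page-table batching optimizations.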
Performance Modes in BIOS
Beyond XMP/EXPO, server and workstation firmware expose additional memory configuration modes that significantly impact Linux memory behavior.
| Mode | What it does | Best for |
|---|---|---|
| XMP / EXPO | Enables manufacturer-tuned overclocking profile: higher frequency, tighter sub-timings, higher voltage. | Gaming, workstations, latency-sensitive Linux workloads |
| Performance Mode | Aggressive IMC tuning: tighter command scheduling, 1T command rate where stable. | Single-socket high-frequency workloads |
| Memory Interleaving | Distributes physical addresses across channels/DIMMs to maximize bandwidth utilization. | Sequential bandwidth workloads: HPC, video encoding, ML training |
| Gear Down / Gear 2 | Gear Down Mode latches command/address signals on every other clock; Gear 2 runs the memory controller at half the DRAM clock. Both trade some latency for stability at high data rates. | DDR5-6400+ where Gear 1 is unstable |
| ECC Mode | Enables error detection and correction using extra DRAM bits. Slight bandwidth cost. | Server workloads, embedded, any production environment |
| Patrol Scrub | Background ECC scrubbing proactively reads and rewrites every memory cell on a schedule. | Long-running server environments with large DIMMs |
Runtime Performance Knobs
These are the Linux-side interventions that actually move the needle on memory performance — all tunable without a reboot, all reversible. Start here before touching BIOS settings.
Swappiness — how aggressively Linux reclaims RAM
The default swappiness of 60 means Linux is comfortable moving memory to swap even when RAM isn't critically full. For latency-sensitive workloads and in-memory databases, dropping it close to zero keeps data in RAM longer.
    # Check current value (default: 60)
    cat /proc/sys/vm/swappiness

    # Set immediately (no reboot needed)
    #   General servers / workstations: 10
    #   Databases / latency-critical: 1
    sysctl -w vm.swappiness=1

    # Make it permanent
    echo "vm.swappiness=1" >> /etc/sysctl.d/99-memory-perf.conf
    sysctl -p /etc/sysctl.d/99-memory-perf.conf
Hugepages — eliminating TLB pressure
Standard 4KB pages create enormous TLB pressure for large working-set workloads. 2MB hugepages reduce TLB misses dramatically. There are two approaches: Transparent Huge Pages (automatic, kernel-managed) and static hugepages (pre-allocated and guaranteed to apps that request them, for example via mmap with MAP_HUGETLB or a hugetlbfs mount). Databases typically prefer the madvise THP mode so they can opt in per allocation; the always mode can cause latency spikes during compaction.
    # Check current hugepage state
    grep -i huge /proc/meminfo
    cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    # THP mode options: always | madvise | never
    #   always:  kernel decides everywhere (good default for most workloads)
    #   madvise: only where the app calls madvise(MADV_HUGEPAGE); best for DBs
    #   never:   disabled (for real-time/latency-deterministic workloads)
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

    # Pre-allocate static 2MB hugepages (512 × 2MB = 1GB reserved)
    echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    # Persistent static hugepages via sysctl
    echo "vm.nr_hugepages=512" >> /etc/sysctl.d/99-memory-perf.conf
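The 512 in the example above comes from dividing the reservation by the page size. For other working-set sizes, the same arithmetic with rounding up can be sketched as follows. `hugepages_needed` is a hypothetical helper; it takes the working set in MiB and, optionally, the huge page size in KiB.

```shell
#!/bin/sh
# Number of huge pages needed to cover a working set, rounded up.
hugepages_needed() {
    # $1 = working set in MiB, $2 = huge page size in KiB (default 2048)
    ws_kib=$(( $1 * 1024 ))
    hp_kib=${2:-2048}
    echo $(( (ws_kib + hp_kib - 1) / hp_kib ))
}

hugepages_needed 1024            # 1 GiB in 2MB pages, prints 512
hugepages_needed 8193            # just over 8 GiB, prints 4097
hugepages_needed 8193 1048576    # same size in 1GB pages, prints 9
```

Static hugepages are removed from the general-purpose pool, so it pays to size the reservation to the application rather than guessing high.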
NUMA pinning — the highest-ROI tuning on multi-socket
Cross-NUMA-node memory accesses add 50–150ns of latency per access on typical dual-socket systems. Pinning a process to a node so its memory and CPUs are co-located is often the single biggest performance win available — and it costs nothing. Auto NUMA balancing helps but adds jitter; explicit pinning is better for latency-critical processes.
    # Check NUMA hit/miss counters; a high numa_miss or other_node share means pin your process
    numastat

    # Check auto NUMA balancing state (1 = on)
    cat /proc/sys/kernel/numa_balancing

    # Launch a process pinned to NUMA node 0 (CPUs + memory local)
    numactl --cpunodebind=0 --membind=0 ./myapp

    # Pin an already-running process (PID) to node 0's CPUs.
    # This changes CPU affinity only; pages already allocated stay where they are.
    taskset -pc "$(cat /sys/devices/system/node/node0/cpulist)" "$(pidof -s myapp)"

    # Interleave memory across all nodes (good for bandwidth-bound workloads)
    numactl --interleave=all ./myapp
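To confirm that pinning worked, it helps to see where a process's pages actually live. `/proc/<pid>/numa_maps` lists per-mapping page counts as `N<node>=<pages>` tokens, and the sketch below totals them per node. `numa_pages_by_node` is a hypothetical helper. The counts are in units of each mapping's own page size, so hugepage mappings are under-weighted relative to 4KB mappings.

```shell
#!/bin/sh
# Total the N<node>=<pages> tokens from numa_maps-format input on stdin.
numa_pages_by_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[kv[1]] += kv[2]
            }
    }
    END { for (n in pages) print n, pages[n] }' | sort
}

# Usage:
#   numa_pages_by_node < "/proc/$(pidof -s myapp)/numa_maps"
```

A process bound to node 0 should show nearly all of its pages under N0. A large count on another node usually means the memory was allocated before the binding took effect.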
In the author's experience across production Linux systems: NUMA pinning wins most often on multi-socket → hugepages next on large working-set apps → swappiness for anything memory-pressure-sensitive → MGLRU → then BIOS XMP/EXPO. Most engineers check BIOS first and Linux last. It should be the other way around.
What to Actually Do About It
The right intervention depends on where in the stack you're working. Here's the practical summary by audience.
Profile with perf mem and numastat before blaming DRAM timings. Hugepages and NUMA binding often win more than any BIOS toggle.
The memory_tier infrastructure and CXL hot/cold demotion are the new frontier. Linux controls orchestration, not raw electrical timing; the abstraction is intentional.
Monitor ce_count via EDAC. Enable patrol scrub on large DDR5 DIMMs. Watch tREFI settings: relaxed refresh on dense DIMMs can silently increase error rates.
The future of memory optimization is not tighter CAS numbers on a BIOS screen.
It is coordination — across firmware, memory controllers, Linux,
runtimes, accelerators, and workload orchestration.
Modern systems optimize topology, placement, tiering, bandwidth, DMA scheduling, and memory locality as a unified system — not a stack of independent layers. The engineers who understand all six layers have a durable advantage.