We have more thermal data than ever, but less coherence than we need. Modern systems expose temperatures, power limits, throttling state, clocks, board thermals, chassis airflow, and platform health across many layers. Every layer is useful. None of them, by itself, is enough.
The Problem
If you are debugging a server today, you often end up pulling data from several separate tools:
nvidia-smi, rocm-smi, sensors, /sys/class/thermal,
/sys/class/powercap, ipmitool, and sometimes dcgmi.
That works for one-off inspection, but it breaks down quickly when you want a stable, programmable, cross-vendor, multi-layer answer. The common failures are familiar:
- There is no single normalized output.
- Provenance gets lost once values are copied into scripts or dashboards.
- Runtime identity and telemetry identity are often disconnected.
- Silicon and environment data live in different tools.
- Cross-vendor comparison turns into CLI archaeology.
- Automation is built on brittle text parsing.
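The brittle-parsing failure mode is easy to reproduce. A typical ad hoc script scrapes a vendor CLI's CSV output; the sketch below uses `nvidia-smi --query-gpu`, which does exist, but the fragility is the point:

```python
import csv
import io
import subprocess

def parse_gpu_temps(csv_text):
    """Parse `nvidia-smi --query-gpu=index,temperature.gpu
    --format=csv,noheader,nounits` output into {gpu_index: temp_c}."""
    return {int(row[0]): int(row[1])
            for row in csv.reader(io.StringIO(csv_text)) if row}

def gpu_temps():
    # Breaks if the tool is missing, the column order changes,
    # or the CSV dialect shifts between driver versions.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_gpu_temps(out)
```

Note what is lost: units are implied, the column order is baked in, and nothing records where the numbers came from, which is exactly the provenance loss described above.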
How It Is Done Today
Thermal monitoring is usually done in one of four broad styles, each useful, each incomplete.
1. Vendor CLI First
Many teams start here. NVIDIA users rely on nvidia-smi. AMD users rely on
rocm-smi. Cluster operators may add dcgmi. This is good at device-level
introspection, but it does not solve system-wide correlation.
2. Linux Interface First
This path uses generic kernel surfaces such as hwmon, thermal_zone,
powercap, and cpufreq. It is broad within Linux, but often uneven across
platforms and generations.
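As a concrete example of this surface, hwmon temperature inputs can be enumerated in a few lines. This is a generic sketch (the `root` parameter is added here for testability; real sensors live under `/sys/class/hwmon`):

```python
import glob
import os

def hwmon_temps(root="/sys/class/hwmon"):
    """Read every temp*_input under hwmon chips, in degrees Celsius."""
    out = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
            with open(path) as f:
                # The kernel reports millidegrees Celsius.
                out[f"{chip}/{os.path.basename(path)}"] = int(f.read()) / 1000.0
        except OSError:
            continue  # sensors can vanish between glob and read
    return out
```

The unevenness shows up immediately: which chips exist, what they are named, and which `temp*` channels they expose all vary by platform and hardware generation.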
3. BMC / Platform First
Datacenter operators often start with ipmitool, Redfish, or vendor BMC dashboards. That
gives useful environmental and chassis context, but it usually does not explain enough about the CPU or GPU.
4. Ad Hoc Script Glue
Many organizations end up with shell scripts, Python glue, custom exporters, and host-specific dashboards. That solves the immediate problem, but rarely survives hardware churn cleanly.
What Thermal Observatory Is
thermal-observatory is a hardware-aware thermal observability framework for:
- CPU telemetry
- GPU telemetry
- board and chassis telemetry
- platform and fleet-facing integration points
It is not trying to replace every vendor tool. It is trying to become the normalization and correlation layer above them.
Normalize
Collect from many thermal surfaces and expose one shared snapshot model.
Preserve Provenance
Keep per-metric source, units, timestamp, and error fields visible to consumers.
Correlate
Bridge runtime identity, silicon telemetry, and datacenter environment signals.
Export
Support structured JSON, text output, and Prometheus-oriented monitoring workflows.
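Concretely, "per-metric provenance" means each value carries its own metadata instead of being a bare number. A minimal sketch of that shape (field names here are illustrative, not the repo's actual schema):

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class Metric:
    name: str
    value: Optional[float]
    unit: str          # explicit, never implied
    source: str        # which collector produced it: "nvml", "hwmon", ...
    timestamp: float   # when it was read, seconds since the epoch
    error: Optional[str] = None  # kept visible instead of dropping the metric

m = Metric("gpu0.temp", 61.0, "celsius", "nvml", time.time())
print(json.dumps(asdict(m)))
```

With this shape, a consumer can always answer "which tool said this, in what unit, and when", even after the value has passed through several scripts and dashboards.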
Why This Is Needed Now
Heterogeneous Compute Is Normal
Modern nodes mix x86 or ARM CPUs with NVIDIA or AMD accelerators, Linux thermal interfaces, and BMC-managed platform sensors. Thermal tooling that only understands one layer is now incomplete by default.
Throttling Is Multi-Layer
A slowdown is not always just “the GPU got hot.” It might be local accelerator heat, fan saturation, inlet temperature, board cooling issues, PSU behavior, or rack-level thermal stress. If you only inspect silicon telemetry, you miss the wider context.
Observability Systems Need Stable Machine Output
Humans can tolerate ad hoc CLIs. Monitoring systems cannot. For automation, alerting, and dashboards, you need stable JSON, explicit units, explicit provenance, and export paths like Prometheus textfile output.
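For the Prometheus path, "stable machine output" boils down to emitting the exposition format the textfile collector consumes: one sample per line, labels sorted for stable diffs. A hedged sketch (the metric and label names are assumptions, not the repo's actual output):

```python
def to_prometheus(samples):
    """Render (name, labels, value) tuples as Prometheus exposition text."""
    lines = []
    for name, labels, value in samples:
        rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{rendered}}} {value}")
    return "\n".join(lines) + "\n"

text = to_prometheus([
    ("thermal_temp_celsius", {"sensor": "gpu0", "source": "nvml"}, 61.0),
    ("thermal_temp_celsius", {"sensor": "cpu0", "source": "hwmon"}, 48.5),
])
print(text, end="")
```

Carrying provenance as a label means dashboards can distinguish an NVML reading from an hwmon fallback without any out-of-band knowledge.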
What It Addresses
The repo is designed to address a few specific gaps in the current tooling landscape.
- One structured snapshot instead of many incompatible commands.
- Per-metric provenance rather than hidden data lineage.
- Separation, on NVIDIA, between NVML telemetry and CUDA runtime correlation.
- Growth path from silicon-only telemetry to silicon-plus-environment telemetry.
- Fixture-backed testing through TM_SYSROOT so Linux collectors can be validated without live hardware.
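The fixture mechanism is essentially path redirection: collectors resolve their /sys reads under an optional sysroot. TM_SYSROOT is the repo's documented variable, but the helper below is an illustrative sketch of the pattern, not the repo's actual code:

```python
import os

def sys_path(rel):
    """Resolve an absolute /sys-style path under TM_SYSROOT, if set."""
    root = os.environ.get("TM_SYSROOT", "/")
    return os.path.join(root, rel.lstrip("/"))

def read_zone_temp(zone=0):
    """Temperature of a thermal_zone in Celsius (kernel stores millidegrees)."""
    with open(sys_path(f"/sys/class/thermal/thermal_zone{zone}/temp")) as f:
        return int(f.read().strip()) / 1000.0
```

A test then points TM_SYSROOT at a directory of captured fixture files, and the same collector code runs unchanged against real hardware or canned data.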
Architecture
At the center of the project is the normalized snapshot contract in include/thermal_monitor.h. That header models CPU packages, ARM clusters, NVIDIA GPUs, AMD GPUs, generic Linux sensors, board sensors, fans, and PSUs.
```mermaid
flowchart LR
    A["Linux interfaces<br/>hwmon / thermal_zone / powercap / cpufreq"] --> N["Normalized snapshot"]
    B["Vendor APIs<br/>NVML / ROCm SMI / CUDA runtime"] --> N
    C["Platform telemetry<br/>IPMI / Redfish"] --> N
    D["Fleet integration<br/>DCGM"] --> N
    N --> J["Structured JSON with per-metric provenance"]
    N --> P["Prometheus textfile/stdout export"]
    N --> T["Text output for humans"]
```
The repository does not assume that one source is always enough. Instead it layers sources into a shared model and keeps the source visible in the output.
[Diagram: collection layering. Primary sources: NVML, ROCm SMI, powercap, hwmon. Fallbacks: amdgpu hwmon, generic thermal_zone, runtime correlation.]
Why The Structure Matters
```
include/       Public snapshot model and API
src/cpu/       x86 and arm64 CPU collectors
src/gpu/       NVIDIA and AMD collectors
src/platform/  Linux paths, IPMI, Redfish scaffolds
src/format/    JSON, Prometheus, and text output
tests/         Fixture-backed validation
samples/       Example outputs
docs/          Design, architecture, release, and blog docs
```
This structure is important because it separates collection from formatting, keeps the snapshot model central, and leaves space for platform and vendor growth without collapsing into ad hoc script glue.
Why The Repo Matters
This project aims to occupy a missing layer in the systems stack.
Vendor Tools Are Necessary But Narrow
Vendor tools are strong at device-level introspection, but they are not trying to be a multi-vendor, multi-layer normalization framework.
Generic Linux Interfaces Are Broad But Uneven
Linux gives broad access, but not always consistent semantics across hardware generations.
Datacenter Operators Need Correlation
Operators increasingly need to answer questions like: is this node throttling because the GPU hotspot is high, because inlet air is hot, or because the board and PSU are thermally stressed? That is not a one-tool question.
Researchers Need Programmable Output
Performance engineers, kernel developers, and MLOps teams need structured output they can feed into dashboards, analyzers, and automation pipelines.
Current Capabilities
As of v0.1.0, the repo supports:
- versioned JSON schema
- per-metric provenance in JSON
- Prometheus export
- Linux fixture-backed collector testing
- NVIDIA telemetry via NVML
- NVIDIA runtime correlation via CUDA
- AMD telemetry via ROCm SMI and amdgpu fallback
- Linux thermal interfaces such as hwmon, thermal_zone, powercap, and cpufreq
- early datacenter telemetry scaffolds for IPMI, Redfish, and DCGM
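Because the schema is versioned, consumers can gate on the version before trusting the field layout. A hedged sketch of that check (the `schema_version` field name and the version value are assumptions about the schema, not taken from the repo):

```python
import json

SUPPORTED_SCHEMA_VERSIONS = {1}

def load_snapshot(text):
    """Parse a snapshot, refusing schema versions this consumer doesn't know."""
    snap = json.loads(text)
    version = snap.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version}")
    return snap
```

Failing loudly on an unknown version is what lets the schema evolve without silently corrupting downstream dashboards.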
The current release notes are documented in docs/releases-v0.1.0.md.
What It Does Not Claim Yet
It is important to be precise about the current state.
- It does not yet claim complete production hardening.
- It does not yet claim broad real-hardware validation across many machine classes.
- It does not yet claim full Redfish JSON support.
- It does not yet claim deep DCGM field ingestion.
- It does not yet claim AMD runtime-correlation parity with CUDA.
That honesty is one of the project’s strengths. It is explicit about what is implemented, what is scaffolded, and what still needs fleet validation.
Where It Could Go
Several next steps would turn the current alpha into a much stronger operational framework:
- real Redfish JSON parsing
- stronger IPMI normalization
- deeper DCGM field support
- ROCm runtime correlation to match the CUDA path
- real hardware sample captures from production-class systems
- correlation hints and diagnostics in the JSON output
- freshness and stale-data handling
Read The Repo
- Repository
- README
- Public Model
- Design Notes
- Architecture Notes
- Datacenter Telemetry Direction
- Release Notes
The project is still early, but it already shows why this layer is needed: not because we lack thermal interfaces, but because we lack a clean way to unify them.