We have more thermal data than ever, but less coherence than we need. Modern systems expose temperatures, power limits, throttling state, clocks, board thermals, chassis airflow, and platform health across many layers. Every layer is useful. None of them, by itself, is enough.
The Problem
If you are debugging a server today, you often end up pulling data from several separate tools:
nvidia-smi, rocm-smi, sensors, /sys/class/thermal,
/sys/class/powercap, ipmitool, and sometimes dcgmi.
That works for one-off inspection, but it breaks down quickly when you want a stable, programmable, cross-vendor, multi-layer answer. The common failures are familiar:
- There is no single normalized output.
- Provenance gets lost once values are copied into scripts or dashboards.
- Runtime identity and telemetry identity are often disconnected.
- Silicon and environment data live in different tools.
- Cross-vendor comparison turns into CLI archaeology.
- Automation is built on brittle text parsing.
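The brittle-parsing failure mode is easy to reproduce. A typical ad hoc script scrapes a vendor CLI's CSV output; the sketch below uses `nvidia-smi --query-gpu`, which does exist, but the fragility is the point:

```python
import csv
import io
import subprocess

def parse_gpu_temps(csv_text):
    """Parse `nvidia-smi --query-gpu=index,temperature.gpu
    --format=csv,noheader,nounits` output into {gpu_index: temp_c}."""
    return {int(row[0]): int(row[1])
            for row in csv.reader(io.StringIO(csv_text)) if row}

def gpu_temps():
    # Breaks if the tool is missing, the column order changes,
    # or the CSV dialect shifts between driver versions.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_gpu_temps(out)
```

Note what is lost: units are implied, the column order is baked in, and nothing records where the numbers came from, which is exactly the provenance loss described above.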
How It Is Done Today
Thermal monitoring is usually done in one of four broad styles, each useful, each incomplete.
1. Vendor CLI First
Many teams start here. NVIDIA users rely on nvidia-smi. AMD users rely on
rocm-smi. Cluster operators may add dcgmi. This is good at device-level
introspection, but it does not solve system-wide correlation.
2. Linux Interface First
This path uses generic kernel surfaces such as hwmon, thermal_zone,
powercap, and cpufreq. It is broad within Linux, but often uneven across
platforms and generations.
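As a concrete example of this surface, hwmon temperature inputs can be enumerated in a few lines. This is a generic sketch (the `root` parameter is added here for testability; real sensors live under `/sys/class/hwmon`):

```python
import glob
import os

def hwmon_temps(root="/sys/class/hwmon"):
    """Read every temp*_input under hwmon chips, in degrees Celsius."""
    out = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
            with open(path) as f:
                # The kernel reports millidegrees Celsius.
                out[f"{chip}/{os.path.basename(path)}"] = int(f.read()) / 1000.0
        except OSError:
            continue  # sensors can vanish between glob and read
    return out
```

The unevenness shows up immediately: which chips exist, what they are named, and which `temp*` channels they expose all vary by platform and hardware generation.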
3. BMC / Platform First
Datacenter operators often start with ipmitool, Redfish, or vendor BMC dashboards. That
gives useful environmental and chassis context, but it usually does not explain enough about the CPU or GPU.
4. Ad Hoc Script Glue
Many organizations end up with shell scripts, Python glue, custom exporters, and host-specific dashboards. That solves the immediate problem, but rarely survives hardware churn cleanly.
What Thermal Observatory Is
thermal-observatory is a hardware-aware thermal observability framework for:
- CPU telemetry
- GPU telemetry
- board and chassis telemetry
- platform and fleet-facing integration points
It is not trying to replace every vendor tool. It is trying to become the normalization and correlation layer above them.
Normalize
Collect from many thermal surfaces and expose one shared snapshot model.
Preserve Provenance
Keep per-metric source, units, timestamp, and error fields visible to consumers.
Correlate
Bridge runtime identity, silicon telemetry, and datacenter environment signals.
Export
Support structured JSON, text output, and Prometheus-oriented monitoring workflows.
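Concretely, "per-metric provenance" means each value carries its own metadata instead of being a bare number. A minimal sketch of that shape (field names here are illustrative, not the repo's actual schema):

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class Metric:
    name: str
    value: Optional[float]
    unit: str          # explicit, never implied
    source: str        # which collector produced it: "nvml", "hwmon", ...
    timestamp: float   # when it was read, seconds since the epoch
    error: Optional[str] = None  # kept visible instead of dropping the metric

m = Metric("gpu0.temp", 61.0, "celsius", "nvml", time.time())
print(json.dumps(asdict(m)))
```

With this shape, a consumer can always answer "which tool said this, in what unit, and when", even after the value has passed through several scripts and dashboards.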
Why This Is Needed Now
Heterogeneous Compute Is Normal
Modern nodes mix x86 or ARM CPUs with NVIDIA or AMD accelerators, Linux thermal interfaces, and BMC-managed platform sensors. Thermal tooling that only understands one layer is now incomplete by default.
Throttling Is Multi-Layer
A slowdown is not always just “the GPU got hot.” It might be local accelerator heat, fan saturation, inlet temperature, board cooling issues, PSU behavior, or rack-level thermal stress. If you only inspect silicon telemetry, you miss the wider context.
Observability Systems Need Stable Machine Output
Humans can tolerate ad hoc CLIs. Monitoring systems cannot. For automation, alerting, and dashboards, you need stable JSON, explicit units, explicit provenance, and export paths like Prometheus textfile output.
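For the Prometheus path, "stable machine output" boils down to emitting the exposition format the textfile collector consumes: one sample per line, labels sorted for stable diffs. A hedged sketch (the metric and label names are assumptions, not the repo's actual output):

```python
def to_prometheus(samples):
    """Render (name, labels, value) tuples as Prometheus exposition text."""
    lines = []
    for name, labels, value in samples:
        rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{rendered}}} {value}")
    return "\n".join(lines) + "\n"

text = to_prometheus([
    ("thermal_temp_celsius", {"sensor": "gpu0", "source": "nvml"}, 61.0),
    ("thermal_temp_celsius", {"sensor": "cpu0", "source": "hwmon"}, 48.5),
])
print(text, end="")
```

Carrying provenance as a label means dashboards can distinguish an NVML reading from an hwmon fallback without any out-of-band knowledge.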
What It Addresses
The repo is designed to address a few specific gaps in the current tooling landscape.
- One structured snapshot instead of many incompatible commands.
- Per-metric provenance rather than hidden data lineage.
- Separation, on NVIDIA, between NVML telemetry and CUDA runtime correlation.
- Growth path from silicon-only telemetry to silicon-plus-environment telemetry.
- Fixture-backed testing through TM_SYSROOT so Linux collectors can be validated without live hardware.
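The fixture mechanism is essentially path redirection: collectors resolve their /sys reads under an optional sysroot. TM_SYSROOT is the repo's documented variable, but the helper below is an illustrative sketch of the pattern, not the repo's actual code:

```python
import os

def sys_path(rel):
    """Resolve an absolute /sys-style path under TM_SYSROOT, if set."""
    root = os.environ.get("TM_SYSROOT", "/")
    return os.path.join(root, rel.lstrip("/"))

def read_zone_temp(zone=0):
    """Temperature of a thermal_zone in Celsius (kernel stores millidegrees)."""
    with open(sys_path(f"/sys/class/thermal/thermal_zone{zone}/temp")) as f:
        return int(f.read().strip()) / 1000.0
```

A test then points TM_SYSROOT at a directory of captured fixture files, and the same collector code runs unchanged against real hardware or canned data.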
Architecture
At the center of the project is the normalized snapshot contract in include/thermal_monitor.h. That header models CPU packages, ARM clusters, NVIDIA GPUs, AMD GPUs, generic Linux sensors, board sensors, fans, and PSUs.
```mermaid
flowchart LR
    A["Linux interfaces<br/>hwmon / thermal_zone / powercap / cpufreq"] --> N["Normalized snapshot"]
    B["Vendor APIs<br/>NVML / ROCm SMI / CUDA runtime"] --> N
    C["Platform telemetry<br/>IPMI / Redfish"] --> N
    D["Fleet integration<br/>DCGM"] --> N
    N --> J["Structured JSON with per-metric provenance"]
    N --> P["Prometheus textfile/stdout export"]
    N --> T["Text output for humans"]
```
The repository does not assume that one source is always enough. Instead it layers sources into a shared model and keeps the source visible in the output.
[Diagram: collection layering. Primary sources: NVML, ROCm SMI, powercap, hwmon. Fallbacks: amdgpu hwmon, generic thermal_zone, runtime correlation.]
Why The Structure Matters
```
include/       Public snapshot model and API
src/cpu/       x86 and arm64 CPU collectors
src/gpu/       NVIDIA and AMD collectors
src/platform/  Linux paths, IPMI, Redfish scaffolds
src/format/    JSON, Prometheus, and text output
tests/         Fixture-backed validation
samples/       Example outputs
docs/          Design, architecture, release, and blog docs
```
This structure is important because it separates collection from formatting, keeps the snapshot model central, and leaves space for platform and vendor growth without collapsing into ad hoc script glue.
Why The Repo Matters
This project aims to occupy a missing layer in the systems stack.
Vendor Tools Are Necessary But Narrow
Vendor tools are strong at device-level introspection, but they are not trying to be a multi-vendor, multi-layer normalization framework.
Generic Linux Interfaces Are Broad But Uneven
Linux gives broad access, but not always consistent semantics across hardware generations.
Datacenter Operators Need Correlation
Operators increasingly need to answer questions like: is this node throttling because the GPU hotspot is high, because inlet air is hot, or because the board and PSU are thermally stressed? That is not a one-tool question.
Researchers Need Programmable Output
Performance engineers, kernel developers, and MLOps teams need structured output they can feed into dashboards, analyzers, and automation pipelines.
Current Capabilities
As of v0.1.0, the repo supports:
- versioned JSON schema
- per-metric provenance in JSON
- Prometheus export
- Linux fixture-backed collector testing
- NVIDIA telemetry via NVML
- NVIDIA runtime correlation via CUDA
- AMD telemetry via ROCm SMI and amdgpu fallback
- Linux thermal interfaces such as hwmon, thermal_zone, powercap, and cpufreq
- early datacenter telemetry scaffolds for IPMI, Redfish, and DCGM
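Because the schema is versioned, consumers can gate on the version before trusting the field layout. A hedged sketch of that check (the `schema_version` field name and the version value are assumptions about the schema, not taken from the repo):

```python
import json

SUPPORTED_SCHEMA_VERSIONS = {1}

def load_snapshot(text):
    """Parse a snapshot, refusing schema versions this consumer doesn't know."""
    snap = json.loads(text)
    version = snap.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version}")
    return snap
```

Failing loudly on an unknown version is what lets the schema evolve without silently corrupting downstream dashboards.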
The current release notes are documented in docs/releases-v0.1.0.md.
What It Does Not Claim Yet
It is important to be precise about the current state.
- It does not yet claim complete production hardening.
- It does not yet claim broad real-hardware validation across many machine classes.
- It does not yet claim full Redfish JSON support.
- It does not yet claim deep DCGM field ingestion.
- It does not yet claim AMD runtime-correlation parity with CUDA.
That honesty is one of the project’s strengths. It is explicit about what is implemented, what is scaffolded, and what still needs fleet validation.
Where It Could Go
Several next steps would turn the current alpha into a much stronger operational framework:
- real Redfish JSON parsing
- stronger IPMI normalization
- deeper DCGM field support
- ROCm runtime correlation to match the CUDA path
- real hardware sample captures from production-class systems
- correlation hints and diagnostics in the JSON output
- freshness and stale-data handling
Read The Repo
- Repository
- README
- Public Model
- Design Notes
- Architecture Notes
- Datacenter Telemetry Direction
- Release Notes
The project is still early, but it already shows why this layer is needed: not because we lack thermal interfaces, but because we lack a clean way to unify them.