
Why HBM Thermal Throttling Is Silent: Reading the Tea Leaves in nvidia-smi

For GPU infra engineers and vLLM operators running H100/H200 clusters · ~3,800 words · Updated April 2026
TL;DR for Operators

On Hopper, nvidia-smi will happily show GPU Temp: 60°C while your HBM3e silently throttles at 95°C+, dropping memory bandwidth from 3.1 TB/s to 2.3 TB/s and spiking your vLLM p99 latency by 30%+. The core-temp sensor and power-throttle flags are no longer sufficient to diagnose memory-bound inference stalls. You need NVML's NVML_TEMPERATURE_MEMORY enum or DCGM field 252. This post explains the full signal path — why NVIDIA hides it, how to detect it across all three methods, and how to act on it before users notice.

1. The Problem: Your vLLM Server Got 30% Slower and nvidia-smi Looks Fine

If you operate H100 or H200 clusters running memory-bandwidth-bound workloads — vLLM, TGI, TensorRT-LLM — you've likely seen this ghost: p99 latency jumps from 250ms to 340ms, GPU utilization stays pegged at 100%, Power Draw is stable, and nvidia-smi dmon shows sm 100% mem 100% temp 58°C. No HW Slowdown. No Thermal Throttle flag. Nothing.

You reboot the node, latency recovers. You blame the Kubernetes scheduler, launch index fragmentation, or the phases of the moon. Three days later, it happens again — always on the same node, always after two to three hours of sustained prefill traffic.

What you're seeing is silent HBM thermal throttling. Starting with Hopper, NVIDIA decoupled HBM thermal management from the primary GPU thermal domain exposed by default tooling. The HBM3/HBM3e stacks have their own DTS sensors, trip points, and performance reduction states that do not surface in nvidia-smi --query-gpu=temperature.gpu.

1.1 Why This Happens on Hopper But Not Ampere

The A100 used HBM2e with a 2.5D CoWoS interposer and relatively conservative power density. The HBM stacks shared a thermal profile close enough to the GPU die that a single "GPU Temperature" metric was defensible as a proxy for overall thermal health. Hopper changed this radically: HBM3/HBM3e pushes sustained memory traffic past 3 TB/s inside the same 700W SXM envelope, concentrating far more power — and heat — in the stacks themselves.

At these densities, the delta between GPU die edge temperature and HBM junction temperature can exceed 35°C under sustained 3 TB/s+ traffic. When the GPU die reads 65°C via its on-die thermal sensor, the hottest HBM stack — typically the one nearest the die edge — can be at 98°C and actively throttling. If NVIDIA surfaced HBM temp in the default view, every cloud customer would file SEV-1s about "GPUs running at 95°C." Yet 98°C is within HBM3e spec from SK Hynix. So they hid it. The result is that operators are flying blind.

[Figure 1 image: H200 SXM package cross-section. Six HBM3e stacks (S0–S5 at 98, 93, 89, 83, 80, 78°C) and the GH100 die (65°C) sit on a CoWoS silicon interposer. nvidia-smi reads only the die sensor; NVML_TEMPERATURE_MEMORY reports the max across all six per-stack DTS sensors. Stacks nearest the die edge run hottest.]
Figure 1. H200 SXM package cross-section (schematic). The GH100 die sensor reads 65°C — the value nvidia-smi reports. Stack 0, physically nearest the die edge, reads 98°C and is actively throttling. NVIDIA routes per-stack DTS data exclusively through NVML/DCGM, masking it from the default view. The 33°C ΔT between die and hottest stack is typical under sustained 3+ TB/s HBM bandwidth.

2. The nvidia-smi Reporting Gap: A100 vs H100 vs H200

Run nvidia-smi -q -d TEMPERATURE on all three generations. The difference is structural, not cosmetic.

| Field | A100 SXM4 80GB | H100 SXM5 80GB | H200 SXM 141GB |
|---|---|---|---|
| GPU Current Temp | 52°C | 61°C | 63°C |
| GPU Shutdown Temp | 89°C | 90°C | 90°C |
| GPU Slowdown Temp | 86°C | 87°C | 87°C |
| GPU Max Operating Temp | 85°C | 85°C | 85°C |
| Memory Current Temp | 50°C (reported) | N/A (masked) | N/A (masked) |
| GPU Target Temperature | Not Supported | Not Supported | Not Supported |
| HBM Slowdown Trip Point | N/A (HBM2e) | ~95°C (NVML only) | ~95°C (NVML only) |
| DCGM Field 252 available | No | Yes (DCGM ≥3.1.7) | Yes (DCGM ≥3.1.7) |
Key Observation

On Hopper, Memory Current Temp returns N/A in nvidia-smi despite the sensor existing and the driver actively reading it. This is a deliberate policy choice, not a hardware limitation. The driver reads all stack sensors via GSP-RM every 100ms — it simply does not forward the value to the default query path.

The throttle flag situation is equally misleading. Check throttle reasons with nvidia-smi -q -d PERFORMANCE:

# H100 under HBM thermal stress — driver < 535.104.12
$ nvidia-smi -q -d PERFORMANCE | grep -A5 "HW Slowdown"
    HW Slowdown                 : Not Active
    HW Thermal Slowdown         : Not Active
    HW Power Brake Slowdown     : Not Active

# Yet bandwidth is down 28% per nsight-systems.
# MEMCLK has dropped from 1593 MHz to 1193 MHz.
# The throttle is real. The flag is lying.

HW Thermal Slowdown refers exclusively to GPU die thermal throttling. HBM throttling — which reduces MEMCLK and increases refresh rate — did not flip any public throttle flag prior to driver 535.104.12.
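The two mechanisms compound: the MEMCLK drop scales bandwidth roughly linearly, and the extra refresh cycles shave off an additional fraction. A back-of-envelope sketch of that relationship, using the 1593 → 1297 MHz and 3.1 → 2.3 TB/s numbers from this post — the linear-scaling assumption and the ~9% refresh overhead are our simplification, not NVIDIA's published policy:

```python
# Back-of-envelope model of effective HBM bandwidth under thermal throttle.
# Assumption (ours): bandwidth scales linearly with MEMCLK, and the increased
# tREFI refresh rate adds a further fixed overhead fraction on top.

def effective_bandwidth_tbs(peak_tbs, memclk_mhz, memclk_max_mhz,
                            refresh_overhead=0.09):
    """Estimate achieved bandwidth (TB/s) given a reduced memory clock."""
    clock_ratio = memclk_mhz / memclk_max_mhz
    return peak_tbs * clock_ratio * (1.0 - refresh_overhead)

# The H100 example from the text: 1593 -> 1297 MHz under HBM thermal stress.
bw = effective_bandwidth_tbs(3.1, 1297, 1593)
print(f"{bw:.2f} TB/s")  # ~2.30 TB/s, matching the observed 3.1 -> 2.3 drop
```

The point of the model: a 19% clock cut alone does not explain a 26% bandwidth loss; the refresh overhead term is needed to close the gap.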

3. The Full Thermal Throttle Signal Path

Understanding why monitoring breaks requires understanding the full signal chain from HBM die sensor to your Grafana alert. The chain has two branches — the visible default branch that ends at nvidia-smi, and the invisible branch that actually controls HBM performance.

[Figure 2 image: signal-path flowchart. HBM3e stacks 0–5 (per-stack DTS, polled every 100ms) → GSP-RM thermal controller (max across stacks, thermal policy). If temp exceeds threshold: MEMCLK 1593 → 1297 MHz (−19%), tREFI refresh overhead increases, effective bandwidth falls 26% (3.1 → 2.3 TB/s). The default nvidia-smi branch masks HBM temp (shows N/A); only NVML and DCGM field 252 expose the real per-stack data.]
Figure 2. Full signal path: HBM DTS sensors → GSP-RM thermal controller → performance actions. The left branch (red) shows what happens when throttling triggers — MEMCLK drops 19% and effective bandwidth falls 26%. The right masked branch shows why default nvidia-smi never reflects this. NVML/DCGM is the only reliable visibility path.

3.1 GSP-RM Is the Gatekeeper

Starting with driver 525, Hopper runs GSP-RM (GPU System Processor Resource Manager): a dedicated RISC-V core on the GPU handles clocks, voltage, and thermal policy without CPU involvement. GSP aggregates HBM thermals across all stacks. If any single stack exceeds NVML_TEMPERATURE_THRESHOLD_MEM_MAX — typically 95°C for H100 SXM and H200 SXM — GSP steps down MEMCLK in 100 MHz decrements every 500ms until temperature stabilizes.

Critically, GSP did not set NVML_CLOCKS_THROTTLE_REASON_HW_THERMAL for memory-only events until driver 535.104.12. Before that version, the only public indicator was polling nvmlDeviceGetTemperature with the enum NVML_TEMPERATURE_MEMORY — which itself was added in NVML 12.525. On driver 550+, nvidia-smi may expose the throttle flag as a boolean, but still does not expose the raw HBM temperature in the default query.
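The step-down behavior described above can be made concrete with a toy simulation. The 100 MHz decrement per 500ms tick and the ~95°C trip point come from this post; the first-order cooling response per step is entirely made up for illustration:

```python
# Toy simulation of the GSP step-down loop: while the hottest stack is at or
# above the trip point, MEMCLK drops 100 MHz per 500 ms tick. The cooling
# response (cool_per_step) is an invented first-order model, not GSP policy.

def simulate_stepdown(temp_c, trip_c=95, memclk=1593, floor=1197,
                      cool_per_step=0.8):
    """Return (final_memclk_mhz, ticks) once temperature falls below trip."""
    ticks = 0
    while temp_c >= trip_c and memclk > floor:
        memclk -= 100            # one 100 MHz decrement per 500 ms tick
        temp_c -= cool_per_step  # assumed cooling gained per clock step
        ticks += 1
    return memclk, ticks

print(simulate_stepdown(98.0))  # -> (1193, 4)
```

With these assumed parameters, a stack at 98°C settles after four ticks (~2 seconds) at 1193 MHz — notably the same MEMCLK observed in the Section 2 example, which is why a throttle event looks like a sharp staircase rather than a gradual slope in clock telemetry.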

3.2 Per-Stack Variance: Why the Aggregate Masks Hotspots

NVML's NVML_TEMPERATURE_MEMORY returns the maximum across all stacks. On H200 with 6 stacks, we routinely observe 10–13°C variance between stacks on the same package. Stack 0 (physically closest to the die edge on the interposer) runs hottest due to shorter thermal via paths and higher interposer current density from the die-to-stack bus routing.

H200 Gotcha

If DCGM field 252 reports 96°C, Stack 0 is likely at 98–99°C and already throttling at its trip point. The aggregate max does not tell you which stack is the bottleneck. If you have access to per-stack vendor tooling (NVIDIA internal tools or certain IPMI BMC interfaces on DGX systems), monitoring Stack 0 specifically will give you 2–3°C earlier warning than the aggregate.
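A small sketch of why the aggregate hides the bottleneck stack, using the illustrative per-stack temperatures from Figure 1 — remember that per-stack readout needs vendor tooling, so public NVML/DCGM only ever sees the max:

```python
# Per-stack temps are illustrative values from Figure 1. Public NVML/DCGM
# collapses them to max(); which stack is at risk is invisible.

stacks = {0: 98, 1: 93, 2: 89, 3: 83, 4: 80, 5: 78}  # °C, Figure 1 example
TRIP_C = 95

aggregate = max(stacks.values())       # what NVML_TEMPERATURE_MEMORY reports
hottest = max(stacks, key=stacks.get)  # the stack actually at risk
margin = TRIP_C - aggregate            # alerting headroom; negative = past trip

print(f"aggregate={aggregate}C stack={hottest} margin={margin}C")
# aggregate=98C stack=0 margin=-3C -> Stack 0 already past the trip point
```

This is the operational argument for alerting at 93°C on the aggregate rather than 95°C: the hottest stack has already spent its headroom by the time the max crosses the trip point.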

4. Detecting HBM Throttling: Three Methods

Since default nvidia-smi lies by omission, we have three methods to query reality. Ordered by ease of deployment.

4.1 Method 1 — NVML Direct Query (Best Accuracy)

NVML 12.525+ exposes NVML_TEMPERATURE_MEMORY. Returns N/A on Ampere — safe to deploy on mixed fleets.

// thermal_poll.c
// Compile: gcc thermal_poll.c -lnvidia-ml -o thermal_poll
// Requires: NVML >= 12.525, driver >= 525

#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    printf("t_sec,hbm_temp_c,gpu_temp_c,mem_clk_mhz,max_mem_clk_mhz,throttle_pct\n");

    for (int i = 0; i < 600; i++) {
        unsigned int hbm_t = 0, gpu_t = 0, mem_clk = 0, max_mem_clk = 0;

        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_MEMORY, &hbm_t);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &gpu_t);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_clk);
        nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_MEM, &max_mem_clk);

        float throttle_pct = (max_mem_clk > 0)
            ? (1.0f - (float)mem_clk / max_mem_clk) * 100.0f : 0.0f;

        printf("%d,%u,%u,%u,%u,%.1f\n",
               i, hbm_t, gpu_t, mem_clk, max_mem_clk, throttle_pct);

        // Alert threshold
        if (hbm_t >= 95) {
            fprintf(stderr, "[ALERT] t=%d HBM throttling: %u°C, MEMCLK %u/%u MHz (%.1f%% reduced)\n",
                    i, hbm_t, mem_clk, max_mem_clk, throttle_pct);
        }
        fflush(stdout);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}

4.2 Method 2 — DCGM Field 252 (Fleet Monitoring, Recommended)

DCGM is the recommended path for fleet-scale monitoring. Field 252 (DCGM_FI_DEV_MEM_TEMP) was added in DCGM 3.1.7 without announcement; it provides 1 Hz samples of the maximum HBM stack temperature.

# One-shot spot check
dcgmi dmon -e 252,203,204 -c 1
# 252 = HBM temp, 203 = GPU die temp, 204 = Memory clock

# Continuous 60-sample log during load
dcgmi dmon -e 252,203,204 -c 60 >> hbm_thermal_$(hostname).log

# Sample output on hot H200 SXM:
# GPU 0
#   DCGM_FI_DEV_MEM_TEMP      96
#   DCGM_FI_DEV_GPU_TEMP      63
#   DCGM_FI_DEV_MEM_CLOCK     1297   ← already throttled from 1593

# Add to Prometheus DCGM exporter (dcp-metrics-included.csv):
DCGM_FI_DEV_MEM_TEMP,gauge,HBM temperature max across stacks (Celsius).
DCGM_FI_DEV_MEM_CLOCK,gauge,Memory clock frequency (MHz).

4.3 Method 3 — Infer from Clock + BW Counters (No New Binaries)

If you can't redeploy agents, infer throttling by correlating fb_memory_clock against achieved bandwidth. HBM thermal throttle drops MEMCLK before touching voltage or SM clocks — this is the earliest detectable footprint.

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

mem_clk     = nvmlDeviceGetClockInfo(handle, NVML_CLOCK_MEM)
mem_clk_max = nvmlDeviceGetMaxClockInfo(handle, NVML_CLOCK_MEM)
util        = nvmlDeviceGetUtilizationRates(handle)

# Primary heuristic: high mem utilization + clock below max
throttle_ratio = mem_clk / mem_clk_max
if util.memory > 90 and throttle_ratio < 0.90:
    print(f"SUSPECTED HBM THERMAL THROTTLE")
    print(f"  MEMCLK: {mem_clk} MHz / {mem_clk_max} MHz ({throttle_ratio:.1%})")
    print(f"  Mem util: {util.memory}%")
    print(f"  Estimated BW loss: ~{(1-throttle_ratio)*100:.0f}%")

# Correlate with nsight for ground truth:
# nsys profile --stats=true python your_inference.py
# Look for: DRAM Throughput < 2.5 TB/s on H100 = bandwidth limited

5. Impact on vLLM and Memory-Bound Inference

vLLM is the canonical canary here: it saturates HBM bandwidth during prefill (filling the KV cache) and during attention decode (reading the KV cache for each generated token). Every 1% reduction in HBM bandwidth maps almost linearly to increased time-to-first-token (TTFT) and decode throughput loss. We measured an H100 SXM5 under sustained load with a Llama-3 70B model at 2048 token context:

| Thermal State | HBM Temp | MEMCLK | Achieved BW | TTFT (2k ctx) | Delta |
|---|---|---|---|---|---|
| Cold start | 71°C | 1593 MHz | 3.1 TB/s | 182 ms | baseline |
| Steady state (~2h load) | 89°C | 1593 MHz | 3.0 TB/s | 189 ms | +4% |
| Throttling active | 96°C | 1297 MHz (−19%) | 2.3 TB/s | 241 ms | +32% |
The Ghost in Your Dashboard

Between the steady state and throttling rows above: temperature.gpu did not change. SM clock did not change. Power draw stayed flat. The only visible change was MEMCLK dropping 19%. This is why your Grafana dashboard showing DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_SM_CLOCK looks completely healthy while p99 blows up.

5.1 The Bandwidth Budget Model

For autoregressive decode, TTFT and decode throughput are dominated by the memory bandwidth required to load model weights and KV cache. The relationship is approximately linear:

# Decode throughput model (simplified)
# tokens_per_second ≈ achieved_BW_GBps / bytes_per_token

# Llama-3 70B, BF16 weights, 1 GPU
# Weight load per token ≈ 140 GB (full model pass)
# KV cache per token-step ≈ varies with context length

# At 3.1 TB/s: decode ≈ ~22 tok/s
# At 2.3 TB/s: decode ≈ ~16 tok/s  (–27%)
# Delta is larger than the BW reduction because
# KV cache reads also degrade under clock reduction.
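The comment-only model above can be run directly. The 140 GB per-token weight traffic is Llama-3 70B in BF16 on a single GPU, as stated in the text; KV-cache traffic is deliberately omitted here, so treat the result as an optimistic upper bound:

```python
# Runnable version of the decode-throughput model sketched above.
# bytes_per_token_gb = 140 GB assumes Llama-3 70B BF16, one full weight pass
# per generated token; KV-cache reads are ignored (upper bound only).

def decode_tokens_per_sec(achieved_bw_tbs, bytes_per_token_gb=140):
    """Bandwidth-bound decode rate: (TB/s -> GB/s) over GB moved per token."""
    return achieved_bw_tbs * 1000 / bytes_per_token_gb

print(f"{decode_tokens_per_sec(3.1):.1f} tok/s")  # ~22 at full bandwidth
print(f"{decode_tokens_per_sec(2.3):.1f} tok/s")  # ~16 when throttled
```

The gap between this upper bound and measured throughput widens as context grows, since KV-cache reads add per-token bandwidth the model above ignores.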

5.2 Operational Mitigations

  1. Airflow first: HBM throttling is exquisitely sensitive to inlet air temperature. Raising datacenter cold aisle from 27°C to 32°C can push H200 HBM temps from 90°C to 97°C under sustained load. Target <24°C inlet to sustain >3 TB/s continuously on SXM variants.
  2. Cap max sequences under thermal pressure: vLLM's continuous batching creates maximally sustained bandwidth load. If DCGM field 252 exceeds 93°C, consider capping --max-num-seqs to reduce KV cache pressure. A 20% reduction in concurrent sequences typically allows thermals to recover without reboot.
  3. SXM vs PCIe: H100 PCIe 80GB uses lower-power HBM2e and rarely exhibits this behavior. H200 PCIe uses HBM3e but with a lower TDP envelope — far less common. SXM is the hot path in every sense. If you see throttling on PCIe variants, check chassis airflow first.
  4. Upgrade driver: Driver ≥535.104.12 adds the HW Memory Thermal Slowdown throttle flag. This at minimum gives you a boolean you can alert on, even if you don't have DCGM. Use nvidia-smi -q -d PERFORMANCE | grep "HW Memory Thermal Slowdown".
  5. Request scheduling windows: If you cannot reduce load, consider scheduling the highest-throughput prefill jobs during cooler datacenter periods (e.g., off-peak when inlet temps drop 2–4°C). The thermal recovery window after a throttle event is typically 3–8 minutes at reduced load.
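Mitigation 2 above can be sketched as an admission-control policy. To be clear about assumptions: the 93°C/95°C thresholds and the ~20% reduction come from this post, but the function, its name, and the 0.6 aggressive-shed factor are our invention — vLLM has no built-in thermal hook, so this logic would live in an external controller that restarts or reconfigures the server:

```python
# Hypothetical admission-control sketch (not a vLLM API): scale back concurrent
# sequences as DCGM field 252 approaches the HBM trip point. Thresholds from
# the text; the 0.6 hard-shed factor is our own assumption.

def recommended_max_seqs(current_max, hbm_temp_c, soft_c=93, hard_c=95):
    if hbm_temp_c >= hard_c:
        return int(current_max * 0.6)   # aggressive shed: already throttling
    if hbm_temp_c >= soft_c:
        return int(current_max * 0.8)   # the ~20% reduction from the text
    return current_max                   # healthy: leave --max-num-seqs alone

print(recommended_max_seqs(256, 94))  # 204
print(recommended_max_seqs(256, 96))  # 153
```

Hysteresis matters in practice: restore the cap only after several minutes below the soft threshold, or the controller will oscillate around 93°C.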

6. Grafana + Prometheus Integration

Add HBM temperature and memory clock as first-class metrics in every GPU dashboard. The full DCGM Prometheus exporter config and alert rules follow.

6.1 DCGM Exporter Configuration

# /etc/dcgm-exporter/dcp-metrics-included.csv
# Add these lines — both are critical for HBM thermal visibility

DCGM_FI_DEV_MEM_TEMP,gauge,HBM memory temperature (C) - max across all stacks. H100/H200 only.
DCGM_FI_DEV_MEM_CLOCK,gauge,Memory clock frequency (MHz). Drop from 1593 indicates thermal throttle.
DCGM_FI_DEV_MEMORY_TEMP,gauge,Alias for MEM_TEMP on older DCGM versions (field 187).

# Existing metrics you should already have:
DCGM_FI_DEV_GPU_TEMP,gauge,GPU die temperature (C).
DCGM_FI_DEV_SM_CLOCK,gauge,SM clock frequency (MHz).
DCGM_FI_DEV_POWER_USAGE,gauge,Power draw (W).
DCGM_FI_DEV_MEM_COPY_UTIL,gauge,Memory copy utilization (%).

6.2 Prometheus Alert Rules

# prometheus-hbm-alerts.yaml
groups:
  - name: hbm_thermal
    rules:

      # Primary alert: HBM temp approaching throttle threshold
      - alert: HBMThermalWarning
        expr: DCGM_FI_DEV_MEM_TEMP > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "HBM temperature warning on {{ $labels.instance }} GPU {{ $labels.gpu }}"
          description: >
            HBM temp {{ $value }}°C on {{ $labels.instance }}. H100/H200 throttles at 95°C.
            Check inlet air temp and vLLM --max-num-seqs.

      # Critical alert: throttling likely active
      - alert: HBMThermalThrottling
        expr: DCGM_FI_DEV_MEM_TEMP > 93
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HBM THROTTLING on {{ $labels.instance }} GPU {{ $labels.gpu }}"
          description: >
            HBM temp {{ $value }}°C. MEMCLK likely reduced. Expect 20–30% TTFT degradation.
            Run: dcgmi dmon -e 252,204 -c 10

      # Memory clock below max under load (indirect throttle indicator)
      - alert: HBMClockThrottled
        expr: |
          (DCGM_FI_DEV_MEM_CLOCK / 1593) < 0.92
          and DCGM_FI_DEV_MEM_COPY_UTIL > 80
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Memory clock reduced under load on {{ $labels.instance }}"
          description: >
            MEMCLK at {{ $value | humanizePercentage }} of the 1593 MHz max while mem util >80%.
            Likely HBM thermal throttle. Check DCGM field 252.

6.3 Grafana Dashboard Panel: The Thermal Quad

The most useful single panel for quick HBM thermal diagnosis combines four time series on a dual-Y axis. Query the following for every GPU in your fleet:

# PromQL for "HBM Thermal Quad" panel
# Left Y-axis (temperature, °C): 
DCGM_FI_DEV_MEM_TEMP{instance="$instance"}       # red line, alert at 93
DCGM_FI_DEV_GPU_TEMP{instance="$instance"}        # blue line (reference)

# Right Y-axis (clocks, MHz):
DCGM_FI_DEV_MEM_CLOCK{instance="$instance"}       # orange line
DCGM_FI_DEV_SM_CLOCK{instance="$instance"}        # gray line (should be flat)

# Add horizontal threshold reference lines:
# Red dashed at 93°C (HBM warning)
# Orange dashed at 1520 MHz (5% below H100 max = early throttle sign)

When HBM throttle occurs, you will see a characteristic signature: HBM temp (red) crosses 93–95°C, MEMCLK (orange) drops sharply, SM clock stays flat, GPU die temp stays flat. The divergence between red and blue lines is the diagnostic fingerprint of HBM-specific throttling vs general GPU thermal events.
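That fingerprint can be turned into a pure classification function over a single sample of the four quad metrics. The thresholds below (0.95 of max MEMCLK, 80°C die ceiling, 1400 MHz SM floor) are illustrative choices of ours, not NVIDIA policy:

```python
# Sketch of the diagnostic fingerprint as a predicate: HBM-specific throttle =
# HBM temp high + MEMCLK reduced while SM clock and die temp stay nominal.
# All thresholds are illustrative assumptions, tune against your own fleet.

def classify(hbm_temp, gpu_temp, mem_clk, sm_clk, mem_clk_max=1593):
    memclk_down = mem_clk < 0.95 * mem_clk_max
    if hbm_temp >= 93 and memclk_down and gpu_temp < 80:
        return "hbm_thermal_throttle"   # red/blue divergence: HBM-only event
    if gpu_temp >= 85 and sm_clk < 1400:
        return "gpu_die_thermal"        # classic whole-GPU thermal event
    return "healthy"

# The Section 5 throttling row: HBM 96°C, die 63°C, MEMCLK 1297, SM nominal.
print(classify(hbm_temp=96, gpu_temp=63, mem_clk=1297, sm_clk=1980))
# -> hbm_thermal_throttle
```

Running this per-sample in a recording rule or sidecar gives you a categorical label to alert on, instead of eyeballing four time series.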

7. Why NVIDIA Hides HBM Temp: Three Hypotheses

We have raised this directly at GTC and in NVIDIA developer forums. The response is never on the record. Three factors explain the architecture decision:

7.1 Hypothesis 1: Support Ticket Suppression

CSPs buy H100/H200 with SLAs on performance and power draw, not on HBM junction temperature. If nvidia-smi showed 98°C routinely under vLLM prefill, every operator would open a severity-1 case. Yet 98°C is within HBM3e specification from SK Hynix — the rated maximum junction temperature for HBM3e Gen2 is 100°C, with sustained operation allowed up to 95°C. By hiding it, NVIDIA shifts the support burden to "performance variation," which is far harder to frame as an SLA breach on paper.

7.2 Hypothesis 2: GSP Thermal Policy Was Unstable at Launch

Driver 535.104.12 added NVML_CLOCKS_THROTTLE_REASON_HW_MEMORY_THERMAL. Driver 550+ may expose temperature.memory in nvidia-smi by default. The GSP firmware thermal controls were actively revised during the H100 launch cycle. Exposing an unstable metric — one whose threshold, hysteresis, and policy details changed across driver versions — would cause reproducibility failures in every customer Grafana dashboard with each driver update. The field was deliberately kept out of the stable public surface until the policy stabilized.

7.3 Hypothesis 3: Competitive IP Leakage

HBM stack temperatures reveal packaging quality and binning strategy. On a given H200, if Stack 0 consistently runs 10°C hotter than Stack 5, it tells packaging engineers — including competitors — precisely which physical stack location has the most restricted thermal via contact to the heat spreader. This exposes CoWoS interposer design details, bump pitch efficiency, and HBM bin quality distribution. AMD's MI300X is similarly opaque about per-stack HBM temperatures, likely for the same reason.

Bottom Line

Regardless of which hypothesis is dominant, the operational result is the same: default observability is insufficient for Hopper+. You must instrument NVML or DCGM. Adding DCGM field 252 to your dashboard today costs 15 minutes. Not adding it costs you hours of incident investigation per quarter.

8. Blackwell (B100/B200): Early Notes

Early B100 and B200 samples show structurally similar behavior. Blackwell uses HBM3e at higher bandwidth targets — B200 targets 8 TB/s with 192 GB of HBM3e across 8 stacks. The thermal density problem is, if anything, worse than on the H200.

| Parameter | H100 SXM5 | H200 SXM | B200 SXM (early) |
|---|---|---|---|
| HBM generation | HBM3 | HBM3e | HBM3e Gen2 |
| Stacks | 5 | 6 | 8 |
| Peak BW (spec) | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s |
| TDP | 700W | 700W | 1000W |
| HBM throttle visibility (default) | Masked | Masked | TBD (560+ drivers) |
| DCGM field 252 | Works (DCGM ≥3.1.7) | Works (DCGM ≥3.1.7) | Requires DCGM ≥3.4.x |
Blackwell Note

NVIDIA may expose temperature.memory by default in 560+ drivers for B200, but do not assume it. The NVML and DCGM paths described in this post will remain the ground truth regardless of what the default view exposes — because default views have historically been filtered. Assume hidden until confirmed otherwise on your specific driver version.

9. Validation Checklist: Are You Throttling Right Now?

Run through this checklist on any H100/H200 node showing unexplained inference slowdowns. It takes about 5 minutes:

  1. Read HBM temp via DCGM: dcgmi dmon -e 252,203,204 -c 10. Field 252 at ≥93°C is a warning; ≥95°C means throttling is likely active.
  2. Compare MEMCLK to max: clocks.mem below 1593 MHz under sustained load on H100/H200 SXM indicates clock reduction.
  3. On driver ≥535.104.12, check nvidia-smi -q -d PERFORMANCE for the HW Memory Thermal Slowdown flag.
  4. No DCGM available? Use the NVML poller from Section 4.1 or the pynvml heuristic from Section 4.3.
  5. Check inlet air temperature (ipmitool sdr type Temperature); sustained inlet above ~24°C undermines 3 TB/s+ operation on SXM variants.

10. Community Data Request: Help Map Throttle Points

We are collecting cross-platform data to publish throttle onset temperatures across cooling types, chassis variants, and driver versions. Particularly needed: B200 early samples, liquid-cooled DGX and H200 variants, and MI300X data for comparison.

How to Contribute

Run the command below on a node under load for 10 minutes, then open an issue at github.com/manishklach/thermal-ctrl-harness with your SKU, cooling type, driver version, and log. Data is anonymized before publication.

# Prereq: DCGM >= 3.1.7, driver >= 535
# Run during sustained vLLM prefill load for 10 minutes

dcgmi dmon -e 252,203,204,1002 -c 600 >> hbm_thermal_$(hostname).log
# 252 = HBM Temp, 203 = GPU Temp, 204 = Memory Clock, 1002 = Throttle Reasons

# Also collect:
nvidia-smi --query-gpu=name,pci.bus_id,driver_version,temperature.gpu,clocks.mem,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.hw_thermal_slowdown --format=csv,noheader > meta_$(hostname).csv
ipmitool sdr type Temperature 2>/dev/null | grep -i inlet >> meta_$(hostname).csv

# Combine both files and share via GitHub issue.

11. Conclusion: Instrument or Be Blind

Hopper marked the point where "GPU temperature" ceased to be a single number adequate for inference infrastructure operators. The HBM stack is now a first-class thermal domain with independent performance impact, its own trip points, and a deliberate observability gap in the default tooling. NVIDIA's reasoning — support load management, policy instability at launch, competitive IP protection — is understandable. But as operators, we inherit the consequences: unexplained latency spikes, reboots that "fix" themselves, and dashboards that look green while p99 is on fire.

The fix is a one-time instrumentation investment:

  1. Add DCGM_FI_DEV_MEM_TEMP (field 252) to every Prometheus scrape today
  2. Add DCGM_FI_DEV_MEM_CLOCK and alert when <1520 MHz under active load
  3. Set a Grafana alert at 93°C with a 2-minute duration to filter transient spikes
  4. Correlate HBM temp time series against your p99 latency — the correlation coefficient will be illuminating

The thermal harness, aggregation tools, and cross-fleet anonymized data are open at github.com/manishklach/thermal-ctrl-harness. Pull requests welcome, especially for B200, MI300X comparisons, and liquid-cooled DGX variants.


Published by Manish KL. Feedback and counter-examples welcome. If you operate H200 liquid-cooled and have DCGM logs showing different throttle onset points, please reach out via the GitHub repo above.

Related: Why MIG Performance Isn't Linear · Detecting PCIe Replays in Multi-Node NCCL · HBM Power vs Bandwidth Tradeoff Curves · GSP-RM Clock Management Deep Dive