How to Measure GPU Underutilization on NVIDIA H100 and H200
Point-in-time GPU busy metrics miss a lot. Rolling-window low-utilization time, sampled idle-state behavior, and power draw tell a much better story about whether an expensive GPU is actually doing meaningful work over time.
If you run expensive NVIDIA GPUs, the real question is rarely just "is the job scheduled?" It is: is the GPU actually doing meaningful work over time?
That is the idea behind gpu-low-util-monitor, a Linux-first observability tool aimed initially at NVIDIA datacenter GPUs such as H100 and H200. The repo measures low-utilization time, idle-state behavior, and power-based activity over rolling windows using documented NVIDIA signals, with the long-window low-utilization percentage as the headline KPI.
A before-and-after story
Imagine a GPU that looks roughly 70% busy in nvidia-smi. That sounds acceptable until the
rolling-window view shows something else: long-window low-utilization stays elevated, Idle entries keep
climbing, and average power looks dim relative to the host's normal busy range.
That changes the operator question from "is the job running?" to "is the GPU actually being fed steadily?" In practice, that is often where you find bursty dispatch, small batches, host-side stalls, input bubbles, or otherwise underfed GPUs.
Point-in-time GPU busy metrics can miss bursty dispatch, host-side stalls, short idle gaps, and underfed pipelines. Rolling-window low-utilization, sampled Idle-state behavior, and power telemetry give a better picture of whether the GPU is doing meaningful work over time.
Why point-in-time GPU utilization is not enough
A GPU can look "busy enough" in a snapshot while still spending a meaningful share of the last few minutes drifting into low-utilization policy, bouncing in and out of idle, or drawing far below expected power. That is why rolling windows matter.
Instead of asking "what is the GPU doing right now?", you ask:
- How much of the last minute was it in low-utilization policy?
- How much of the last several minutes did it appear idle at sample times?
- How often did it enter Idle?
- Was its power profile dark, dim, or bright over the same period?
What the repo measures
The repository uses NVIDIA NVML as the primary runtime source of truth and focuses on documented low-utilization, idle-state, and power signals:
- the NVML_FI_DEV_PERF_POLICY_LOW_UTILIZATION perf-policy counter
- current clocks event reasons, including detection of the documented Idle reason
- standard telemetry such as GPU utilization, SM clock, memory clock, power draw, power cap, GPU name, UUID, and index
- cumulative total energy when the runtime path exposes it
The repository treats low-utilization over time as the primary KPI and uses power as a complementary first-order activity indicator rather than a replacement for behavioral telemetry.
Low-utilization policy time
The strongest behavioral signal: it measures how much recent wall time the GPU spent in the documented low-utilization policy.
Sampled Idle-state behavior
Shows whether the GPU frequently appeared idle when sampled, and whether work arrived in bursty episodes.
Power-based activity
A first-order dark / dim / bright signal that complements low-util and idle telemetry.
The core metrics
low_util_pct_window
The fraction of elapsed wall time in the rolling window during which the documented low-utilization perf-policy counter increased:
100 * sum(delta_low_util_counter_ns) / sum(delta_elapsed_ns)
It is clamped to [0, 100], uses monotonic time, and treats counter resets
defensively.
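A minimal sketch of that window math in plain Python (the function name and sample format are illustrative assumptions, not the repo's actual API):

```python
def low_util_pct_window(samples):
    """Percentage of window wall time during which the low-utilization
    counter advanced.

    samples: list of (elapsed_ns, low_util_counter_ns) pairs, oldest first,
    taken from a monotonic clock and the cumulative perf-policy counter.
    """
    total_elapsed = 0
    total_low = 0
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        dc = c1 - c0
        if dt <= 0:
            continue              # skip bad or zero-length intervals
        if dc < 0:
            dc = 0                # counter reset: treat the delta defensively
        total_elapsed += dt
        total_low += min(dc, dt)  # clamp each delta so the result stays in [0, 100]
    if total_elapsed == 0:
        return 0.0
    return 100.0 * total_low / total_elapsed
```

With samples `[(0, 0), (10, 5), (20, 5)]`, the counter advanced for 5 of 20 elapsed nanoseconds, so the result is 25.0; a counter that goes backwards contributes zero rather than a negative delta.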
idle_reason_pct_window
The fraction of samples in the window where the current Idle clock-event reason was active.
This is sampled state presence, not a cumulative hardware timer.
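Because this is sampled presence rather than a hardware timer, the computation reduces to a ratio over poll results. A hypothetical sketch:

```python
def idle_reason_pct_window(idle_flags):
    """Share of samples (as a percentage) where the Idle reason was active.

    idle_flags: one boolean per poll, True when the Idle clock-event reason
    was set at that sample time.
    """
    if not idle_flags:
        return 0.0
    return 100.0 * sum(idle_flags) / len(idle_flags)
```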
idle_entries_window
A software-derived count of false -> true transitions into the Idle state, based on polling.
NVIDIA does not provide this as a built-in counter through the documented signals used here.
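Since NVIDIA exposes no built-in entry counter here, the count has to be derived in software by edge-detecting the sampled Idle flag. A sketch, assuming one boolean per poll:

```python
def idle_entries_window(idle_flags):
    """Count false -> true transitions into the Idle state across a window.

    Assumes the GPU was not idle just before the window started; this is a
    modeling choice, since there is no documented hardware entry counter.
    """
    entries = 0
    prev = False
    for flag in idle_flags:
        if flag and not prev:
            entries += 1
        prev = flag
    return entries
```

Note that a window opening mid-idle counts as one entry under this assumption; a tool could equally seed `prev` from the last sample of the previous window.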
Power and energy metrics
The repo now also tracks current power, rolling average power, power as a percentage of cap, optional calibrated power activity, and cumulative energy deltas when available. These are complementary observability signals, not official utilization truth.
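The rolling power view can be kept with a small fixed-size buffer; the class and method names below are illustrative, not the repo's API:

```python
from collections import deque

class PowerWindow:
    """Rolling window of power readings in watts (illustrative sketch)."""

    def __init__(self, max_samples):
        self.readings = deque(maxlen=max_samples)  # old samples evict automatically

    def add(self, watts):
        self.readings.append(watts)

    def avg_power_w(self):
        return sum(self.readings) / len(self.readings) if self.readings else 0.0

    def power_pct_cap(self, cap_watts):
        # Rolling average power as a percentage of the board power cap.
        return 100.0 * self.avg_power_w() / cap_watts if cap_watts > 0 else 0.0
```

A GPU averaging 212 W against a 700 W cap sits around 30% of cap, which is the kind of "dim" reading the example output below the fold shows for the H100.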
Why power matters
Power tells you that a GPU looks dark or dim; low-util and idle telemetry help explain how often and in what pattern that is happening.
That is the most useful way to think about the combined system:
- Low power + high long-window low-util -> likely underfed or poorly packed GPU
- Low power + high long-window sampled Idle -> likely frequently idle at sample times
- Low power + high long-window idle entries -> likely bursty or intermittent dispatch
- High power + low long-window low-util -> likely healthy busy GPU
- High power + high long-window low-util -> investigate; unusual workload shape, calibration issue, or mismatch
Power is especially useful for dark / dim / bright fleet heat-map style observability, but it remains a proxy rather than a perfect root-cause detector.
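That decision table can be sketched as a tiny classifier. The thresholds below are illustrative assumptions for the sketch, not values the repo ships with:

```python
def classify_gpu(power_pct_cap, low_util_pct, idle_pct, idle_entries,
                 dim_below=40.0, high_above=50.0, many_entries=20):
    """Map long-window metrics to a likely diagnosis (thresholds are guesses)."""
    dim = power_pct_cap < dim_below
    if dim and low_util_pct >= high_above:
        return "likely underfed or poorly packed"
    if dim and idle_pct >= high_above:
        return "likely frequently idle at sample times"
    if dim and idle_entries >= many_entries:
        return "likely bursty or intermittent dispatch"
    if not dim and low_util_pct < high_above:
        return "likely healthy busy GPU"
    return "investigate: unusual workload shape or calibration issue"
```

Feeding in the example rows below, the dim H100 (30.9% of cap, 67.1% low-util) lands in "underfed or poorly packed" while the H200 (94.6% of cap, 2.0% low-util) reads as healthy.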
Example output
The repository becomes much more tangible once you look at a real summary row. Here is the same style of console snapshot used in the README, showing a dimmer H100 beside a bright busy H200:
gpu idx | name | low_util_short(default 60s) | low_util_long(default 1200s) | idle_pct_short(default 60s) | idle_pct_long(default 1200s) | idle_entries_long(default 1200s) | current_power_w | avg_power_short(default 60s) | avg_power_long(default 1200s) | power_pct_cap_long(default 1200s) | power_activity_long(default 1200s) | util_long(default 1200s) | sm_clk_long(default 1200s)
0 | NVIDIA H100 80GB HBM3 | 65.8 | 67.1 | 30.5 | 33.2 | 49 | 212.0 | 218.6 | 216.1 | 30.9 | 22.0 | 27.4 | 1008.8
1 | NVIDIA H200 141GB HBM3e | 2.0 | 2.0 | 0.0 | 0.0 | 0 | 662.0 | 662.0 | 662.0 | 94.6 | 93.9 | 96.0 | 1830.0
The left-hand GPU is not simply "idle." It is dim, frequently low-util, and repeatedly re-entering Idle. The right-hand GPU looks electrically bright and behaviorally steady.
What this repo is and is not
This project is compelling because it is specific. It is not "yet another GPU monitor." It is about low-utilization detection, idle-state episodes, power-based activity, and rolling-window observability on NVIDIA datacenter GPUs.
It is not:
- a perfect utilization truth engine
- an exact cost-loss calculator
- a complete root-cause analysis system
The current README keeps the right tone: this is a practical proxy for GPU underuse, workload starvation, underfeeding, or dark/dim GPUs, while avoiding claims of omniscient knowledge about economic waste or all causes of low activity.
Why this matters for H100 and H200
On high-end datacenter GPUs, silent underuse is expensive. Scheduler-level success can hide meaningful low-util time, repeated idle episodes, and dim power behavior. A GPU can be allocated, visible, and "running" while still spending too much of recent wall time not doing enough useful work.
That is why rolling-window observability matters more than point-in-time snapshots.
Try the repo
If you want to detect dark, dim, bursty, or underfed GPUs instead of relying only on point-in-time busy metrics, you can try the repo locally in a minute:
git clone https://github.com/manishklach/gpu-low-util-monitor
cd gpu-low-util-monitor
pip install -e ".[dev]"
python -m gpu_low_util_monitor --simulate --once --verbose
Repo: github.com/manishklach/gpu-low-util-monitor
It is a strong example of how to talk about GPU underuse honestly: not as a universal truth claim, but as a rolling-window observability signal grounded in documented NVIDIA behavior.
FAQ
How is this different from DCGM?
This repository is still NVML-first, but it now also supports an optional DCGM exporter ingest mode. NVML remains the higher-fidelity path for the headline low-utilization and sampled Idle-state signals. DCGM mode is useful in environments that already expose documented DCGM metrics, but it intentionally degrades where those NVML-specific behavioral signals do not map cleanly.
What polling interval should I use?
The default 1-second interval is a sensible starting point for most hosts. Shorter intervals can capture
more brief idle episodes but cost more polling overhead. Longer intervals are lighter but can miss short
false -> true Idle transitions.
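A toy simulation (purely illustrative, not repo code) makes the tradeoff concrete: short idle bursts that a 100 ms poll catches can vanish entirely at a 1-second poll.

```python
def entries_seen(idle_timeline_ms, interval_ms):
    """Count false -> true Idle transitions visible at a given polling interval.

    idle_timeline_ms: one boolean per millisecond of 'ground truth' idleness.
    """
    entries, prev = 0, False
    for t in range(0, len(idle_timeline_ms), interval_ms):
        flag = idle_timeline_ms[t]
        if flag and not prev:
            entries += 1
        prev = flag
    return entries

# Two 300 ms idle bursts inside a 3-second window.
timeline = [500 <= t < 800 or 1500 <= t < 1800 for t in range(3000)]
```

Polling every 100 ms sees both bursts (`entries_seen(timeline, 100)` is 2), while polling every 1000 ms samples at t = 0, 1000, 2000 and sees none.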
Is GPU power enough on its own?
No. Power is a strong first-order activity proxy and especially useful for dark or dim fleet heat maps, but it is still a proxy. The repo combines power with low-utilization policy time and sampled Idle behavior so operators get a stronger picture than any single metric provides.
Can I use this without H100 or H200 hardware yet?
Yes. The repository includes simulation mode and a fake NVML backend so you can validate rolling-window math, output schemas, and power behavior locally before running on a real NVIDIA datacenter GPU host.