GPU Observability · NVML · H100/H200

How to Measure GPU Underutilization on NVIDIA H100 and H200

Point-in-time GPU busy metrics miss a lot. Rolling-window low-utilization time, sampled idle-state behavior, and power draw tell a much better story about whether an expensive GPU is actually doing meaningful work over time.

By Manish KL · Open-source observability for NVIDIA datacenter GPUs

If you run expensive NVIDIA GPUs, the real question is rarely just "is the job scheduled?" It is: is the GPU actually doing meaningful work over time?

That is the idea behind gpu-low-util-monitor, a Linux-first observability tool aimed initially at NVIDIA datacenter GPUs such as H100 and H200. The repo measures low-utilization time, idle-state behavior, and power-based activity over rolling windows using documented NVIDIA signals, with the long-window low-utilization percentage as the headline KPI.

This is the right framing: not a magical truth engine, not an exact billing-waste meter, but a practical rolling-window observability signal grounded in documented NVIDIA behavior.

A before-and-after story

Imagine a GPU that looks roughly 70% busy in nvidia-smi. That sounds acceptable until the rolling-window view shows something else: long-window low-utilization stays elevated, Idle entries keep climbing, and average power looks dim relative to the host's normal busy range.

That changes the operator question from "is the job running?" to "is the GPU actually being fed steadily?" In practice, that is often where you find bursty dispatch, small batches, host-side stalls, input bubbles, or otherwise underfed GPUs.

Point-in-time GPU busy metrics can miss bursty dispatch, host-side stalls, short idle gaps, and underfed pipelines. Rolling-window low-utilization, sampled Idle-state behavior, and power telemetry give a better picture of whether the GPU is doing meaningful work over time.

Why point-in-time GPU utilization is not enough

A GPU can look "busy enough" in a snapshot while still spending a meaningful share of the last few minutes drifting into low-utilization policy, bouncing in and out of idle, or drawing far below expected power. That is why rolling windows matter.

Instead of asking "what is the GPU doing right now?", you ask: how much of the last few minutes did the GPU spend in low-utilization policy, how often did it bounce in and out of Idle, and how bright did it run relative to its expected power range?

What the repo measures

The repository uses NVIDIA NVML as the primary runtime source of truth, treats low-utilization over time as the primary KPI, and uses power as a complementary first-order activity indicator rather than a replacement for behavioral telemetry. The documented signals fall into three groups:

Low-utilization policy time

The strongest behavioral signal for how much recent wall time the GPU spent in documented low-utilization policy.

Sampled Idle-state behavior

Shows whether the GPU frequently appeared idle when sampled, and whether work arrived in bursty episodes.

Power-based activity

A first-order dark / dim / bright signal that complements low-util and idle telemetry.

The core metrics

low_util_pct_window

The fraction of elapsed wall time in the rolling window during which the documented low-utilization perf-policy counter increased:

100 * sum(delta_low_util_counter_ns) / sum(delta_elapsed_ns)

It is clamped to [0, 100], uses monotonic time, and treats counter resets defensively.
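A minimal sketch of that computation, assuming readings arrive as (monotonic_time_ns, low_util_counter_ns) pairs. The function name and sample shape here are illustrative, not the repo's actual API:

```python
from typing import List, Tuple

def low_util_pct_window(samples: List[Tuple[int, int]]) -> float:
    """Percentage of wall time the low-utilization perf-policy counter advanced.

    `samples` is a time-ordered list of (monotonic_time_ns, low_util_counter_ns)
    readings from the rolling window. Illustrative sketch, not repo code.
    """
    busy_low = 0
    elapsed = 0
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        dc = c1 - c0
        if dt <= 0:
            continue  # skip non-advancing or out-of-order timestamps
        if dc < 0:
            dc = 0  # defensive: counter reset (e.g. driver reload), drop the delta
        elapsed += dt
        busy_low += min(dc, dt)  # counter cannot advance faster than wall time
    if elapsed == 0:
        return 0.0
    return max(0.0, min(100.0, 100.0 * busy_low / elapsed))
```

With three 10 ns intervals in which the counter advances 4, 0, and 8 ns, the window reports 40.0, and a mid-window counter reset simply drops that interval's contribution rather than producing a negative or >100 value.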

idle_reason_pct_window

The fraction of samples in the window where the current Idle clock-event reason was active. This is sampled state presence, not a cumulative hardware timer.

idle_entries_window

A software-derived count of false -> true transitions into Idle state based on polling. NVIDIA does not provide this as a built-in counter through the documented signals used here.
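Both idle metrics can be derived from the same window of boolean poll results. A sketch under that assumption, with illustrative names:

```python
from typing import List, Tuple

def idle_metrics(idle_samples: List[bool]) -> Tuple[float, int]:
    """Derive both sampled idle metrics from one window of poll results.

    idle_reason_pct: share of samples where the Idle clock-event reason was set.
    idle_entries: count of False -> True transitions, i.e. distinct entries
    into Idle as seen by polling (NVML exposes the reason bit, not this count).
    Illustrative sketch, not repo code.
    """
    if not idle_samples:
        return 0.0, 0
    pct = 100.0 * sum(idle_samples) / len(idle_samples)
    # Seed prev with False, so a window that starts idle counts as one entry.
    entries = sum(
        1 for prev, cur in zip([False] + idle_samples, idle_samples)
        if cur and not prev
    )
    return pct, entries
```

For the sample trace [False, True, True, False, True], this yields 60% sampled idle presence and 2 distinct entries into Idle.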

Power and energy metrics

The repo now also tracks current power, rolling average power, power as a percentage of cap, optional calibrated power activity, and cumulative energy deltas when available. These are complementary observability signals, not official utilization truth.
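A rolling power window in this spirit can be kept with a fixed-size deque of per-poll readings. The class and field names below are illustrative, not the repo's schema; NVML reports power and enforced limits in milliwatts, converted here to watts:

```python
from collections import deque

class PowerWindow:
    """Rolling-window power stats: average power and percentage of cap.

    `cap_w` is the enforced power limit in watts; one reading is appended
    per poll, and `window` bounds how many polls the window retains.
    Illustrative sketch, not repo code.
    """
    def __init__(self, cap_w: float, window: int = 60):
        self.cap_w = cap_w
        self.readings = deque(maxlen=window)  # oldest readings fall off

    def add(self, power_w: float) -> None:
        self.readings.append(power_w)

    def avg_power_w(self) -> float:
        return sum(self.readings) / len(self.readings) if self.readings else 0.0

    def power_pct_cap(self) -> float:
        return 100.0 * self.avg_power_w() / self.cap_w if self.cap_w else 0.0
```

A 700 W-cap device averaging 300 W over the window would report roughly 43% of cap: dim, but not dark.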

Why power matters

Power tells you that a GPU looks dark or dim; low-utilization and idle telemetry explain how often, and in what pattern, that happens. That is the most useful way to think about the combined system: the signals matter most when they agree, because low-utilization time, idle-state behavior, and power telemetry together make underfed GPUs easier to recognize than any single snapshot metric alone.

Power is especially useful for dark / dim / bright fleet heat-map style observability, but it remains a proxy rather than a perfect root-cause detector.

Example output

The repository becomes much more tangible once you look at a real summary row. Here is the same style of console snapshot used in the README, showing a dimmer H100 beside a bright busy H200:

gpu idx | name | low_util_short(default 60s) | low_util_long(default 1200s) | idle_pct_short(default 60s) | idle_pct_long(default 1200s) | idle_entries_long(default 1200s) | current_power_w | avg_power_short(default 60s) | avg_power_long(default 1200s) | power_pct_cap_long(default 1200s) | power_activity_long(default 1200s) | util_long(default 1200s) | sm_clk_long(default 1200s)
0 | NVIDIA H100 80GB HBM3 | 65.8 | 67.1 | 30.5 | 33.2 | 49 | 212.0 | 218.6 | 216.1 | 30.9 | 22.0 | 27.4 | 1008.8
1 | NVIDIA H200 141GB HBM3e | 2.0 | 2.0 | 0.0 | 0.0 | 0 | 662.0 | 662.0 | 662.0 | 94.6 | 93.9 | 96.0 | 1830.0

The left-hand GPU is not simply "idle." It is dim, frequently low-util, and repeatedly re-entering Idle. The right-hand GPU looks electrically bright and behaviorally steady.

What this repo is and is not

This project is compelling because it is specific. It is not "yet another GPU monitor." It is about low-utilization detection, idle-state episodes, power-based activity, and rolling-window observability on NVIDIA datacenter GPUs.

It is not: a magical truth engine, an exact billing-waste meter, or a root-cause profiler that explains every source of low activity.

The current README keeps the right tone: this is a practical proxy for GPU underuse, workload starvation, underfeeding, or dark/dim GPUs, while avoiding claims of omniscient knowledge about economic waste or all causes of low activity.

Why this matters for H100 and H200

On high-end datacenter GPUs, silent underuse is expensive. Scheduler-level success can hide meaningful low-util time, repeated idle episodes, and dim power behavior. A GPU can be allocated, visible, and "running" while still spending too much of recent wall time not doing enough useful work.

That is why rolling-window observability matters more than point-in-time snapshots.

Try the repo

If you want to detect dark, dim, bursty, or underfed GPUs instead of relying only on point-in-time busy metrics, you can try the repo locally in a minute:

git clone https://github.com/manishklach/gpu-low-util-monitor
cd gpu-low-util-monitor
pip install -e ".[dev]"
python -m gpu_low_util_monitor --simulate --once --verbose

Repo: github.com/manishklach/gpu-low-util-monitor

It is a strong example of how to talk about GPU underuse honestly: not as a universal truth claim, but as a rolling-window observability signal grounded in documented NVIDIA behavior.

FAQ

How is this different from DCGM?

This repository is still NVML-first, but it now also supports an optional DCGM exporter ingest mode. NVML remains the higher-fidelity path for the headline low-utilization and sampled Idle-state signals. DCGM mode is useful in environments that already expose documented DCGM metrics, but it is intentionally degraded where those NVML-specific behavioral signals do not map cleanly.

What polling interval should I use?

The default 1-second interval is a sensible starting point for most hosts. Shorter intervals can capture more brief idle episodes but cost more polling overhead. Longer intervals are lighter but can miss short false -> true Idle transitions.
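To make the trade-off concrete, here is a toy simulation (not repo code) that polls a synthetic idle/busy trace at two periods; the fine-grained poller sees both idle episodes, while the 1-second poller misses the 100 ms one entirely:

```python
from typing import List, Tuple

def observed_idle_entries(trace: List[Tuple[int, int]],
                          period_ms: int, horizon_ms: int) -> int:
    """Count False -> True Idle transitions a fixed-period poller would see.

    `trace` lists simulated idle episodes as (start_ms, end_ms) pairs;
    everything outside them is busy. Purely illustrative: the real tool
    samples the documented NVML Idle reason bit instead.
    """
    entries, prev = 0, False
    for t in range(0, horizon_ms + 1, period_ms):
        cur = any(start <= t < end for start, end in trace)
        if cur and not prev:
            entries += 1
        prev = cur
    return entries

# A 400 ms idle gap and a 100 ms idle gap within a 5-second horizon.
episodes = [(1000, 1400), (3050, 3150)]
```

Sampling every 100 ms observes 2 idle entries; sampling every 1000 ms observes only 1, because the short gap falls between polls.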

Is GPU power enough on its own?

No. Power is a strong first-order activity proxy and especially useful for dark or dim fleet heat maps, but it is still a proxy. The repo combines power with low-utilization policy time and sampled Idle behavior so operators get a stronger picture than any single metric provides.

Can I use this without H100 or H200 hardware yet?

Yes. The repository includes simulation mode and a fake NVML backend so you can validate rolling-window math, output schemas, and power behavior locally before running on a real NVIDIA datacenter GPU host.
