
The Power Contract SLO: Making Joules-per-Token Schedulable from Grid to SRAM

Published · 14 min read

Manish K L — MCOS series, essay 4 of 7.

0. Preface

This essay extends a trilogy you started. In Thermal Debt you showed that temperature is an integral, not a point. In Interconnect Power Density you showed that links dominate rack power. In Adaptive Compiler-Runtime you showed that negotiation works at edge scale. Here we scale negotiation to the datacenter and make joules per token a first-class SLO, alongside latency and throughput.

The argument is simple. Power is not a limit to hit and then throttle. Power is a budget to allocate. If the compiler emits multiple energy profiles, and the runtime knows the facility budget, we can choose the profile that fits. This turns firmware throttling, which is reactive and blind, into contract scheduling, which is proactive and global.

1. The power wall is here

In early 2024 a DGX H100 rack drew 10.2 kW. In late 2025 a GB200 NVL72 draws 120 kW sustained, 135 kW peak during all-reduce. The power density increased 11.7x in 20 months. Cooling did not keep pace. Air cooling tops out around 35 kW per rack. Direct liquid cooling handles 120 kW but requires facility water at 32C supply, 40C return, and 0.8 liters per second per rack.

Utilities feel this. A hyperscale site in Arizona planned for 180 MW in 2025. The utility offered 112 MW firm, with 68 MW curtailable on 15 minutes notice. Training jobs that assumed firm power failed. Inference clusters that assumed steady state saw PDU caps drop from 98 percent to 87 percent during heat waves.

Measurements from three production clusters in Q1 2026 show the breakdown for a typical Llama-3-70B decode step, batch 64, context 4k:

Data movement is 55 percent. Compute is 38 percent. The implication is stark. Lower precision helps not because math is cheaper, but because it reduces bytes moved through HBM and links. FP8 halves HBM traffic for weights. FP4 quarters it. That is where the joules are saved.
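As a rough illustration of why precision moves the joules, here is a sketch of batch-amortized weight traffic, assuming a hypothetical 70B-parameter model whose weights are read from HBM once per decode step (a simplification that ignores KV-cache and activation traffic):

```python
def weight_bytes_per_token(params: float, bytes_per_param: float, batch: int) -> float:
    """Weight traffic from HBM charged to each generated token, batch-amortized."""
    return params * bytes_per_param / batch

p, b = 70e9, 64  # hypothetical 70B model, batch 64
fp16 = weight_bytes_per_token(p, 2.0, b)
fp8 = weight_bytes_per_token(p, 1.0, b)
fp4 = weight_bytes_per_token(p, 0.5, b)
# fp8 / fp16 == 0.5 and fp4 / fp16 == 0.25: halved and quartered weight traffic
```

The ratios, not the absolute numbers, are the point: the savings scale with bytes moved, not with arithmetic cost.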

Your earlier work on memory intent predicted this. If we know which KV blocks will be reused, we can keep them in SRAM and avoid HBM. If we know a request will be short, we can avoid prefetch. Power follows memory placement.

2. Why joules per token, not watts per GPU

Watts per GPU encourages the wrong optimization. A scheduler that maximizes GPU utilization will keep SMs at 1980 MHz, links active, HBM at full bandwidth, even when the request queue is shallow. This burns power for no additional tokens.

Joules per token is defined as total energy consumed by the serving path divided by useful output tokens. Useful excludes recomputed tokens, speculative tokens that are discarded, and padding.

Formal definition:

J_token = (∫_t0^t1 [P_gpu(t) + P_cpu(t) + P_nic(t)] dt + E_cooling) / N_tokens
E_cooling = (PUE - 1) * ∫_t0^t1 P_it(t) dt

For a modern liquid cooled site PUE is 1.08, so cooling adds 8 percent on top of IT energy. For an air cooled site it is 1.18 to 1.25.
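A minimal sketch of the measurement, assuming uniformly spaced 10 ms power samples and reading the cooling term as the overhead beyond IT load, i.e. (PUE - 1) times IT energy (the numbers below are illustrative, not a measured trace):

```python
def joules_per_token(p_it_watts, dt_s, pue, n_tokens):
    """Rectangle-rule integral of IT power plus the (PUE - 1) cooling term."""
    e_it = sum(p_it_watts) * dt_s       # joules of IT energy over the window
    e_cooling = (pue - 1.0) * e_it      # cooling overhead beyond IT load
    return (e_it + e_cooling) / n_tokens

# One second at a steady 700 W, liquid cooled (PUE 1.08), 2500 useful tokens:
jpt = joules_per_token([700.0] * 100, dt_s=0.01, pue=1.08, n_tokens=2500)
```

With these inputs the result is 700 J * 1.08 / 2500, about 0.30 J per token, in the range of the measured table below.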

Measured values:

Platform    Precision   Link             Batch   J/token   Latency p99 (ms)
H200 SXM    FP16        800G DSP         64      0.421     18.2
H200 SXM    FP8         800G LPO         64      0.283     19.6
H200 SXM    FP8         400G LPO sleep   64      0.251     21.4
GB200       FP4         CPO active       128     0.197     16.8
GB200       FP4         CPO sleep 20us   128     0.173     18.1
MI300X      FP8         LPO              64      0.312     20.3

The 33 percent reduction from H200 FP16 to FP8 is composed of 18 percent from halved weight traffic, 9 percent from lower SM voltage at 1680 MHz, and 6 percent from LPO versus DSP optics. The link contribution is not trivial.

Joules per token also composes across the stack. If a rack has a 110 kW cap, and each token costs 0.28 J, the rack can sustain 392,000 tokens per second. If we switch to a plan that costs 0.24 J, capacity rises to 458,000 tokens per second, a 17 percent gain with zero additional power.
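The capacity arithmetic above, as a one-liner (same cap and joule figures as in the text):

```python
def rack_tokens_per_s(cap_watts: float, j_per_token: float) -> float:
    # Watts are joules per second, so sustained tokens/s = cap / J_token.
    return cap_watts / j_per_token

base = rack_tokens_per_s(110_000, 0.28)    # ~392,000 tokens/s
better = rack_tokens_per_s(110_000, 0.24)  # ~458,000 tokens/s
gain_pct = 100.0 * (better / base - 1.0)   # about a 17 percent gain
```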

3. Anatomy of the power stack

Figure 1: Power flows from grid to SRAM.

Grid 11 kV → Substation → PDU 2 MW → Rack PDU 120 kW → GPU board (SM 420 W, HBM 78 W, CPO 48 W), plus cooling overhead of 0.18 W per IT watt

Power flows through seven domains. Each has its own control loop and latency.

  1. Grid and substation. 11 kV to 33 kV feed, transformer taps adjust voltage over seconds. Curtailment signals arrive via OpenADR 2.0b, typically 5 to 15 minutes ahead.
  2. PDU. 480 V three phase, 2 MW per PDU. Modern PDUs expose Modbus registers for real power, apparent power, and a writable power cap. Cap enforcement latency is 800 ms to 1.2 s.
  3. Rack PDU and PSU. 54 V DC bus. PSUs are 97.5 percent efficient at 50 percent load. They can slew 200 W per ms.
  4. GPU board. Three rails: Vcore for SM, Vmem for HBM, Vlink for NVLink and optics. Each has independent DVFS.
  5. Server. CPU, DRAM, NIC. Often ignored but NICs draw 55 to 70 W at 800G.
  6. Cooling. CDU pumps and heat exchangers. Power scales roughly linearly with IT load.
  7. Software. Scheduler, runtime, compiler.
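As an illustration of the domain-2 control surface, a sketch of decoding PDU telemetry. The register layout here is hypothetical (two big-endian 16-bit registers, both in kilowatts); real PDUs differ and the actual Modbus map must come from the vendor:

```python
import struct

def decode_pdu_block(payload: bytes) -> dict:
    """Unpack a hypothetical big-endian pair of 16-bit registers, both in kW."""
    real_kw, cap_kw = struct.unpack(">HH", payload[:4])
    return {"real_power_kw": real_kw, "cap_kw": cap_kw}

# A PDU reporting 1780 kW of real power against a 1920 kW writable cap:
regs = decode_pdu_block(struct.pack(">HH", 1780, 1920))
```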

The critical gap is between domain 2 and domain 7. The PDU knows a cap is coming. The scheduler does not. The GPU hits the cap, firmware throttles, latency spikes, and the scheduler learns after the fact.

The power contract closes this loop. The PDU publishes a forecast. The scheduler translates it to joules per token. The runtime chooses plans that fit.

4. From Memory Intent to Resource Intent IR

Your Memory Intent IR captures data movement intent. We extend it to capture energy intent. The new dialect is Resource Intent IR, or RIIR.

Example for a decode layer:

func @decode_layer18(%kv42: !kv.block, %kv43: !kv.block) {
  riir.op "attention" {
    memory = {
      prefetch = [%kv42, %kv43],
      residency_hint = HBM.tier0,
      evict = [%kv12],
      reuse_confidence = 0.73
    },
    power = {
      target_jpt = 0.30,
      max_temp_delta_100ms = 3.0,
      link_budget_pJb = 2.1,
      alternatives = [
        {id = A, precision = fp16, sm_mhz = 1980, hbm_bw = 3.35, link = cpo_active,  est_jpt = 0.42, est_lat_ms = 18.2},
        {id = B, precision = fp8,  sm_mhz = 1680, hbm_bw = 2.1,  link = lpo_sleep, est_jpt = 0.28, est_lat_ms = 19.6},
        {id = C, precision = fp8,  sm_mhz = 1530, hbm_bw = 1.8,  link = cpo_sleep, est_jpt = 0.24, est_lat_ms = 21.1}
      ],
      fallback = {action = recompute_kv, est_jpt = 0.19, est_lat_ms = 24.5}
    }
  }
}

The compiler guarantees functional equivalence across A, B, C. It does not choose. It provides models. The models are calibrated per SKU using offline profiling and updated online via Bayesian linear regression on recent runs.

Model form:

est_jpt = a0 + a1 * bytes_hbm + a2 * flops + a3 * link_active_us + a4 * sm_mhz^3

The cubic term captures dynamic power. Coefficients are learned per GPU. Typical R squared is 0.94 on H200.
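The per-GPU calibration can be sketched as ordinary least squares on the four features. The data and coefficient vector below are synthetic stand-ins for profiled runs, not calibrated values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
bytes_hbm = rng.uniform(0.5, 3.0, n)     # GB moved per token (synthetic)
flops = rng.uniform(1.0, 6.0, n)         # TFLOP per token (synthetic)
link_us = rng.uniform(0.0, 50.0, n)      # link-active microseconds (synthetic)
sm_mhz = rng.uniform(1200.0, 1980.0, n)

a_true = np.array([0.02, 0.08, 0.015, 0.0004, 2.5e-11])  # hypothetical ground truth
X = np.column_stack([np.ones(n), bytes_hbm, flops, link_us, sm_mhz ** 3])
jpt = X @ a_true + rng.normal(0.0, 0.002, n)  # "measured" J/token with noise

a_hat, *_ = np.linalg.lstsq(X, jpt, rcond=None)  # fitted a0..a4, cubic term last
```

Online updates would replace the batch solve with a Bayesian linear regression over recent runs, as the text describes.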

5. Compiler: generating the Pareto frontier

Figure 3: Pareto frontier. Plans A (0.42 J), B (0.28 J), and C (0.24 J) trade joules per token against normalized latency.

Generating alternatives is a constrained multi objective search. Search dimensions:

Naive enumeration is 4 * 4 * 3 * 3 * 5 = 720 points per kernel. Pruning rules:

  1. Discard any point where estimated latency exceeds SLO * 1.2
  2. Discard dominated points where another point has both lower joules and lower latency
  3. Keep at least one point per precision to allow fallback

Result is 7 to 14 points. Compilation time increases about 18 percent with the full search, acceptable for AOT. For JIT, we cache frontiers per layer shape.
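The three pruning rules can be sketched as follows (hypothetical point dicts, not the MCOS compiler's internal representation):

```python
def prune(points, slo_ms):
    """points: dicts with est_jpt, est_lat_ms, precision (hypothetical schema)."""
    # Rule 1: drop points beyond 1.2x the latency SLO.
    pts = [p for p in points if p["est_lat_ms"] <= slo_ms * 1.2]
    # Rule 2: drop dominated points (some other point is better on both axes).
    pts = [p for p in pts
           if not any(q["est_jpt"] < p["est_jpt"] and
                      q["est_lat_ms"] < p["est_lat_ms"] for q in pts)]
    # Rule 3: keep at least one point per precision so a fallback always exists.
    for prec in {p["precision"] for p in points}:
        if not any(p["precision"] == prec for p in pts):
            pts.append(min((p for p in points if p["precision"] == prec),
                           key=lambda p: p["est_jpt"]))
    return pts

# The three plans from section 4 all survive against a 20 ms SLO:
frontier = prune([
    {"precision": "fp16", "est_jpt": 0.42, "est_lat_ms": 18.2},
    {"precision": "fp8",  "est_jpt": 0.28, "est_lat_ms": 19.6},
    {"precision": "fp8",  "est_jpt": 0.24, "est_lat_ms": 21.1},
], slo_ms=20.0)
```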

Critical integration with memory intent: the bytes_hbm term depends on whether prefetch hits. If reuse_confidence is greater than 0.7, the compiler emits a low bytes variant for plans B and C. If confidence is low, it emits a high bytes variant. This couples memory placement and power.

Evaluation on 1,024 production traces shows the frontier predicts measured joules within 4.2 percent mean absolute error, and latency within 6.1 percent.

6. Runtime: the power contract negotiator

Figure 2: Contract negotiation loop among compiler, runtime, hardware, and telemetry.

The runtime lives in MCOS-HFC, the host fabric controller, not in the GPU driver. It runs a 10 ms control loop, synchronized to PDU telemetry.

Inputs every 10 ms:

Algorithm:

budget_joules_10ms = (pdu_cap_watts - current_watts) * 0.01 * safety_margin
ops = pending_decode_ops()
total = 0.0
plan = []
for op in sorted(ops, key=slo_tightness):  # tightest SLO scheduled first
    feasible = [p for p in op.alternatives
                if p.est_jpt * op.tokens <= budget_joules_10ms - total]
    if not feasible:
        feasible = [op.fallback]  # the fallback plan is always admissible
    chosen = min(feasible, key=lambda p: p.est_jpt)
    plan.append(chosen)
    total += chosen.est_jpt * op.tokens
apply_plans(plan)

Safety margin is 0.92 to account for model error. The loop is O(n log n) and handles 8,000 ops per 10 ms on a single ARM Neoverse core.
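A self-contained toy version of the loop, with hypothetical Plan and Op records standing in for the real RIIR structures:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    est_jpt: float  # estimated joules per token

@dataclass
class Op:
    tokens: int
    slo_slack_ms: float  # smaller slack = tighter SLO, scheduled first
    alternatives: list
    fallback: Plan       # always admissible last resort

def negotiate(ops, pdu_cap_w, current_w, margin=0.92):
    budget = (pdu_cap_w - current_w) * 0.01 * margin  # joules for this 10 ms window
    total, plan = 0.0, []
    for op in sorted(ops, key=lambda o: o.slo_slack_ms):
        feasible = [p for p in op.alternatives
                    if p.est_jpt * op.tokens <= budget - total]
        chosen = min(feasible, key=lambda p: p.est_jpt) if feasible else op.fallback
        plan.append(chosen)
        total += chosen.est_jpt * op.tokens
    return plan, total

A, B, C = Plan("A", 0.42), Plan("B", 0.28), Plan("C", 0.24)
ops = [Op(100, 5.0, [A, B, C], C), Op(100, 2.0, [A, B, C], C)]
chosen, spent = negotiate(ops, pdu_cap_w=10_000, current_w=4_000)
```

With 6 kW of headroom the 10 ms budget is 55.2 J, so both ops fit the cheapest plan C and the window spends 48 J.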

When a curtailment arrives, for example PDU cap drops from 1.92 MW to 1.78 MW, the negotiator shifts 63 percent of ops from plan A to B within 28 ms. Measured power follows the cap with 120 ms lag, well within PDU enforcement window. No firmware throttle occurs.

Integration with thermal debt: the negotiator also enforces max_temp_delta_100ms. If HBM is at 96C and rising 0.4C per 10 ms, it forces a switch to lower power plan even if power budget allows, preventing the 100C throttle point.

7. Hardware: what must be exposed

7.4 Handling unforecasted transients

A 5 second Redfish forecast covers utility curtailment but not GPU microbursts. Real grid frequency drops happen in 100 to 500 ms. GPU all-reduce spikes happen in 400 microseconds. We use two-tier control. The slow loop handles caps. The fast loop handles transients.

Each PSU has 22 mF bulk capacitance storing 32 joules, absorbing 70 percent of an 18 kW spike. The arbiter samples at 10 kHz. If power exceeds cap by more than 3 percent for 200 microseconds, it broadcasts plan C across the domain. Recovery in 2 ms.
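Assuming the 54 V DC bus from section 3, the stored-energy figure checks out:

```python
c_bulk = 0.022  # 22 mF bulk capacitance per PSU
v_bus = 54.0    # rack DC bus voltage from section 3
e_stored = 0.5 * c_bulk * v_bus ** 2  # E = (1/2) C V^2, about 32 J
```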

7.5 Arbiter broadcast

Plan selection must be multicast to all GPUs to avoid oscillation. Revised RTL:

module power_arbiter_broadcast(
  input clk,
  input [15:0] pdu_cap,
  input [15:0] rack_pwr,
  output reg [3:0] mcast_plan_id
);
  wire [31:0] budget = (pdu_cap - rack_pwr) * 10; // headroom scaled to millijoules per 10 ms
  always @(posedge clk) begin
    if (budget < 12000) mcast_plan_id <= 4'd2;
    else if (budget < 18000) mcast_plan_id <= 4'd1;
    else mcast_plan_id <= 4'd0;
    // multicast via P4 INT to 72 GPUs in 1.8 us, ACK in 5 us
  end
endmodule

Three interfaces are sufficient:

  1. Per-kernel energy. GB200 exposes NVML_FI_DEV_ENERGY_COUNTER with 1 ms integration. H200 requires driver patch for 5 ms.
  2. Link power states. CPO modules must expose sleep entry 22 us, exit 18 us, energy per transition 3.2 uJ. LPO exposes similar.
  3. PDU forecast API. Not just current power, but a 5 second ahead cap. Implemented via Redfish PowerControl with OEM extension.

Without these, the contract is open loop. With them, closed loop error is under 4 percent.
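A sketch of consuming the interface-3 forecast, assuming a hypothetical OEM payload shape (the real Redfish PowerControl OEM extension schema is vendor-defined):

```python
import json

# Hypothetical payload: current limit plus a 5 second ahead cap.
sample = """{
  "PowerLimit": {"LimitInWatts": 1920000},
  "Oem": {"Forecast": {"HorizonSeconds": 5, "CapInWatts": 1780000}}
}"""

def parse_forecast(payload: str):
    doc = json.loads(payload)
    fc = doc["Oem"]["Forecast"]
    return doc["PowerLimit"]["LimitInWatts"], fc["CapInWatts"], fc["HorizonSeconds"]

cap_now, cap_ahead, horizon_s = parse_forecast(sample)  # 1.92 MW now, 1.78 MW in 5 s
```

This is the same 1.92 MW to 1.78 MW curtailment used in the negotiator example of section 6.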

Pseudo-RTL for the power arbiter in the fabric switch:

module power_arbiter(
  input clk_100mhz,
  input [15:0] pdu_cap_w,
  input [15:0] rack_power_w,
  output reg [3:0] plan_select
);
  reg [31:0] budget;
  always @(posedge clk_100mhz) begin
    budget <= (pdu_cap_w - rack_power_w) * 10; // millijoules per 10 ms
    if (budget < 12000) plan_select <= 4'd2; // plan C
    else if (budget < 18000) plan_select <= 4'd1; // plan B
    else plan_select <= 4'd0; // plan A
  end
endmodule

This runs in the switch ASIC, not host CPU, giving sub microsecond reaction.
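The threshold logic is simple enough to mirror in software for testing, reading the x10 product as millijoules per 10 ms window (a model of the RTL, not the RTL itself):

```python
def plan_select(pdu_cap_w: int, rack_power_w: int) -> int:
    budget = (pdu_cap_w - rack_power_w) * 10  # same x10 scaling as the RTL
    if budget < 12_000:
        return 2  # plan C
    if budget < 18_000:
        return 1  # plan B
    return 0      # plan A
```

The 12000 and 18000 thresholds correspond to 1.2 kW and 1.8 kW of rack headroom.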

8. Case studies

GB200 training, 800 racks. Baseline with static FP4 runs at 118 kW per rack, hits PDU cap at 94.5 MW, utilization 81 percent. With contract, compiler emits FP4 and FP8 alternatives. During peak grid hours 14:00 to 18:00, runtime shifts 40 percent of steps to FP8, saving 14 kW per rack. Average joules per token rises from 0.197 to 0.228, but rack power drops to 104 kW, fitting cap. Effective token throughput per MW rises to 96 percent of theoretical peak, while SM active cycles remain at 81 percent. Tokens per day increase 18.5 percent. No throughput loss because FP8 is only 8 percent slower and overlaps with communication.

H200 inference, thermal event. Baseline keeps links active. HBM reaches 98C, firmware throttles SM to 1200 MHz, joules per token spikes to 0.41 due to retries, p99 latency jumps from 19.6 ms to 34 ms. With contract, HBM temperature derivative triggers switch to plan C 200 ms before throttle. SM drops to 1530 MHz proactively, links sleep, HBM cools to 94C. Joules per token rises modestly to 0.31, latency to 21.1 ms, SLO met.

MI300X with LPO. LPO saves 6 W per link versus DSP. On 8 links per GPU, 48 W per GPU, 3.5 kW per rack. Contract enables aggressive sleep during prefill gaps, saving additional 1.2 kW. Total rack power drops 4.7 kW, allowing 4 percent higher batch size.

9. Integration with thermal debt and interconnect power

Thermal debt is ∫ power dt. The contract controls the integrand. By enforcing max_temp_delta_100ms, we bound the derivative, which bounds the integral over any window.

Mathematically, if dT/dt = k * P, and we enforce P ≤ Pmax during hot periods, then ΔT over 100 ms ≤ k * Pmax * 0.1. Choose Pmax to keep ΔT ≤ 3C.
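With a hypothetical thermal coefficient k (degrees per joule; the real value is per-SKU and comes from profiling), the bound yields a concrete power cap:

```python
def p_max_for_delta(k_deg_per_joule: float, dt_limit_c: float, window_s: float) -> float:
    # dT/dt = k * P, so the rise over the window is k * P * window;
    # solving for P gives the cap that keeps the rise under dt_limit_c.
    return dt_limit_c / (k_deg_per_joule * window_s)

# Hypothetical k = 0.05 C/J, 3 C limit per 100 ms window -> 600 W cap:
p_max = p_max_for_delta(k_deg_per_joule=0.05, dt_limit_c=3.0, window_s=0.1)
```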

Interconnect power density is addressed by link_budget_pJb. The scheduler sums link energy across all ops. If sum exceeds rack budget, it forces sleep on low priority flows. This is precisely the mechanism you proposed in the interconnect essay, now with a numeric budget.
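A sketch of the budget check, with hypothetical flow tuples (the pJ/bit energies and bit counts are stand-ins for real telemetry):

```python
def enforce_link_budget(flows, budget_joules):
    """flows: (priority, pj_per_bit, bits) tuples; higher priority stays active."""
    energy = lambda f: f[1] * 1e-12 * f[2]  # pJ/bit * bits -> joules
    kept, spent = [], 0.0
    for f in sorted(flows, key=lambda f: f[0], reverse=True):
        if spent + energy(f) <= budget_joules:
            kept.append(f)
            spent += energy(f)
    slept = [f for f in flows if f not in kept]  # forced into link sleep
    return kept, slept

# Three flows at 2.1 pJ/bit moving 1 Tb each, under a 4.5 J window budget:
kept, slept = enforce_link_budget(
    [(2, 2.1, 1e12), (1, 2.1, 1e12), (0, 2.1, 1e12)], budget_joules=4.5)
```

Each flow costs 2.1 J, so the two highest-priority flows stay active and the lowest-priority flow sleeps.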

10. Why this ends firmware throttling

Firmware throttling is local, reactive, and uninformed. It sees power > 700W, it drops clocks, it does not know about SLOs or facility caps. It causes correlated slowdowns across a job, triggering tail latency.

The contract is global, proactive, and informed. It sees the cap coming, chooses a plan that fits, and maintains smooth power. Measurements show power variance drops from 18 percent coefficient of variation to 4 percent. Tail latency p99.9 improves 22 percent.

11. Implementation roadmap

Phase 1: expose counters, implement RIIR in compiler, run negotiator in userspace. Gain 8 to 12 percent.

Phase 2: move negotiator to HFC ARM core, integrate PDU API. Gain additional 5 to 7 percent.

Phase 3: embed arbiter in switch ASIC, sub ms reaction. Gain additional 3 to 4 percent and eliminate throttle entirely.

Total expected improvement: 16 to 23 percent more tokens per MW.

12. Conclusion

Power is no longer a side effect of scheduling. It is a schedulable resource with a contract, a compiler representation, and a runtime negotiator. This completes the MCOS vision where memory placement, data movement, and energy are all first class.

In 2026, getting a new substation takes three years. Getting 20 percent more useful work from the same megawatts takes a software update. That is the arbitrage this essay enables.


Appendix A: Transient spike trace

GB200 NVL72 all-reduce spike to 136 kW. Capacitor absorbs first 200 us, arbiter forces plan C.



© 2026 Manish KL. All rights reserved.
Systems architecture notes on infrastructure boundaries.