Manish K L — April 2026. MCOS series, essay 2 of 7. Full version.
This essay blends three kinds of claims, and I label them explicitly to maintain credibility.
Where I use exact numbers below, assume they refer to the simulated configuration unless stated otherwise.
In Essay 4 we made joules per token schedulable. Here we make gigabytes per second schedulable. The two are inseparable. Power follows bandwidth. If you cannot partition HBM bandwidth, you cannot honor power contracts in multi-tenant settings. This essay provides the missing fabric.
In trace-driven simulations based on anonymized production traces from March 2026, about 68 percent of GPU stall cycles in the decode-heavy mix were memory stalls, not compute stalls. For a representative 70B model at batch 64, an analytic estimate puts each decode step at roughly 1.3 MB of KV cache reads per token. At a modeled 3,800 tokens per second, that is about 5 GB/s sustained per request stream. With 8 tenants, aggregate demand is roughly 40 GB/s, only about 1.2 percent of HBM peak. So why does contention occur?
Because prefill is bursty. In one representative configuration, a 32k context prefill moves roughly 42 GB; at a modeled effective bandwidth of about 24 percent of HBM peak, that burst occupies the bus for on the order of 50 ms. During that burst, decode requests from other tenants queue behind roughly 1,200 outstanding memory requests. The memory controller's FCFS policy gives them no priority. In simulation, p99 decode latency rose from about 20 ms to roughly 84 ms under FCFS, a more than 4x increase.
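The arithmetic behind these simulated numbers fits in a few lines. The peak-bandwidth and prefill-efficiency constants below are assumptions chosen to match the modeled configuration, not measured hardware values:

```python
# Back-of-envelope check of the simulated configuration above.
KV_BYTES_PER_TOKEN = 1.3e6   # ~1.3 MB of KV cache read per decode step (modeled)
TOKENS_PER_SEC     = 3800    # modeled decode rate
TENANTS            = 8
HBM_PEAK_GBPS      = 3350    # assumed peak bandwidth; vendor- and SKU-dependent

per_stream_gbps = KV_BYTES_PER_TOKEN * TOKENS_PER_SEC / 1e9  # ~4.9 GB/s
aggregate_gbps  = per_stream_gbps * TENANTS                  # ~40 GB/s
steady_fraction = aggregate_gbps / HBM_PEAK_GBPS             # ~1.2% of peak

# Prefill burst: 42 GB moved at ~24% effective HBM bandwidth.
PREFILL_BYTES  = 42e9
effective_gbps = 0.24 * HBM_PEAK_GBPS
burst_ms       = PREFILL_BYTES / (effective_gbps * 1e9) * 1e3  # ~50 ms
```

Steady-state decode barely registers against peak; it is the 50 ms prefill bursts that cause the queueing.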
This is a classic network congestion problem, but inside the GPU.
Existing isolation partitions SMs via MPS, partitions HBM capacity via CUDA virtual memory, and partitions power via nvidia-smi. None partitions bandwidth. The memory controller sees a flat stream of requests tagged only by physical address.
Readers often ask why we cannot simply reuse existing QoS mechanisms. We need hardware-enforced bandwidth contracts, and none of the usual workarounds delivers them:
CUDA stream priority: Priorities affect kernel launch order, not memory controller arbitration. Once kernels are in flight, their memory requests intermix at the HBM controller with no tenant tag.
Software throttling: Inserting sleeps reduces average bandwidth but provides no worst-case guarantee; a burst still arrives eventually, and tail latency remains uncontrolled. In simulation, software throttling cut p99 by only 18 percent while reducing throughput by 22 percent.
NIC-style QoS: Network switches use token buckets, but they operate on packets with explicit headers. HBM requests lack tenant IDs in current hardware. Our proposal adds a 4-bit tenant tag in the memory request, analogous to a VLAN tag.
Static bank partitioning: Assigning banks statically eliminates interference but wastes capacity. If tenant A is idle, its banks sit empty while tenant B is throttled. Our dynamic WFQ shares unused bandwidth automatically.
MIG/MPS: These partition compute and capacity, not bandwidth. MIG on Hopper still shares the same HBM controllers across instances.
The gap is not a lack of ideas; it is a lack of enforcement at the memory controller.
Figure 4: Prefill burst disrupting decode
Figure 1: Three-tier KV fabric
Tiers: HBM for pages with reuse distance less than 10 ms, CXL for 10 to 200 ms, remote for greater than 200 ms. The arbiter lives in the HBM controller, with shadow copies in CXL root complex.
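The tiering rule above is a simple threshold function on predicted reuse distance. A minimal sketch, with illustrative names (this is not the MCOS API):

```python
def place_kv_page(reuse_distance_ms: float) -> str:
    """Map a KV page's predicted reuse distance to a memory tier,
    using the thresholds from the three-tier fabric: HBM under 10 ms,
    CXL for 10-200 ms, remote beyond 200 ms."""
    if reuse_distance_ms < 10:
        return "hbm"     # hot: reused almost immediately
    elif reuse_distance_ms <= 200:
        return "cxl"     # warm: tolerates CXL latency
    return "remote"      # cold: may never be reused
```

The interesting engineering is in predicting reuse distance, not in the placement itself; the compiler's reuse analysis described later supplies the prediction.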
We extend RIIR from Essay 4:
riir.memory = {
    bandwidth = {
        guaranteed_gbps = 120,
        burst_gbps = 400,
        burst_duration_ms = 2.0,
        priority = 1,        // 0 is highest
        latency_slo_us = 25
    },
    placement = {
        hbm_quota_mb = 8192,
        cxl_quota_mb = 32768,
        page_color = 0x3     // banks 0-1
    }
}
Guaranteed is enforced. Burst is opportunistic. Priority resolves contention when the sum of guaranteed rates exceeds capacity.
Each tenant i has bucket Bi refilled at rate ri bytes per microsecond. Bucket capacity Ci = ri * burst_duration.
At time t, on request of size s:
if Bi >= s:
    admit; Bi -= s
elif sum_j rj < capacity:
    admit using burst; Bi = 0
else:
    queue
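The admission rule above can be sketched directly. This is an illustrative model of the arbiter's logic, not the hardware implementation; names and units (bytes per microsecond) follow the definitions in the text:

```python
class TenantBucket:
    """Token bucket Bi refilled at ri bytes/us, capacity Ci = ri * burst_duration."""
    def __init__(self, rate_bpus: float, burst_duration_us: float):
        self.rate = rate_bpus
        self.capacity = rate_bpus * burst_duration_us
        self.tokens = self.capacity
        self.last_us = 0.0

    def refill(self, now_us: float):
        self.tokens = min(self.capacity,
                          self.tokens + self.rate * (now_us - self.last_us))
        self.last_us = now_us

def admit(bucket: TenantBucket, size: float, now_us: float,
          sum_rates: float, link_capacity: float) -> str:
    """Returns 'admit', 'burst', or 'queue' per the arbitration rule."""
    bucket.refill(now_us)
    if bucket.tokens >= size:
        bucket.tokens -= size
        return "admit"
    if sum_rates < link_capacity:   # spare capacity: opportunistic burst
        bucket.tokens = 0
        return "burst"
    return "queue"
```

A request within the guaranteed rate always admits; bursts ride only on provably spare capacity; everything else queues, which is what makes the worst case boundable.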
WFQ weight wi = ri / sum rj. The scheduler computes virtual finish time Fi = max(V, Ai) + s/wi, where V is virtual time, Ai arrival. Requests issued in increasing Fi order.
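The virtual-finish-time formula is worth seeing with numbers. A sketch using the symbols above (V is virtual time, Ai arrival, wi = ri / sum rj):

```python
def finish_time(V: float, A_i: float, size: float,
                r_i: float, sum_r: float) -> float:
    """F_i = max(V, A_i) + size / w_i, with w_i = r_i / sum_r."""
    w_i = r_i / sum_r
    return max(V, A_i) + size / w_i

# Two tenants with equal rates: same-size requests, later arrival
# gets a later virtual finish time and is issued second.
f1 = finish_time(V=0.0, A_i=0.0,  size=100, r_i=50, sum_r=100)  # 200.0
f2 = finish_time(V=0.0, A_i=50.0, size=100, r_i=50, sum_r=100)  # 250.0
```

Serving in increasing F order is what turns the static weights into proportional bandwidth shares under load.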
In simulation, this kept each tenant within about 3 percent of its target share, versus deviations of up to 340 percent under FCFS.
In practice, the contract is not emitted by a mythical new compiler. For PyTorch 2.x, this is a TorchInductor post-lowering pass that runs after graph capture: it walks the FX graph, annotates each attention and GEMM node with estimated bytes and reuse distance from the profile, and emits the RIIR bandwidth alternatives as metadata attached to the compiled kernel. For XLA/JAX, it is an HLO pass just before codegen that inserts the same metadata into custom call attributes. For vLLM and TensorRT-LLM serving, the runtime reads the metadata at model load and registers the contracts with MCOS-HFC; the scheduler then chooses the plan at request dispatch, not at compile time. The compiler proposes the Pareto frontier, the runtime selects based on live fabric and power state.
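The dispatch-time selection step can be sketched as a filter over the Pareto frontier. The plan fields and function below are illustrative, not the MCOS-HFC interface:

```python
def select_plan(plans, free_gbps, power_headroom_w):
    """Pick the lowest-latency plan whose guaranteed bandwidth fits the
    live fabric state and whose power fits the live headroom.
    Returns None if nothing fits (request queues until one does)."""
    feasible = [p for p in plans
                if p["guaranteed_gbps"] <= free_gbps
                and p["watts"] <= power_headroom_w]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p["latency_ms"])

# Hypothetical two-point frontier emitted by the compiler pass.
plans = [
    {"guaranteed_gbps": 400, "watts": 60, "latency_ms": 12},
    {"guaranteed_gbps": 120, "watts": 30, "latency_ms": 31},
]
choice = select_plan(plans, free_gbps=150, power_headroom_w=40)
# With only 150 GB/s free and 40 W of headroom, the 120 GB/s plan wins.
```

The compiler never commits to a plan; it only guarantees every plan on the frontier is safe to run, which is what lets the runtime re-decide per request.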
The compiler performs reuse analysis on the KV accesses in each loop nest. For a 32k prefill, it splits the work into 8 chunks of 4k tokens; each chunk acquires a burst contract, then yields. This is implemented via async prefetch:
for chunk in range(8):
    prefetch kv[chunk*4k : (chunk+1)*4k] to hbm, async
    wait bandwidth_token
    compute attention for chunk
    release bandwidth_token
Yield points allow other tenants to interleave.
Figure 2: Token bucket in memory controller
Implementation cost: 8 tenants * 64 bits for bucket state = 64 bytes per memory channel. On H200 with 8 channels, 512 bytes total. Logic runs at 1 GHz, 1 cycle per request.
Pseudo-RTL (simplified, omits starvation prevention, multi-channel coordination, and reorder buffer effects):
always @(posedge clk) begin
    for (i = 0; i < 8; i = i + 1) begin
        if (req_valid[i] && bucket[i] >= req_size[i]) begin
            grant[i]  <= 1;
            bucket[i] <= bucket[i] - req_size[i];
        end
    end
end
In our modeled HBM3e configuration there are 16 banks per stack across 8 stacks. Without coloring, tenants A and B collide on the same bank, serializing their requests. We assign colors: tenant 0 gets banks 0-3, tenant 1 gets banks 4-7, and so on. The OS allocator respects the color in physical page allocation. In our modeled system, the bank conflict rate dropped from roughly 23 percent to about 4 percent with coloring.
The fabric arbiter exports metrics to MCOS-HFC every 10 ms: bytes served per tenant, queue depth, violations. The power contract negotiator from Essay 4 reads these and computes power = bandwidth in GB/s × 0.023 W per GB/s + baseline. If power exceeds budget, it reduces burst_gbps for low-priority tenants.
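The bandwidth-to-power coupling and the burst-trimming policy can be sketched together. The baseline wattage and the tenant records are illustrative assumptions; only the 0.023 W per GB/s slope comes from the modeled fit above:

```python
WATTS_PER_GBPS = 0.023   # modeled fit: power = GB/s * 0.023 W + baseline
BASELINE_W     = 55.0    # assumed idle HBM power, for illustration only

def hbm_power_w(gbps_served: float) -> float:
    return gbps_served * WATTS_PER_GBPS + BASELINE_W

def enforce_budget(tenants: list, budget_w: float) -> list:
    """If projected power exceeds the budget, trim burst_gbps from the
    lowest-priority tenants first (higher number = lower priority),
    never cutting below a tenant's guaranteed rate."""
    total = hbm_power_w(sum(t["burst_gbps"] for t in tenants))
    for t in sorted(tenants, key=lambda t: -t["priority"]):
        if total <= budget_w:
            break
        excess_gbps = (total - budget_w) / WATTS_PER_GBPS
        cut = min(excess_gbps, t["burst_gbps"] - t["guaranteed_gbps"])
        t["burst_gbps"] -= cut
        total -= cut * WATTS_PER_GBPS
    return tenants
```

Guarantees are untouched by design: the power loop only claws back the opportunistic burst headroom, so the latency contracts from the bandwidth side survive a power squeeze.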
This closes the loop between bandwidth and joules per token.
With CXL spill, tail latency without fabric was about 112 ms in simulation. With fabric and coloring, about 29 ms. Effective HBM utilization rose from roughly 58 percent to about 87 percent in the model, driven by reduced bank conflicts.
Power interaction: in the same simulated test, average HBM power dropped from 78 W to 69 W, because fewer bank conflicts mean fewer row activations. Analytic modeling puts the saving on the order of 5 to 10 W per GPU, roughly 648 W per rack, directly improving joules per token by 4.2 percent.
Token buckets handle steady state, not synchronized bursts from all tenants. If 8 tenants start prefill simultaneously, the sum of guaranteed rates exceeds capacity. The solution is admission control at the scheduler: reject a new prefill if the sum of existing guarantees plus the new one exceeds 0.85 × capacity.
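The admission check is one comparison. A sketch with the 0.85 headroom factor from the rule above (the factor itself is a tuning choice of the modeled system):

```python
HEADROOM = 0.85   # admit guarantees only up to 85% of link capacity

def admit_prefill(current_guaranteed_gbps: float,
                  new_guaranteed_gbps: float,
                  capacity_gbps: float) -> bool:
    """Reject a new prefill if its guarantee would push the sum of
    guarantees past the headroom threshold, preventing the
    synchronized-burst overload case."""
    return (current_guaranteed_gbps + new_guaranteed_gbps
            <= HEADROOM * capacity_gbps)
```

The 15 percent reserve is what absorbs the bursts that token buckets alone cannot bound.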
KV cache bandwidth must be partitioned like network bandwidth. With compiler contracts, token buckets in hardware, and page coloring, multi-tenant tail latency improves roughly 3.9x in simulation (112 ms to 29 ms) and throughput improves 28 to 31 percent. Combined with power contracts, this yields a full resource fabric for LLM inference.
Figure 3: Tail latency with and without fabric (simulated)
Simulated H200, 4 tenants, 1 prefill + 3 decode. Bars show p99 decode latency.
© 2026 MANISH AI. All rights reserved.
Systems architecture notes on infrastructure boundaries.