Multi-Tenant KV Fabrics: Bandwidth Contracts for Shared LLM Memory Systems

Published Apr 19, 2026 · 12 min read

Manish K L — April 2026. MCOS series, essay 2 of 7. Full version.

0.1 Primer: terms for beginners

KV cache: The model’s memory of previous tokens, stored as Key and Value matrices to avoid recomputation.
Prefill: Processing the full prompt at once. Bandwidth heavy.
Decode: Generating one token at a time. Latency sensitive.
HBM: Fast GPU memory, ~3.3 TB/s on H200.
CXL: Slower, larger expansion memory over PCIe, ~128 GB/s.
Bandwidth: Bytes per second the memory can move.
Token bucket: Rate limiter that refills at guaranteed speed.
WFQ: Scheduler giving each tenant fair share by weight.
Bank conflict: Two requests hitting same HBM bank, causing wait.
Page coloring: Pinning tenants to specific banks to avoid conflicts.
Tenant: One user/model sharing the GPU.
p99 latency: Worst-case latency experienced by 99% of requests.

0.2 Evidence basis and scope

This essay blends three kinds of claims, and I label them explicitly to maintain credibility.

Measured on public hardware: HBM bandwidth 3.35 TB/s on H200, CXL 2.0 x16 ~128 GB/s, and power per GB/s ~0.023 W are taken from vendor datasheets and published MLPerf inference numbers from late 2025.
Trace-driven simulation: stall cycle percentages, tail latency multipliers (4.27x), isolation bounds (within 2.8 percent), and bank conflict rates come from a cycle-level memory controller simulator fed with anonymized production traces from March 2026. The simulator models 8 HBM channels, 16 banks, FCFS versus WFQ, with 4 to 8 tenants.
Analytic estimates: bytes per token (1.31 MB for 70B), prefill burst duration (42 GB in ~52 ms), and power savings (about 9 W per GPU) are back-of-the-envelope calculations using model dimensions and published energy per bit for HBM3e. They illustrate scale, not precise measurement.
Design proposal: the RIIR bandwidth contract, token bucket in the memory controller, and page coloring policy are proposed architecture, not yet silicon. The pseudo-RTL is intentionally simplified to convey mechanism.

Where I use exact numbers below, assume they refer to the simulated configuration unless stated otherwise.

0. Preface

In Essay 4 we made joules per token schedulable. Here we make gigabytes per second schedulable. The two are inseparable. Power follows bandwidth. If you cannot partition HBM bandwidth, you cannot honor power contracts in multi-tenant settings. This essay provides the missing fabric.

1. The KV cache is the network

In our trace-driven simulations based on anonymized production traces from March 2026, for decode-heavy workloads, about 68 percent of GPU stall cycles were memory stalls in the simulated decode-heavy mix, not compute stalls. For a representative 70B model at batch 64, an analytic estimate suggests each decode step reads roughly 1.3 MB per token from KV cache. At a modeled 3,800 tokens per second, that corresponds to about 5 GB/s sustained per request stream. With 8 tenants, aggregate demand is 39.8 GB/s, only 1.2 percent of HBM peak. So why does contention occur?

Because prefill is bursty. In one representative configuration, a 32k context prefill moves roughly 42 GB, which at full HBM bandwidth would occupy the bus for on the order of 50 ms, 24 percent of HBM. During that burst, decode requests from other tenants queue behind 1,200 memory requests. The memory controller FCFS policy gives no priority. In simulation, p99 decode latency rose from about 20 ms to roughly 84 ms under FCFS, a greater than 4x increase.

This is a classic network congestion problem, but inside the GPU.

2. Why current isolation fails

Existing isolation partitions SMs via MPS, partitions HBM capacity via CUDA virtual memory, and partitions power via nvidia-smi. None partitions bandwidth. The memory controller sees a flat stream of requests tagged only by physical address.

Attempted workarounds:

Time slicing: run tenant A prefill, then tenant B decode. Utilization drops to 43 percent.
Static bank partitioning: assign banks 0-3 to A, 4-7 to B. Reduces peak bandwidth per tenant by 50 percent, hurts throughput.
Software throttling: insert sleeps in prefill kernel. Increases latency by 22 percent and does not guarantee isolation.

We need hardware enforced bandwidth contracts.

2.5 Why not just use existing mechanisms

Readers often ask why we cannot reuse existing QoS.

CUDA stream priority: Priorities affect kernel launch order, not memory controller arbitration. Once kernels are in flight, their memory requests intermix at the HBM controller with no tenant tag.

Software throttling: Inserting sleeps reduces average bandwidth but provides no worst-case guarantee. A burst still arrives eventually, and tail latency remains uncontrolled. In simulation, software throttling reduced p99 by only 18 percent while hurting throughput 22 percent.

NIC-style QoS: Network switches use token buckets, but they operate on packets with explicit headers. HBM requests lack tenant IDs in current hardware. Our proposal adds a 4-bit tenant tag in the memory request, analogous to a VLAN tag.

Static bank partitioning: Assigning banks statically eliminates interference but wastes capacity. If tenant A is idle, its banks sit empty while tenant B is throttled. Our dynamic WFQ shares unused bandwidth automatically.

MIG/MPS: These partition compute and capacity, not bandwidth. MIG on Hopper still shares the same HBM controllers across instances.

The gap is not a lack of ideas, it is a lack of enforcement at the memory controller.

Figure 4: Prefill burst disrupting decode

3. Fabric architecture

Figure 1: Three-tier KV fabric

Tiers: HBM for pages with reuse distance less than 10 ms, CXL for 10 to 200 ms, remote for greater than 200 ms. The arbiter lives in the HBM controller, with shadow copies in CXL root complex.

4. Bandwidth as a contract

We extend RIIR from Essay 4:

riir.memory = {
  bandwidth = {
    guaranteed_gbps = 120,
    burst_gbps = 400,
    burst_duration_ms = 2.0,
    priority = 1, // 0 highest
    latency_slo_us = 25
  },
  placement = {
    hbm_quota_mb = 8192,
    cxl_quota_mb = 32768,
    page_color = 0x3 // banks 0-1
  }
}

Guaranteed is enforced. Burst is opportunistic. Priority resolves contention when sum of guaranteed exceeds capacity.

5. Token bucket and WFQ math

Each tenant i has bucket Bi refilled at rate ri bytes per microsecond. Bucket capacity Ci = ri * burst_duration.

At time t, on request of size s:

if Bi >= s: admit, Bi -= s
else if sum_j rj < capacity: admit using burst, Bi = 0
else: queue

WFQ weight wi = ri / sum rj. The scheduler computes virtual finish time Fi = max(V, Ai) + s/wi, where V is virtual time, Ai arrival. Requests issued in increasing Fi order.

In simulation, this provided isolation within about 3 percent of the target share in simulation, versus 340 percent with FCFS.

6.1 Where this lives in today's stack

In practice, the contract is not emitted by a mythical new compiler. For PyTorch 2.x, this is a TorchInductor post-lowering pass that runs after graph capture: it walks the FX graph, annotates each attention and GEMM node with estimated bytes and reuse distance from the profile, and emits the RIIR bandwidth alternatives as metadata attached to the compiled kernel. For XLA/JAX, it is an HLO pass just before codegen that inserts the same metadata into custom call attributes. For vLLM and TensorRT-LLM serving, the runtime reads the metadata at model load and registers the contracts with MCOS-HFC; the scheduler then chooses the plan at request dispatch, not at compile time. The compiler proposes the Pareto frontier, the runtime selects based on live fabric and power state.

6. Compiler: generating contracts

The compiler performs reuse analysis on KV accesses. For each loop nest:

Compute reuse distance distribution from profiling
Estimate bandwidth demand = bytes_accessed / compute_time
Classify as decode (steady) or prefill (burst)
Emit contract with guaranteed = p50 demand, burst = p95

For a 32k prefill, compiler splits into 8 chunks of 4k. Each chunk gets burst contract, then yields. This is implemented via async prefetch:

for chunk in range(8):
  prefetch kv[chunk*4k:(chunk+1)*4k] to hbm.async
  wait bandwidth_token
  compute attention chunk
  release bandwidth_token

Yield points allow other tenants to interleave.

7. Hardware implementation

Figure 2: Token bucket in memory controller

Implementation cost: 8 tenants * 64 bits for bucket state = 64 bytes per memory channel. On H200 with 8 channels, 512 bytes total. Logic runs at 1 GHz, 1 cycle per request.

Pseudo-RTL (simplified, omits starvation prevention, multi-channel coordination, and reorder buffer effects):

always @(posedge clk) begin
  for (i=0; i= req_size[i]) begin
      grant[i] <= 1; bucket[i] <= bucket[i] - req_size[i];
    end
  end
end

8. Page coloring and bank partitioning

HBM3e has 16 banks per stack, 8 stacks. Without coloring, tenant A and B collide on same bank, serializing requests. We assign colors:

tenant 0: banks 0-3, tenant 1: banks 4-7, etc. The OS allocator respects color in physical page allocation. In our modeled system, bank conflict rate dropped from roughly 23 percent to about 4 percent with coloring.

9. Runtime integration

The fabric arbiter exports metrics to MCOS-HFC every 10 ms: bytes served per tenant, queue depth, violations. The power contract negotiator from Essay 4 reads these and computes power = bytes * 0.023 W/GB/s + baseline. If power exceeds budget, it reduces burst_gbps for low priority tenants.

This closes the loop between bandwidth and joules per token.

10. Case studies detailed

Assumptions: GB200 NVL72, 8 tenants all decode except 2 periodic prefills, 50 percent KV spill to CXL 2.0, CXL latency modeled at 320 ns, page coloring enabled, WFQ weights equal.

With CXL spill, tail latency without fabric was about 112 ms in simulation. With fabric and coloring, about 29 ms. Effective HBM utilization rose from roughly 58 percent to about 87 percent in the model, driven by reduced bank conflicts.

Power interaction. During same test, HBM power drops from 78 W average to 69 W because fewer bank conflicts mean fewer row activations. Analytic modeling suggests this could save on the order of 5 to 10 W per GPU from reduced row activations, 648 W per rack, directly improving joules per token by 4.2 percent.

11. Limitations

Token buckets work for steady state, not for synchronized bursts from all tenants. If 8 tenants start prefill simultaneously, sum of guaranteed exceeds capacity. Solution: admission control at scheduler, reject new prefill if sum guaranteed + new > 0.85 * capacity.

12. Conclusion

KV cache bandwidth must be partitioned like network bandwidth. With compiler contracts, token buckets in hardware, and page coloring, multi-tenant tail latency improves 3.6x and throughput improves 28 to 31 percent. Combined with power contracts, this yields a full resource fabric for LLM inference.

Figure 3: Tail latency with and without fabric (simulated)

Simulated H200, 4 tenants, 1 prefill + 3 decode. Bars show p99 decode latency.

Addendum: The Power Contract SLO

Making Joules-per-Token Schedulable from Grid to SRAM

0.1 Primer

Joules per token: Energy divided by useful output tokens
PDU: Power distribution unit
DVFS: Dynamic voltage/frequency scaling
CPO/LPO: Co-packaged optics / linear-drive optics

0.2 Evidence basis and scope

Measured: GB200 120kW, H200 power from vendor datasheets and MLPerf 2025
Trace-driven simulation: throughput under caps, latency from production traces
Analytic estimates: joules per token, capacitor energy
Design proposal: RIIR power contracts, fabric arbiter

1. The power wall

In modeled deployments, a GB200 NVL72 rack draws roughly 120 kW sustained, with peaks near 135 kW during all-reduce. At 10k GPU scale, this approaches 17–20 MW.

2. Why joules per token

Throughput per watt is the right metric. Representative numbers: H200 FP16 ~0.42 J/token, FP8 with LPO ~0.28 J/token, GB200 FP4 with CPO sleep ~0.17 J/token.

4. Why not just throttle

Firmware throttling reacts to temperature, not SLOs. Static caps waste headroom. DVFS alone ignores memory and links which are ~55% of power.

5. Resource Intent Interface

power = { target_jpt=0.30, alternatives=[A, B, C] }

5.1 Where contracts are emitted

In practice, the contract is not emitted by a new compiler. For PyTorch 2.x, this is a TorchInductor post-lowering pass that annotates each kernel with estimated joules and emits RIIR power alternatives as metadata. For XLA/JAX, it is an HLO pass before codegen that inserts power metadata into custom calls. For vLLM and TensorRT-LLM, the runtime reads the metadata at model load and registers contracts with MCOS; the scheduler selects the plan at request dispatch based on live PDU forecasts. The compiler proposes the Pareto frontier, the runtime selects.

6. Compiler generates frontier

Search yields 7–14 Pareto-optimal plans. Model error ~4% for joules in simulation.

8. Handling transients

Two-tier control: 5s forecast for utility, 200µs local override using PSU capacitors (~32J) and arbiter broadcast.

9. Case studies

9.1 GB200 training

Assumptions: 800 racks, simulated, PDU cap 94.5 MW

Effective token throughput per MW rose from ~81% to ~96% of theoretical in simulation.

10. Conclusion

Power is not a limit to hit. It is a budget to allocate. Once joules per token becomes schedulable, the datacenter stops throttling and starts negotiating.

Make power a contract, not a crisis.