LLM Serving · Memory Systems · Fairness

The Scheduling Tax of Multi-Tenant Inference

Published · 8 min read

Why fairness and throughput in shared inference pools are fundamentally a memory problem — and why the scheduler's view of "fair" diverges from the memory system's reality by up to 64×.

By Manish KL · Technical Essay
Abstract

Continuous batching treats every request as a scheduling slot. But when a 128K-context request shares an H100 with forty 2K-token requests, the memory footprint ratio is 64:1. The scheduler sees equal citizens; the memory system sees a whale and forty minnows. This essay examines the structural consequences: KV-cache residency asymmetry, eviction economics that differ per request length, thermal and bandwidth concentration from long-context pages, and the implications for SLO enforcement. The argument is not that scheduling is wrong — it is that memory-aware fairness is a distinct design objective that current serving stacks leave implicit.

- 64×: KV-memory ratio between a 128K-context and a 2K-context request
- ~80%: HBM consumed by the top 5% longest requests in mixed pools
- 2–4×: eviction-rate increase for short requests when whales are co-scheduled
- p99: the latency percentile where memory unfairness surfaces first

Two definitions of fair

Every multi-tenant serving system implicitly carries a fairness contract. The scheduler's version is request-level: each request gets a slot, progresses through prefill and decode, and is subject to SLO enforcement on time-to-first-token and inter-token latency. This is the fairness that orchestration layers like vLLM, TensorRT-LLM, and SGLang enforce.

But there is a second fairness surface that these systems leave implicit: memory fairness. A request's KV cache grows linearly with its sequence length. In a paged attention system, a 128K-context request allocates ~4,000 KV pages while a 2K request allocates ~62 pages. Both occupy a single scheduling slot. But the 128K request consumes 64× the fast-tier memory, generates 64× the eviction pressure on co-scheduled requests, and creates 64× the bandwidth demand when its pages are promoted or demoted.
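The page counts above are simple arithmetic. A minimal sketch, assuming a 32-token page size in the style of paged-attention engines (the exact page size varies by serving stack, so treat it as an assumption):

```python
PAGE_TOKENS = 32  # assumed paged-attention page size


def kv_pages(seq_len: int) -> int:
    """Number of KV-cache pages a request of seq_len tokens allocates."""
    return -(-seq_len // PAGE_TOKENS)  # ceiling division


whale = kv_pages(128 * 1024)  # 4096 pages
minnow = kv_pages(2048)       # 64 pages
ratio = whale // minnow       # 64x: one slot, 64x the memory
```

Both requests occupy one scheduling slot each; only the page counts expose the 64× gap.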

The scheduler sees requests. The memory system sees pages. When request lengths span 64×, fairness measured in requests ≠ fairness measured in bytes.
[Figure 1: side-by-side comparison of the scheduler's slot view (41 equal slots, "fair") and the memory system's page view (one 128K request holding 4,000 pages, forty 2K requests holding 62 pages each, "deeply unfair") for the same batch on the same GPU.]
Figure 1. The scheduler allocates 41 equal slots. The memory system allocates 4,000 pages to one request and 62 pages each to the rest. Memory fairness and scheduling fairness are structurally different quantities.

The asymmetry is structural, not incidental

This is not edge-case reasoning. In production inference pools, sequence-length distributions are heavy-tailed. Retrieval-augmented generation (RAG) systems routinely inject 32K–128K context windows alongside short conversational queries. A single RAG-heavy request can consume more KV-cache memory than the other 50 requests in the same continuous batch combined.

The consequences cascade through three systems surfaces:

1. Eviction economics differ per request length

In a paged KV-cache system with limited fast-tier capacity, the eviction policy must decide which pages to demote. But the cost of evicting a long-context page versus a short-context page is fundamentally different. A long-context page that gets evicted and later re-accessed forces a re-promotion from host RAM — typically 10–50× slower than fast-tier access. But that page belongs to a request that holds 4,000 other pages, many of which may also be candidates for eviction. The eviction policy faces a combinatorial problem: evicting any subset of the whale's pages reduces pressure, but the whale's working set is so large that it will immediately refill the freed capacity.

Short-context requests, meanwhile, have tiny working sets that fit entirely in fast tier. Their eviction is cheap individually but catastrophic collectively — when enough minnow pages are pushed out by the whale, the short requests experience a latency cliff that the scheduler never sees.
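The latency cliff can be made concrete with a two-tier expected-cost model. A sketch with hypothetical latencies (fast-tier access ~1 µs, host-RAM re-promotion ~30× slower; real numbers depend on the interconnect and page size):

```python
def effective_latency_us(resident_frac: float,
                         t_fast_us: float = 1.0,
                         t_host_us: float = 30.0) -> float:
    """Expected per-page access latency when only resident_frac of a
    request's working set remains in the fast tier."""
    return resident_frac * t_fast_us + (1 - resident_frac) * t_host_us


# Fully resident: 1.0 us per access. Evict just 30% of a minnow's
# pages and the expected cost jumps to 9.7 us, nearly a 10x cliff,
# even though the scheduler's view of the request never changed.
```

The cliff is nonlinear in experience even though the model is linear: a minnow needs nearly all of its tiny working set resident to stay fast, so a modest eviction fraction produces an outsized latency hit.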

2. Bandwidth concentration creates thermal asymmetry

Long-context KV pages are accessed densely during the attention computation. A 128K request touches all 4,000 pages per decode step, generating sustained HBM bandwidth demand that can saturate a single channel. When co-scheduled with short requests, the whale's bandwidth demand can push memory-controller queue depths high enough to create contention — adding latency to short requests that have nothing to do with the whale's work.

This is not hypothetical. On H100 systems with 3.35 TB/s of HBM bandwidth, a single long-context request can consume 40–60% of available bandwidth during its attention pass, leaving the other 40 requests to compete for what remains.
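The 40–60% figure is consistent with a back-of-envelope check. A sketch assuming Llama-70B-like dimensions (80 layers, 8 KV heads, head dim 128, fp16) and a ~25 ms inter-token budget; all of these are illustrative assumptions, not measurements:

```python
# Assumed model dimensions (Llama-70B-like, fp16 KV cache).
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

seq_len = 128 * 1024
bytes_per_step = seq_len * kv_bytes_per_token  # full KV read per decode step

hbm_bw = 3.35e12                               # H100 SXM HBM bandwidth, bytes/s
read_ms = bytes_per_step / hbm_bw * 1e3

# ~43 GB of KV streamed per decode step -> ~12.8 ms of pure HBM time;
# against a ~25 ms inter-token budget, that is roughly half the
# machine's bandwidth claimed by a single request.
```

The point is not the exact percentage but the mechanism: the whale's bandwidth demand scales with its sequence length, while its scheduling cost stays one slot.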

3. SLO enforcement operates on the wrong metric

SLO measurement in serving stacks is per-request: time-to-first-token (TTFT) and inter-token latency (ITL). But memory unfairness manifests as variance, not mean degradation. Short requests experience occasional latency spikes when their pages are evicted by the whale — spikes that occur at p95/p99 but not at p50. The SLO system sees a request that "occasionally" misses its latency target, without tracing the cause to a memory-residency collision with a co-scheduled long-context request.

[Figure 2: "Eviction Cascade: How One Whale Pushes Out Forty Minnows." Timeline of minnow vs. whale inter-token latency as the whale's prefill allocates 4,000 pages, the KV pool passes 90% full, and minnow pages are evicted to host RAM, producing minnow p99 spikes.]
Figure 2. Eviction cascade timeline. The whale's prefill saturates the KV pool, forcing minnow pages to host RAM. Minnow ITL shows erratic p99 spikes while the whale's latency remains stable throughout — the whale is never evicted because its pages are always "recently accessed."

What a memory-fair policy would look like

A memory-fair serving system would need to reason about at least three dimensions that current schedulers leave implicit:

| Dimension | Current behavior | Memory-fair alternative |
| --- | --- | --- |
| Admission control | Admit if a scheduling slot is available | Admit only if the KV-page budget allows it without exceeding a per-request memory quota |
| Eviction priority | LRU across all pages uniformly | Eviction weighted by the owning request's total memory footprint: whale pages evicted before minnow pages under pressure |
| Bandwidth accounting | Not tracked per request | Per-request bandwidth metering so no single request can saturate the memory controllers |
| SLO attribution | Per-request latency only | Latency attributed to memory-residency state, with SLO violations traced to their eviction causes |
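The admission-control row reduces to a page-budget check. A minimal illustration; `KvPool`, `per_request_quota`, and the 32-token page size are hypothetical names and values, not an existing serving-stack API:

```python
from dataclasses import dataclass

PAGE_TOKENS = 32  # assumed page size


@dataclass
class KvPool:
    capacity_pages: int       # total KV pages in the fast tier
    per_request_quota: int    # memory entitlement per request
    used_pages: int = 0

    def admit(self, seq_len: int) -> bool:
        """Admit only if the request fits both the pool and its quota;
        a free scheduling slot alone is no longer sufficient."""
        need = -(-seq_len // PAGE_TOKENS)  # ceiling division
        if need > self.per_request_quota:
            return False  # whale exceeds its memory entitlement
        if self.used_pages + need > self.capacity_pages:
            return False  # pool would overflow
        self.used_pages += need
        return True
```

Under a 2,000-page quota, a 128K request (4,096 pages) is rejected even when scheduling slots are free, which is precisely the decoupling of slot from memory entitlement the table describes.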

The critical insight is that memory fairness is not about limiting long-context requests — they are valuable workload. It is about making the cost of their memory consumption visible to the scheduling and eviction logic, so that the system can make informed tradeoffs rather than imposing silent costs on co-tenants.

Design principle: A request's scheduling priority and its memory entitlement are separate knobs. Conflating them — treating a scheduling slot as a memory entitlement — is the architectural decision that creates the asymmetry.

The connection to KV-cache eviction research

This framing recontextualizes eviction policy work (including regret-aware eviction in kv-hierarchy-lab) in a multi-tenant setting. When eviction policies operate on pages without awareness of which request owns them, they implicitly favor whales: the whale's pages are always "recently accessed" (because the whale touches all of them every decode step), so LRU will always evict minnow pages first — even if the minnow pages are individually more valuable to their owning requests.

A request-aware eviction policy would weight eviction scores by the owning request's total footprint: a page belonging to a 4,000-page request is cheaper to evict (per-page) than a page belonging to a 62-page request, because the large request has far more residual capacity to absorb the miss. This is the memory-fairness analogue of progressive taxation: larger consumers pay proportionally more for the shared resource they dominate.
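One way to realize this weighting is to scale plain LRU staleness by the owner's footprint. A sketch of the idea, not the kv-hierarchy-lab policy itself:

```python
def eviction_order(pages: list[dict], now: int) -> list[dict]:
    """Rank candidate pages for eviction: higher score = evict sooner.

    Plain LRU ranks by staleness alone, which never evicts the whale
    (it touches all its pages every decode step). Multiplying staleness
    by the owner's footprint makes whale pages pay proportionally.
    """
    def score(p: dict) -> int:
        staleness = now - p["last_access"]
        # +1 so a just-touched page still carries its footprint weight
        return (staleness + 1) * p["owner_pages"]

    return sorted(pages, key=score, reverse=True)


# A whale page touched this step:  (0 + 1) * 4000 = 4000
# A minnow page idle for 5 steps:  (5 + 1) * 62   = 372
# The whale page is evicted first, reversing plain LRU's choice.
```

With footprints this skewed, even a freshly accessed whale page outranks a long-idle minnow page, which is exactly the "progressive taxation" behavior described above.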

What remains open

This essay describes a problem, not a complete solution. How to size per-request memory quotas, weight eviction by footprint, and meter bandwidth without starving legitimate long-context work all remain open questions in the design space for memory-fair multi-tenant inference.

The scheduler's job is to decide when and where to run a request. The memory system's job is to decide which bytes stay hot. When these two systems disagree about fairness, the memory system always wins — because latency is measured in bytes moved, not slots allocated.