Two definitions of fair
Every multi-tenant serving system implicitly carries a fairness contract. The scheduler's version is request-level: each request gets a slot, progresses through prefill and decode, and is subject to SLO enforcement on time-to-first-token and inter-token latency. This is the fairness that orchestration layers like vLLM, TensorRT-LLM, and SGLang enforce.
But there is a second fairness surface that these systems leave implicit: memory fairness. A request's KV cache grows linearly with its sequence length. In a paged attention system with 32-token pages, a 128K-context request allocates ~4,000 KV pages while a 2K request allocates ~62 pages. Both occupy a single scheduling slot. But the 128K request consumes 64× the fast-tier memory, generates 64× the eviction pressure on co-scheduled requests, and creates 64× the bandwidth demand when its pages are promoted or demoted.
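The 64× figure falls straight out of page arithmetic. A minimal sketch, assuming 32-token KV pages (the essay's counts imply 32; vLLM's default block size is actually 16 tokens, which would double both page counts but leave the ratio unchanged):

```python
import math

PAGE_TOKENS = 32  # assumed page granularity; not a universal default

def kv_pages(seq_len: int) -> int:
    """Number of KV-cache pages a request of seq_len tokens occupies."""
    return math.ceil(seq_len / PAGE_TOKENS)

long_req = kv_pages(128_000)   # 4000 pages
short_req = kv_pages(2_000)    # 63 pages (the essay rounds to ~62)
print(long_req, short_req, long_req / short_req)
```

Both requests hold exactly one scheduling slot, so a slot-level fairness metric reports them as equal while the page-level ratio is ~64×.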
The scheduler sees requests. The memory system sees pages. When request lengths span 64×, fairness measured in requests ≠ fairness measured in bytes.
The asymmetry is structural, not incidental
This is not edge-case reasoning. In production inference pools, sequence-length distributions are heavy-tailed. Retrieval-augmented generation (RAG) systems routinely inject 32K–128K context windows alongside short conversational queries. A single RAG-heavy request can consume more KV-cache memory than the other 50 requests in the same continuous batch combined.
The consequences cascade through three systems surfaces:
1. Eviction economics differ per request length
In a paged KV-cache system with limited fast-tier capacity, the eviction policy must decide which pages to demote. But the cost of evicting a long-context page versus a short-context page is fundamentally different. A long-context page that gets evicted and later re-accessed forces a re-promotion from host RAM — typically 10–50× slower than fast-tier access. But that page belongs to a request that holds 4,000 other pages, many of which may also be candidates for eviction. The eviction policy faces a combinatorial problem: evicting any subset of the whale's pages reduces pressure, but the whale's working set is so large that it will immediately refill the freed capacity.
Short-context requests, meanwhile, have tiny working sets that fit entirely in fast tier. Their eviction is cheap individually but catastrophic collectively — when enough minnow pages are pushed out by the whale, the short requests experience a latency cliff that the scheduler never sees.
2. Bandwidth concentration creates thermal asymmetry
Long-context KV pages are accessed densely during the attention computation. A 128K request touches all 4,000 pages per decode step, generating sustained HBM bandwidth demand that can saturate a single channel. When co-scheduled with short requests, the whale's bandwidth demand can push memory-controller queue depths high enough to create contention — adding latency to short requests that have nothing to do with the whale's work.
This is not hypothetical. On H100 systems with 3.35 TB/s HBM bandwidth, a single long-context request can consume 40–60% of available bandwidth during its attention pass, leaving the other 40 requests in the batch to compete for what's left.
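The 40–60% figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes an illustrative 70B-class model shape (80 layers, 8 KV heads under GQA, head dimension 128, fp16 cache) and a 25 ms inter-token-latency target; none of these parameters come from the essay:

```python
# Assumed model shape (illustrative 70B-class with GQA), fp16 KV cache.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2
HBM_BW = 3.35e12  # H100 HBM3, bytes/s

# K and V are each LAYERS * KV_HEADS * HEAD_DIM values per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # ~320 KB

seq_len = 128_000
kv_read_per_step = seq_len * kv_bytes_per_token  # bytes streamed per decode step

itl_target = 0.025  # 25 ms inter-token latency (assumed SLO)
bw_demand = kv_read_per_step / itl_target
print(f"{kv_read_per_step / 1e9:.1f} GB/step, "
      f"{bw_demand / HBM_BW:.0%} of HBM bandwidth")
```

Under these assumptions a single 128K request streams roughly 42 GB of KV per decode step, which works out to about half of the HBM bandwidth at a 25 ms ITL target, consistent with the 40–60% range above.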
3. SLO enforcement operates on the wrong metric
SLO measurement in serving stacks is per-request: time-to-first-token (TTFT) and inter-token latency (ITL). But memory unfairness manifests as variance, not mean degradation. Short requests experience occasional latency spikes when their pages are evicted by the whale — spikes that occur at p95/p99 but not at p50. The SLO system sees a request that "occasionally" misses its latency target, without tracing the cause to a memory-residency collision with a co-scheduled long-context request.
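The p50/p99 split is easy to see with synthetic numbers. Assuming, purely for illustration, that 3% of a short request's decode steps stall on page re-promotion:

```python
import statistics

# Synthetic inter-token latencies for a short request (milliseconds).
# Illustrative assumption: 97% of steps hit fast tier (~10 ms),
# 3% stall on re-promoting pages evicted by a co-scheduled whale (~200 ms).
samples = [10.0] * 97 + [200.0] * 3

p50 = statistics.median(samples)
p99 = sorted(samples)[98]          # 99th of 100 sorted samples
mean = statistics.fmean(samples)
print(f"p50={p50} ms  p99={p99} ms  mean={mean} ms")
```

The median is untouched and the mean barely moves, but p99 is 20× worse: exactly the signature a per-request SLO dashboard reports as an "occasional" miss with no visible cause.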
What a memory-fair policy would look like
A memory-fair serving system would need to reason about at least four dimensions that current schedulers leave implicit:
| Dimension | Current Behavior | Memory-Fair Alternative |
|---|---|---|
| Admission control | Admit if a scheduling slot is available | Admit if KV-page budget allows without exceeding per-request memory quota |
| Eviction priority | LRU across all pages uniformly | Eviction weighted by request's total memory footprint — whale pages evicted before minnow pages under pressure |
| Bandwidth accounting | Not tracked per request | Per-request bandwidth metering to prevent one request from saturating memory controllers |
| SLO attribution | Per-request latency only | Latency attributed to memory residency state — SLO violations traced to eviction causes |
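The admission-control row can be sketched as a page-budget check layered on top of slot availability. The class, quota values, and 32-token page assumption below are all hypothetical, not drawn from any serving stack:

```python
from dataclasses import dataclass, field

PAGE_TOKENS = 32  # assumed page granularity

@dataclass
class MemoryFairAdmission:
    """Sketch: admit on KV-page budget, not just slot count."""
    total_pages: int               # fast-tier KV pages in the pool
    per_request_quota: int         # max pages any one request may hold
    allocated: dict = field(default_factory=dict)  # request_id -> pages

    def try_admit(self, request_id: str,
                  prompt_tokens: int, max_new_tokens: int) -> bool:
        need = -(-(prompt_tokens + max_new_tokens) // PAGE_TOKENS)  # ceil div
        free = self.total_pages - sum(self.allocated.values())
        if need > self.per_request_quota or need > free:
            return False
        self.allocated[request_id] = need
        return True
```

With a 8,192-page pool and a 4,096-page quota, two 128K requests admit but a third is rejected, while short requests still fit in the residual pages: the budget, not the slot count, is what bounds whales.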
The critical insight is that memory fairness is not about limiting long-context requests — they are valuable workload. It is about making the cost of their memory consumption visible to the scheduling and eviction logic, so that the system can make informed tradeoffs rather than imposing silent costs on co-tenants.
The connection to KV-cache eviction research
This framing recontextualizes eviction policy work (including regret-aware eviction in kv-hierarchy-lab) in a multi-tenant setting. When eviction policies operate on pages without awareness of which request owns them, they implicitly favor whales: the whale's pages are always "recently accessed" (because the whale touches all of them every decode step), so LRU will always evict minnow pages first — even if the minnow pages are individually more valuable to their owning requests.
A request-aware eviction policy would weight eviction scores by the owning request's total footprint: a page belonging to a 4,000-page request is cheaper to evict (per-page) than a page belonging to a 62-page request, because the large request has far more residual capacity to absorb the miss. This is the memory-fairness analogue of progressive taxation: larger consumers pay proportionally more for the shared resource they dominate.
What remains open
This essay describes a problem, not a complete solution. The design space for memory-fair multi-tenant inference includes several open questions:
- Quota granularity: Should memory quotas be per-request, per-user, per-priority-class, or per-SLO tier? Each granularity creates different fairness properties and different implementation complexity.
- Dynamic vs. static quotas: Fixed per-request memory budgets are simple but wasteful — a short request never uses its full quota. Dynamic quotas that grow with sequence length but are bounded by the pool's capacity require real-time accounting.
- Fairness vs. throughput: Strict memory fairness may reduce overall throughput by preventing the system from fully utilizing fast-tier capacity. The optimal point on the fairness-throughput frontier is workload-dependent and likely requires SLO-aware tuning.
- Interaction with prefill-decode disaggregation: In disaggregated architectures where prefill and decode run on separate hardware, the memory-fairness problem shifts — the decode pool inherits the KV pages from prefill without having participated in the admission decision.
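As one point in the dynamic-vs-static corner of that design space, a dynamic quota can be as simple as tracking actual sequence length under a pool-share cap. The rule, the 25% cap, and the 32-token page assumption are all hypothetical:

```python
def dynamic_quota(seq_len: int, pool_pages: int,
                  cap_fraction: float = 0.25) -> int:
    """One possible dynamic-quota rule (illustrative, not a recommendation).

    A request's page budget grows with its actual sequence length, so short
    requests waste nothing, but is capped at a fixed share of the pool so
    no single request can monopolize fast tier.
    """
    need = -(-seq_len // 32)  # ceil to 32-token pages (assumed granularity)
    return min(need, int(pool_pages * cap_fraction))
```

A 2K request in a 16,384-page pool gets exactly the 63 pages it needs, while a 200K request is clamped to the 4,096-page cap; the real-time accounting cost mentioned above comes from keeping `seq_len` current as decode proceeds.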
The scheduler's job is to decide when and where to run a request. The memory system's job is to decide which bytes stay hot. When these two systems disagree about fairness, the memory system always wins — because latency is measured in bytes moved, not slots allocated.