Two definitions of fair
Every multi-tenant serving system implicitly carries a fairness contract. The scheduler's version is request-level: each request gets a slot, progresses through prefill and decode, and is subject to SLO enforcement on time-to-first-token and inter-token latency. This is the fairness that orchestration layers like vLLM, TensorRT-LLM, and SGLang enforce.
But there is a second fairness surface that these systems leave implicit: memory fairness. A request's KV cache grows linearly with its sequence length. In a paged attention system with 32-token pages, a 128K-context request allocates ~4,000 KV pages while a 2K request allocates ~62 pages. Both occupy a single scheduling slot. But the 128K request consumes 64× the fast-tier memory, generates 64× the eviction pressure on co-scheduled requests, and creates 64× the bandwidth demand when its pages are promoted or demoted.
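The 64× figure falls straight out of page arithmetic. A minimal sketch, assuming 32-token KV pages (the essay's counts imply 32; vLLM's default block size is actually 16 tokens, which would double both page counts but leave the ratio unchanged):

```python
import math

PAGE_TOKENS = 32  # assumed page granularity; not a universal default

def kv_pages(seq_len: int) -> int:
    """Number of KV-cache pages a request of seq_len tokens occupies."""
    return math.ceil(seq_len / PAGE_TOKENS)

long_req = kv_pages(128_000)   # 4000 pages
short_req = kv_pages(2_000)    # 63 pages (the essay rounds to ~62)
print(long_req, short_req, long_req / short_req)
```

Both requests hold exactly one scheduling slot, so a slot-level fairness metric reports them as equal while the page-level ratio is ~64×.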
The scheduler sees requests. The memory system sees pages. When request lengths span 64×, fairness measured in requests ≠ fairness measured in bytes.
The asymmetry is structural, not incidental
This is not edge-case reasoning. In production inference pools, sequence-length distributions are heavy-tailed. Retrieval-augmented generation (RAG) systems routinely inject 32K–128K context windows alongside short conversational queries. A single RAG-heavy request can consume more KV-cache memory than the other 50 requests in the same continuous batch combined.
The consequences cascade through three systems surfaces:
1. Eviction economics differ per request length
In a paged KV-cache system with limited fast-tier capacity, the eviction policy must decide which pages to demote. But the cost of evicting a long-context page versus a short-context page is fundamentally different. A long-context page that gets evicted and later re-accessed forces a re-promotion from host RAM — typically 10–50× slower than fast-tier access. But that page belongs to a request that holds 4,000 other pages, many of which may also be candidates for eviction. The eviction policy faces a combinatorial problem: evicting any subset of the whale's pages reduces pressure, but the whale's working set is so large that it will immediately refill the freed capacity.
Short-context requests, meanwhile, have tiny working sets that fit entirely in fast tier. Their eviction is cheap individually but catastrophic collectively — when enough minnow pages are pushed out by the whale, the short requests experience a latency cliff that the scheduler never sees.
2. Bandwidth concentration creates thermal asymmetry
Long-context KV pages are accessed densely during the attention computation. A 128K request touches all 4,000 pages per decode step, generating sustained HBM bandwidth demand that can saturate a single channel. When co-scheduled with short requests, the whale's bandwidth demand can push memory-controller queue depths high enough to create contention — adding latency to short requests that have nothing to do with the whale's work.
This is not hypothetical. On H100 systems with 3.35 TB/s HBM bandwidth, a single long-context request can consume 40–60% of available bandwidth during its attention pass, leaving the other 40 requests in the batch to compete for what's left.
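The 40–60% figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes an illustrative 70B-class model shape (80 layers, 8 KV heads under GQA, head dimension 128, fp16 cache) and a 25 ms inter-token-latency target; none of these parameters come from the essay:

```python
# Assumed model shape (illustrative 70B-class with GQA), fp16 KV cache.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2
HBM_BW = 3.35e12  # H100 HBM3, bytes/s

# K and V are each LAYERS * KV_HEADS * HEAD_DIM values per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # ~320 KB

seq_len = 128_000
kv_read_per_step = seq_len * kv_bytes_per_token  # bytes streamed per decode step

itl_target = 0.025  # 25 ms inter-token latency (assumed SLO)
bw_demand = kv_read_per_step / itl_target
print(f"{kv_read_per_step / 1e9:.1f} GB/step, "
      f"{bw_demand / HBM_BW:.0%} of HBM bandwidth")
```

Under these assumptions a single 128K request streams roughly 42 GB of KV per decode step, which works out to about half of the HBM bandwidth at a 25 ms ITL target, consistent with the 40–60% range above.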
3. SLO enforcement operates on the wrong metric
SLO measurement in serving stacks is per-request: time-to-first-token (TTFT) and inter-token latency (ITL). But memory unfairness manifests as variance, not mean degradation. Short requests experience occasional latency spikes when their pages are evicted by the whale — spikes that occur at p95/p99 but not at p50. The SLO system sees a request that "occasionally" misses its latency target, without tracing the cause to a memory-residency collision with a co-scheduled long-context request.
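The p50/p99 split is easy to see with synthetic numbers. Assuming, purely for illustration, that 3% of a short request's decode steps stall on page re-promotion:

```python
import statistics

# Synthetic inter-token latencies for a short request (milliseconds).
# Illustrative assumption: 97% of steps hit fast tier (~10 ms),
# 3% stall on re-promoting pages evicted by a co-scheduled whale (~200 ms).
samples = [10.0] * 97 + [200.0] * 3

p50 = statistics.median(samples)
p99 = sorted(samples)[98]          # 99th of 100 sorted samples
mean = statistics.fmean(samples)
print(f"p50={p50} ms  p99={p99} ms  mean={mean} ms")
```

The median is untouched and the mean barely moves, but p99 is 20× worse: exactly the signature a per-request SLO dashboard reports as an "occasional" miss with no visible cause.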
What a memory-fair policy would look like
A memory-fair serving system would need to reason about at least four dimensions that current schedulers leave implicit:
| Dimension | Current Behavior | Memory-Fair Alternative |
|---|---|---|
| Admission control | Admit if a scheduling slot is available | Admit if KV-page budget allows without exceeding per-request memory quota |
| Eviction priority | LRU across all pages uniformly | Eviction weighted by request's total memory footprint — whale pages evicted before minnow pages under pressure |
| Bandwidth accounting | Not tracked per request | Per-request bandwidth metering to prevent one request from saturating memory controllers |
| SLO attribution | Per-request latency only | Latency attributed to memory residency state — SLO violations traced to eviction causes |
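The admission-control row can be sketched as a page-budget check layered on top of slot availability. The class, quota values, and 32-token page assumption below are all hypothetical, not drawn from any serving stack:

```python
from dataclasses import dataclass, field

PAGE_TOKENS = 32  # assumed page granularity

@dataclass
class MemoryFairAdmission:
    """Sketch: admit on KV-page budget, not just slot count."""
    total_pages: int               # fast-tier KV pages in the pool
    per_request_quota: int         # max pages any one request may hold
    allocated: dict = field(default_factory=dict)  # request_id -> pages

    def try_admit(self, request_id: str,
                  prompt_tokens: int, max_new_tokens: int) -> bool:
        need = -(-(prompt_tokens + max_new_tokens) // PAGE_TOKENS)  # ceil div
        free = self.total_pages - sum(self.allocated.values())
        if need > self.per_request_quota or need > free:
            return False
        self.allocated[request_id] = need
        return True
```

With a 8,192-page pool and a 4,096-page quota, two 128K requests admit but a third is rejected, while short requests still fit in the residual pages: the budget, not the slot count, is what bounds whales.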
The critical insight is that memory fairness is not about limiting long-context requests — they are valuable workload. It is about making the cost of their memory consumption visible to the scheduling and eviction logic, so that the system can make informed tradeoffs rather than imposing silent costs on co-tenants.
The connection to KV-cache eviction research
This framing recontextualizes eviction policy work (including regret-aware eviction in kv-hierarchy-lab) in a multi-tenant setting. When eviction policies operate on pages without awareness of which request owns them, they implicitly favor whales: the whale's pages are always "recently accessed" (because the whale touches all of them every decode step), so LRU will always evict minnow pages first — even if the minnow pages are individually more valuable to their owning requests.
A request-aware eviction policy would weight eviction scores by the owning request's total footprint: a page belonging to a 4,000-page request is cheaper to evict (per-page) than a page belonging to a 62-page request, because the large request has far more residual capacity to absorb the miss. This is the memory-fairness analogue of progressive taxation: larger consumers pay proportionally more for the shared resource they dominate.
What remains open
This essay describes a problem, not a complete solution. The design space for memory-fair multi-tenant inference includes several open questions:
- Quota granularity: Should memory quotas be per-request, per-user, per-priority-class, or per-SLO tier? Each granularity creates different fairness properties and different implementation complexity.
- Dynamic vs. static quotas: Fixed per-request memory budgets are simple but wasteful — a short request never uses its full quota. Dynamic quotas that grow with sequence length but are bounded by the pool's capacity require real-time accounting.
- Fairness vs. throughput: Strict memory fairness may reduce overall throughput by preventing the system from fully utilizing fast-tier capacity. The optimal point on the fairness-throughput frontier is workload-dependent and likely requires SLO-aware tuning.
- Interaction with prefill-decode disaggregation: In disaggregated architectures where prefill and decode run on separate hardware, the memory-fairness problem shifts — the decode pool inherits the KV pages from prefill without having participated in the admission decision.
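As one point in the dynamic-vs-static corner of that design space, a dynamic quota can be as simple as tracking actual sequence length under a pool-share cap. The rule, the 25% cap, and the 32-token page assumption are all hypothetical:

```python
def dynamic_quota(seq_len: int, pool_pages: int,
                  cap_fraction: float = 0.25) -> int:
    """One possible dynamic-quota rule (illustrative, not a recommendation).

    A request's page budget grows with its actual sequence length, so short
    requests waste nothing, but is capped at a fixed share of the pool so
    no single request can monopolize fast tier.
    """
    need = -(-seq_len // 32)  # ceil to 32-token pages (assumed granularity)
    return min(need, int(pool_pages * cap_fraction))
```

A 2K request in a 16,384-page pool gets exactly the 63 pages it needs, while a 200K request is clamped to the 4,096-page cap; the real-time accounting cost mentioned above comes from keeping `seq_len` current as decode proceeds.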
The scheduler's job is to decide when and where to run a request. The memory system's job is to decide which bytes stay hot. When these two systems disagree about fairness, the memory system always wins — because latency is measured in bytes moved, not slots allocated.