Inference Is a Queueing System
Most teams still talk about serving in the language of model size, GPU count, and tokens per second. Production inference lives somewhere harsher: arrivals bunch, queues form, memory stalls stretch service times, and p99 blows up long before theoretical throughput is reached.
1. Why serving breaks before GPUs look “full”
The most misleading chart in AI infrastructure is often the “GPU utilization” chart. A fleet can show moderate utilization and still deliver awful user experience because the problem is not simply whether the accelerator is doing work. The problem is whether requests are waiting in line, whether batches are forming coherently, and whether the system can keep service times stable as the request mix changes.
Production inference is not one clean workload. It is a moving superposition of prompt lengths, output lengths, arrival bursts, tenant interference, cache locality, routing randomness, and transport delays. That means the system is better understood as a queueing network than as a single giant matrix multiply.
This is why teams get blindsided. They size the fleet using average tokens per second, then discover that user-visible quality is set by the worst 1% to 5% of interactions. The queue does not care about average optimism. It cares about burstiness, variance, and whether the system can shed or reshape load before waiting time compounds.
A concrete serving-day failure mode
Imagine a fleet sized for an average of 3,000 active decode streams with enough headroom for prompt bursts. At noon, one product surface launches a feature and request arrivals briefly double. The first thing that happens is not “the GPU maxes out.” The first thing that happens is queue depth rises at ingress, then batch formation windows stretch, then short requests start waiting behind longer ones, and finally decode loops inherit the instability as uneven work arrives at each step. By the time utilization dashboards look alarming, the user-facing damage is already underway.
This is one reason interactive systems feel brittle. A few seconds of demand distortion can produce minutes of recovery because the system is not draining from a clean state. It is draining from an inflated queue whose requests now carry more waiting time, more cancellation risk, and often more retries.
This is also why “autoscaling saved us” is often a post-hoc illusion. New capacity can help, but capacity that comes online after the queue has already inflated is arriving late to a system that is now spending part of its effort servicing stale requests, retries, or traffic that is no longer evenly shaped. The queue carries memory.
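The recovery asymmetry described above can be seen in a toy fluid-queue model. All numbers here are hypothetical: a service that drains 1,000 req/s, baseline arrivals of 900 req/s, and a 10-second burst at double the baseline. The backlog inflates in seconds but drains only at the rate of the spare capacity.

```python
# Toy fluid-queue sketch (all rates hypothetical). Service drains a fixed
# 1,000 req/s; arrivals run at 900 req/s except for a 10 s burst at 1,800 req/s.
def simulate_backlog(horizon_s=120, service_rate=1000.0,
                     base_rate=900.0, burst_rate=1800.0,
                     burst_start=30, burst_end=40):
    backlog = 0.0
    history = []
    for t in range(horizon_s):
        arrivals = burst_rate if burst_start <= t < burst_end else base_rate
        # Backlog grows by excess arrivals and can never go negative.
        backlog = max(0.0, backlog + arrivals - service_rate)
        history.append(backlog)
    return history

history = simulate_backlog()
peak = max(history)  # backlog at the end of the 10 s burst
# Draining leans on only 100 req/s of spare capacity, so recovery takes
# roughly 80 s: minutes of pain from seconds of demand distortion.
drain_time = next(t for t, b in enumerate(history) if t >= 40 and b == 0.0) - 40
```

The model ignores retries and cancellations, which in a real fleet make the drain slower still.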
Inference quality emerges from waiting time plus service-time variability across multiple stages, not from one average kernel speed number.
2. Little’s Law is the missing primitive
A lot of inference design confusion clears up once you bring back one boring but brutal identity: L = λW. The average number of requests in the system equals the arrival rate times the average time each request spends in the system. That one relation does not tell you everything, but it forces discipline: if the arrival rate rises or service time stretches, the number of requests in the system grows, and that growth appears to the user as “the model feels slow.”
In serving, W is not just compute time. It is admission delay, batching delay, scheduler delay, transfer delay, prefill time, decode iterations, and any waiting induced by cache or memory pressure. Once you look at the full path, it becomes obvious why small inefficiencies accumulate into visible tail pain.
- Traffic is rarely Poisson and rarely polite. Interactive products create synchronized spikes around human behavior, retries, and product events.
- Prompt length, decode length, cache residency, expert routing, and interconnect topology all perturb the work per request.
- The tail is dominated not by the mean alone, but by how uneven arrivals and service times become under contention.
This is one reason “more batching” is not a universal answer. Batching can improve arithmetic efficiency, but it also increases admission delay and couples short jobs to long jobs. If the fleet is operating near saturation, even small batch-formation delays can push p95 and p99 into uncomfortable territory.
A simple quantitative intuition
Suppose effective arrival rate is 900 requests per second and the average time in system is 120 milliseconds. Little’s Law says the system carries about 108 requests in flight on average. If memory stalls or batching overhead stretch average time in system to 180 milliseconds, in-flight work rises to about 162 requests without any increase in demand. That extra backlog is exactly what operators experience as “mysterious” saturation.
| Arrival rate | Average time in system | Average in-flight work | Operator-visible effect |
|---|---|---|---|
| 900 req/s | 120 ms | 108 requests | Stable queue, modest tail |
| 900 req/s | 180 ms | 162 requests | Longer waiting and stickier backlog |
| 1,050 req/s | 180 ms | 189 requests | Tail starts to climb sharply |
The deeper lesson is that you cannot reason about serving quality from tokens-per-second alone. You need a mental model of how waiting time changes as traffic class, prompt mix, and memory behavior distort service time. Otherwise the system always looks “fine” until it very much is not.
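The table above follows from one line of arithmetic, which is worth keeping handy as a sanity check when sizing a fleet:

```python
def inflight(arrival_rate_per_s, time_in_system_ms):
    """Little's Law, L = lambda * W, with W given in milliseconds."""
    return arrival_rate_per_s * time_in_system_ms / 1000.0

# Reproduces the table rows: stretching W grows in-flight work
# with no increase in demand at all.
baseline = inflight(900, 120)    # 108 requests in flight
stalled = inflight(900, 180)     # 162 requests, same arrival rate
stressed = inflight(1050, 180)   # 189 requests
```

The useful direction in practice is often the inverse: given a concurrency budget, the identity bounds how much service-time stretch the system can absorb before the queue inflates.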
3. Prefill and decode are different service centers
One of the clearest mistakes in inference thinking is to treat prefill and decode as one homogeneous stream of work. They are not. Prefill is a larger, denser pass over the prompt. Decode is an iterative, token-by-token control loop whose performance depends heavily on memory movement and residency behavior.
Once those two phases are separated conceptually, a lot of architecture choices become more legible. It becomes obvious why prefill/decode disaggregation exists, why cache transfer becomes a network problem, why scheduler policy matters so much, and why “GPU underutilization” can mean very different things in each phase. A half-idle prefill pool and a decode pool with unstable queue growth are not interchangeable pathologies.
- Prefill: large prompts create chunky work units. Admission and segmentation matter because a small number of long prompts can dominate a dispatch interval.
- Decode: each stream contributes many small iterations, so instability shows up as accumulated jitter rather than one giant stall event.
Why this breaks naive fleet planning
If prefill and decode are combined into one resource pool, the scheduler tends to trade one kind of pain for another. Dense prompt bursts can monopolize admission opportunities while decode streams accumulate small but compounding waits. Conversely, protecting decode too aggressively can cause prompt ingress to become spiky enough that batch formation quality collapses. The right answer depends on request mix, but the wrong answer is to assume there is one universal pool-level optimum.
| Dimension | Prefill | Decode |
|---|---|---|
| Primary bottleneck | Compute throughput and prompt ingestion | Memory bandwidth, cache locality, scheduler stability |
| Best-case behavior | Large dense batches | Stable small-step iteration with low queue jitter |
| Typical failure mode | Prompt spikes overwhelm admission | Tail latency stretches as batches, misses, and waits compound |
| Control priority | Shape ingress and isolate heavy prompts | Protect hot state and minimize service-time variance |
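The head-of-line effect behind the table can be made concrete with a toy FIFO calculation. All service times are illustrative: one chunky 500 ms prefill arriving just ahead of ten 5 ms decode steps.

```python
def fifo_waits(service_times_ms):
    """Waiting time of each job in a single FIFO queue served back-to-back."""
    waits, clock = [], 0.0
    for s in service_times_ms:
        waits.append(clock)
        clock += s
    return waits

# Shared pool: decode steps queue behind the long prefill.
shared = fifo_waits([500] + [5] * 10)
shared_decode_avg = sum(shared[1:]) / 10   # 522.5 ms average decode wait

# Separate decode lane: decode waits only behind other decode work.
split = fifo_waits([5] * 10)
split_decode_avg = sum(split) / 10         # 22.5 ms average decode wait
```

One chunky work unit inflates decode waiting by more than 20x here, which is the arithmetic core of why prefill/decode disaggregation exists.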
4. Batching helps until it quietly hurts
Batching is one of the most misunderstood levers in AI serving. In the happy-path story, more requests per batch means better GPU efficiency. In reality, batching is a trade: you gain device efficiency by accepting some waiting and by coupling multiple requests into one service decision.
That trade becomes dangerous when the request mix is heterogeneous. Short queries can get pinned behind long prompts. Decode iterations for one group can delay dispatch of another. Admission windows that look harmless in aggregate can create visible jitter for interactive users.
Batching is not a one-direction optimization. It is a balance between arithmetic gain and latency amplification.
Mature systems therefore segment traffic. They isolate prompt classes, separate latency-sensitive and throughput-oriented tenants, and treat batch formation as a policy problem rather than a hard-coded kernel preference. The scheduler needs to know when to refuse batch growth because the tail is more expensive than the extra throughput.
What the batch window is really doing
The batch window is a latency budget auction. Every extra millisecond spent waiting for another request is a bet that the future arithmetic gain will more than repay the waiting cost. That bet can make sense for long throughput-oriented generations, and make no sense at all for short interactive turns. Strong serving systems therefore let the window vary by traffic class instead of pretending one global batching rule is intelligent.
| Traffic class | What batching should optimize | What it should avoid |
|---|---|---|
| Interactive chat | Predictable latency and bounded jitter | Long waiting windows for marginal device gain |
| Background summarization | Dense device efficiency | Overprotecting p99 at the cost of large idle gaps |
| Long-context analytics | Memory-aware grouping and stable decode pacing | Mixing wildly different prompt and output geometries blindly |
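A per-class window policy of the kind the table implies can be sketched in a few lines. The class names, budgets, and batch cap below are hypothetical, not a real system's configuration:

```python
# Hypothetical per-class batch-window budgets, in milliseconds. The window
# is a latency budget, not a fixed kernel preference.
BATCH_WINDOW_MS = {
    "interactive_chat": 2,          # bounded jitter beats marginal device gain
    "background_summarization": 50, # dense batches are worth the wait
    "long_context_analytics": 20,   # group by memory footprint, pace decode
}

def should_wait_for_more(traffic_class, waited_ms, batch_size, max_batch=32):
    """Close the batch once the class budget is spent or the batch is full."""
    if batch_size >= max_batch:
        return False
    # Unknown classes get a zero budget: dispatch immediately.
    return waited_ms < BATCH_WINDOW_MS.get(traffic_class, 0)
```

The refusal branch is the important one: the scheduler declines batch growth when the marginal waiting cost exceeds the marginal arithmetic gain for that class.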
5. Memory pressure changes the service-time distribution
Queueing discussions often assume service time is mostly compute. In modern inference, that assumption fails precisely where real systems hurt. KV residency, weight streaming, interconnect congestion, and remote fetch all widen the service-time distribution. That widening matters because tails react more violently to variance than to mean alone.
A decode step that is usually fast but occasionally incurs a miss to HBM, host memory, or a remote shard is not just “a little slower.” It is injecting randomness into the system. Randomness inflates queue depth, queue depth increases waiting, waiting worsens batching dynamics, and the user experiences an apparently mysterious latency cliff.
- Hot KV or repeated weights leave the fastest tier, so each decode step pays extra movement cost.
- Requests land on nodes whose fabric path to state is longer or more congested than the average case.
- Work classes with incompatible service profiles are mixed into the same queues and batches.
- Timeouts and cancellations feed back into the system as new arrivals, which raises load during degradation.
Once you see memory as a source of service-time variance, the correct control priorities change. Protecting residency, reducing remote fetch probability, and making scheduling more topology-aware are queue-stability measures, not just micro-optimizations.
The tail-explosion path
The dangerous pattern is not one isolated miss. It is a miss that widens one request’s service time, which then extends the batch, which then delays the next dispatch opportunity, which then increases waiting for other requests. Under enough concurrency, the queue turns a local memory event into a fleet-level latency event. This is why long-context decode often feels stable until it suddenly does not.
The system-level damage comes from amplification, not just the first miss event.
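The amplification is easy to demonstrate with a toy Monte Carlo model. The numbers are hypothetical: decode steps that usually take 4 ms but pay 40 ms on a 2% miss, batched eight streams at a time, so the batch finishes with its slowest member.

```python
import random

random.seed(0)

def decode_step_ms(miss_prob, hit_ms=4.0, miss_ms=40.0):
    # A step is usually fast but occasionally pays a miss/remote-fetch penalty.
    return miss_ms if random.random() < miss_prob else hit_ms

def batch_completion_times(n_batches=10_000, batch_size=8, miss_prob=0.02):
    # A batched dispatch completes with its slowest member, so one miss
    # stretches the whole batch.
    return sorted(
        max(decode_step_ms(miss_prob) for _ in range(batch_size))
        for _ in range(n_batches)
    )

times = batch_completion_times()
p50 = times[len(times) // 2]
p99 = times[int(0.99 * len(times))]
# A 2% per-stream miss rate means roughly 15% of 8-stream batches contain at
# least one miss (1 - 0.98^8), so p50 stays at the hit cost while p99 sits
# at the full miss cost.
miss_batch_fraction = sum(t == 40.0 for t in times) / len(times)
```

Note that the per-stream event rate barely moved; batching converted a rare local event into a common batch-level one, and the queue then converts batch-level stretch into waiting for everyone behind it.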
6. What a sane inference control plane should optimize
A good inference control plane does not merely maximize average throughput. It manages a queueing network with asymmetric phases, memory-sensitive service times, and human-visible tail constraints. That means the control plane needs policy over arrival shaping, class isolation, queue admission, batch windows, topology placement, and state protection.
What this means in practice
- Separate tenants and request classes early. Admission control should begin before the accelerator boundary.
- Residency policy is queue policy. Stable memory behavior keeps decode service times from widening.
- p95 and p99 need first-class SLO handling. Averages are not a sufficient steering signal.
Metrics that actually deserve attention
The dashboards that matter most are usually not the glamorous ones. Queue depth by class, time spent waiting before batch admission, service-time variance for decode, KV transfer frequency, cancellation rate, and retry amplification tell you far more about serving quality than a single fleetwide utilization number. The point is not to collect more graphs. It is to measure where the queue is being distorted.
A practical control-loop checklist
1. Classify the request, estimate prompt geometry, and decide whether it belongs in an interactive lane, a throughput lane, or a specialized long-context lane.
2. Check queue age, batch window, and state locality together. Do not let a numerically attractive batch hide a bad waiting-time decision.
3. Watch for widening decode variance, transfer spikes, and repeated state misses. Those are leading indicators of queue instability.
4. Shed, isolate, or reshape work early. Waiting too long often means retries and stale queue contents make the recovery problem larger.
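A minimal sketch of the checklist as an admission decision. Every lane name, signal field, and threshold below is hypothetical; a real control plane would derive them from measured SLOs:

```python
def admit(request, signals):
    # Classify by prompt geometry into a lane (thresholds illustrative).
    if request["prompt_tokens"] > 32_000:
        lane = "long_context"
    elif request["interactive"]:
        lane = "interactive"
    else:
        lane = "throughput"

    # Consider queue age against the lane's budget, not batch size alone.
    # Shedding early beats servicing stale requests and their retries later.
    if signals["queue_age_ms"][lane] > signals["age_budget_ms"][lane]:
        return lane, "shed"

    # Widening decode variance is a leading indicator: admit the request
    # but refuse further batch growth until variance settles.
    if signals["decode_var_ratio"] > 2.0:
        return lane, "admit_no_batch_growth"

    return lane, "admit"

# Example signals snapshot (made-up values).
signals = {
    "queue_age_ms": {"interactive": 5, "throughput": 10, "long_context": 10},
    "age_budget_ms": {"interactive": 20, "throughput": 200, "long_context": 100},
    "decode_var_ratio": 1.2,  # current vs. baseline decode service-time variance
}
lane, decision = admit({"prompt_tokens": 400, "interactive": True}, signals)
```

The structure matters more than the thresholds: classification happens before admission, and the variance check can override batching ambition.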
This also explains why the most interesting serving companies and internal infrastructure teams are becoming scheduling companies. Once the math path is “good enough,” operational excellence shifts to controlling the geometry of work as it enters, waits, batches, moves, and exits.
Large-model inference will keep looking like a compute story from a distance. Up close, it is a queueing story shaped by memory, transport, and policy. The winning systems will not simply own faster chips. They will own more stable queues.
References and further reading
- John D. C. Little, “A Proof for the Queueing Formula: L = λW.”
- Mor Harchol-Balter, Performance Modeling and Design of Computer Systems.
- Leonard Kleinrock, Queueing Systems, Vols. 1 and 2.
- vLLM and PagedAttention papers and engineering materials for LLM serving behavior.
- NVIDIA TensorRT-LLM, Triton Inference Server, and CUDA performance guidance for batching and serving.
- Public engineering writing on prefill/decode disaggregation, KV transfer, and long-context inference serving.
- Classical work on tail latency, datacenter queueing, and service-time variability in distributed systems.
- Dean and Barroso, “The Tail at Scale,” for why small variance matters so much in large interactive systems.
- Production guidance from hyperscaler and inference-runtime teams on admission control, retry storms, and latency SLO design.