Inference Is a Queueing System
Most teams still talk about serving in the language of model size, GPU count, and tokens per second. Production inference lives somewhere harsher: arrivals bunch, queues form, memory stalls stretch service times, and p99 blows up long before theoretical throughput is reached.
1. Why serving breaks before GPUs look “full”
The most misleading chart in AI infrastructure is often the “GPU utilization” chart. A fleet can show moderate utilization and still deliver awful user experience because the problem is not simply whether the accelerator is doing work. The problem is whether requests are waiting in line, whether batches are forming coherently, and whether the system can keep service times stable as the request mix changes.
Production inference is not one clean workload. It is a moving superposition of prompt lengths, output lengths, arrival bursts, tenant interference, cache locality, routing randomness, and transport delays. That means the system is better understood as a queueing network than as a single giant matrix multiply.
This is why teams get blindsided. They size the fleet using average tokens per second, then discover that user-visible quality is set by the worst 1% to 5% of interactions. The queue does not care about average optimism. It cares about burstiness, variance, and whether the system can shed or reshape load before waiting time compounds.
A concrete serving-day failure mode
Imagine a fleet sized for an average of 3,000 active decode streams with enough headroom for prompt bursts. At noon, one product surface launches a feature and request arrivals briefly double. The first thing that happens is not “the GPU maxes out.” The first thing that happens is queue depth rises at ingress, then batch formation windows stretch, then short requests start waiting behind longer ones, and finally decode loops inherit the instability as uneven work arrives at each step. By the time utilization dashboards look alarming, the user-facing damage is already underway.
This is one reason interactive systems feel brittle. A few seconds of demand distortion can produce minutes of recovery because the system is not draining from a clean state. It is draining from an inflated queue whose requests now carry more waiting time, more cancellation risk, and often more retries.
This is also why “autoscaling saved us” is often a post-hoc illusion. New capacity can help, but capacity that comes online after the queue has already inflated is arriving late to a system that is now spending part of its effort servicing stale requests, retries, or traffic that is no longer evenly shaped. The queue carries memory.
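The recovery asymmetry described above can be seen in a toy fluid-queue model. All numbers here are hypothetical: a service that drains 1,000 req/s, baseline arrivals of 900 req/s, and a 10-second burst at double the baseline. The backlog inflates in seconds but drains only at the rate of the spare capacity.

```python
# Toy fluid-queue sketch (all rates hypothetical). Service drains a fixed
# 1,000 req/s; arrivals run at 900 req/s except for a 10 s burst at 1,800 req/s.
def simulate_backlog(horizon_s=120, service_rate=1000.0,
                     base_rate=900.0, burst_rate=1800.0,
                     burst_start=30, burst_end=40):
    backlog = 0.0
    history = []
    for t in range(horizon_s):
        arrivals = burst_rate if burst_start <= t < burst_end else base_rate
        # Backlog grows by excess arrivals and can never go negative.
        backlog = max(0.0, backlog + arrivals - service_rate)
        history.append(backlog)
    return history

history = simulate_backlog()
peak = max(history)  # backlog at the end of the 10 s burst
# Draining leans on only 100 req/s of spare capacity, so recovery takes
# roughly 80 s: minutes of pain from seconds of demand distortion.
drain_time = next(t for t, b in enumerate(history) if t >= 40 and b == 0.0) - 40
```

The model ignores retries and cancellations, which in a real fleet make the drain slower still.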
Inference quality emerges from waiting time plus service-time variability across multiple stages, not from one average kernel speed number.
2. Little’s Law is the missing primitive
A lot of inference design confusion clears up once you bring back one boring but brutal identity: L = λW. The average number of requests in the system equals the arrival rate times the average time each request spends in the system. That one relation does not tell you everything, but it forces discipline: if the arrival rate rises or service time stretches, the number of requests in the system grows, and that growth appears to the user as “the model feels slow.”
In serving, W is not just compute time. It is admission delay, batching delay, scheduler delay, transfer delay, prefill time, decode iterations, and any waiting induced by cache or memory pressure. Once you look at the full path, it becomes obvious why small inefficiencies accumulate into visible tail pain.
- Traffic is rarely Poisson and rarely polite. Interactive products create synchronized spikes around human behavior, retries, and product events.
- Prompt length, decode length, cache residency, expert routing, and interconnect topology all perturb the work per request.
- The tail is dominated not by the mean alone, but by how uneven arrivals and service times become under contention.
This is one reason “more batching” is not a universal answer. Batching can improve arithmetic efficiency, but it also increases admission delay and couples short jobs to long jobs. If the fleet is operating near saturation, even small batch-formation delays can push p95 and p99 into uncomfortable territory.
A simple quantitative intuition
Suppose effective arrival rate is 900 requests per second and the average time in system is 120 milliseconds. Little’s Law says the system carries about 108 requests in flight on average. If memory stalls or batching overhead stretch average time in system to 180 milliseconds, in-flight work rises to about 162 requests without any increase in demand. That extra backlog is exactly what operators experience as “mysterious” saturation.
| Arrival rate | Average time in system | Average in-flight work | Operator-visible effect |
|---|---|---|---|
| 900 req/s | 120 ms | 108 requests | Stable queue, modest tail |
| 900 req/s | 180 ms | 162 requests | Longer waiting and stickier backlog |
| 1,050 req/s | 180 ms | 189 requests | Tail starts to climb sharply |
The deeper lesson is that you cannot reason about serving quality from tokens-per-second alone. You need a mental model of how waiting time changes as traffic class, prompt mix, and memory behavior distort service time. Otherwise the system always looks “fine” until it very much is not.
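The table above follows from one line of arithmetic, which is worth keeping handy as a sanity check when sizing a fleet:

```python
def inflight(arrival_rate_per_s, time_in_system_ms):
    """Little's Law, L = lambda * W, with W given in milliseconds."""
    return arrival_rate_per_s * time_in_system_ms / 1000.0

# Reproduces the table rows: stretching W grows in-flight work
# with no increase in demand at all.
baseline = inflight(900, 120)    # 108 requests in flight
stalled = inflight(900, 180)     # 162 requests, same arrival rate
stressed = inflight(1050, 180)   # 189 requests
```

The useful direction in practice is often the inverse: given a concurrency budget, the identity bounds how much service-time stretch the system can absorb before the queue inflates.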
3. Prefill and decode are different service centers
One of the clearest mistakes in inference thinking is to treat prefill and decode as one homogeneous stream of work. They are not. Prefill is a larger, denser pass over the prompt. Decode is an iterative, token-by-token control loop whose performance depends heavily on memory movement and residency behavior.
Once those two phases are separated conceptually, a lot of architecture choices become more legible. It becomes obvious why prefill/decode disaggregation exists, why cache transfer becomes a network problem, why scheduler policy matters so much, and why “GPU underutilization” can mean very different things in each phase. A half-idle prefill pool and a decode pool with unstable queue growth are not interchangeable pathologies.
- Prefill: large prompts create chunky work units. Admission and segmentation matter because a small number of long prompts can dominate a dispatch interval.
- Decode: each stream contributes many small iterations, so instability shows up as accumulated jitter rather than one giant stall event.
Why this breaks naive fleet planning
If prefill and decode are combined into one resource pool, the scheduler tends to trade one kind of pain for another. Dense prompt bursts can monopolize admission opportunities while decode streams accumulate small but compounding waits. Conversely, protecting decode too aggressively can cause prompt ingress to become spiky enough that batch formation quality collapses. The right answer depends on request mix, but the wrong answer is to assume there is one universal pool-level optimum.
| Dimension | Prefill | Decode |
|---|---|---|
| Primary bottleneck | Compute throughput and prompt ingestion | Memory bandwidth, cache locality, scheduler stability |
| Best-case behavior | Large dense batches | Stable small-step iteration with low queue jitter |
| Typical failure mode | Prompt spikes overwhelm admission | Tail latency stretches as batches, misses, and waits compound |
| Control priority | Shape ingress and isolate heavy prompts | Protect hot state and minimize service-time variance |
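The head-of-line effect behind the table can be made concrete with a toy FIFO calculation. All service times are illustrative: one chunky 500 ms prefill arriving just ahead of ten 5 ms decode steps.

```python
def fifo_waits(service_times_ms):
    """Waiting time of each job in a single FIFO queue served back-to-back."""
    waits, clock = [], 0.0
    for s in service_times_ms:
        waits.append(clock)
        clock += s
    return waits

# Shared pool: decode steps queue behind the long prefill.
shared = fifo_waits([500] + [5] * 10)
shared_decode_avg = sum(shared[1:]) / 10   # 522.5 ms average decode wait

# Separate decode lane: decode waits only behind other decode work.
split = fifo_waits([5] * 10)
split_decode_avg = sum(split) / 10         # 22.5 ms average decode wait
```

One chunky work unit inflates decode waiting by more than 20x here, which is the arithmetic core of why prefill/decode disaggregation exists.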
4. Batching helps until it quietly hurts
Batching is one of the most misunderstood levers in AI serving. In the happy-path story, more requests per batch means better GPU efficiency. In reality, batching is a trade: you gain device efficiency by accepting some waiting and by coupling multiple requests into one service decision.
That trade becomes dangerous when the request mix is heterogeneous. Short queries can get pinned behind long prompts. Decode iterations for one group can delay dispatch of another. Admission windows that look harmless in aggregate can create visible jitter for interactive users.
Batching is not a one-direction optimization. It is a balance between arithmetic gain and latency amplification.
Mature systems therefore segment traffic. They isolate prompt classes, separate latency-sensitive and throughput-oriented tenants, and treat batch formation as a policy problem rather than a hard-coded kernel preference. The scheduler needs to know when to refuse batch growth because the tail is more expensive than the extra throughput.
What the batch window is really doing
The batch window is a latency budget auction. Every extra millisecond spent waiting for another request is a bet that the future arithmetic gain will more than repay the waiting cost. That bet can make sense for long throughput-oriented generations, and make no sense at all for short interactive turns. Strong serving systems therefore let the window vary by traffic class instead of pretending one global batching rule is intelligent.
| Traffic class | What batching should optimize | What it should avoid |
|---|---|---|
| Interactive chat | Predictable latency and bounded jitter | Long waiting windows for marginal device gain |
| Background summarization | Dense device efficiency | Overprotecting p99 at the cost of large idle gaps |
| Long-context analytics | Memory-aware grouping and stable decode pacing | Mixing wildly different prompt and output geometries blindly |
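A per-class window policy of the kind the table implies can be sketched in a few lines. The class names, budgets, and batch cap below are hypothetical, not a real system's configuration:

```python
# Hypothetical per-class batch-window budgets, in milliseconds. The window
# is a latency budget, not a fixed kernel preference.
BATCH_WINDOW_MS = {
    "interactive_chat": 2,          # bounded jitter beats marginal device gain
    "background_summarization": 50, # dense batches are worth the wait
    "long_context_analytics": 20,   # group by memory footprint, pace decode
}

def should_wait_for_more(traffic_class, waited_ms, batch_size, max_batch=32):
    """Close the batch once the class budget is spent or the batch is full."""
    if batch_size >= max_batch:
        return False
    # Unknown classes get a zero budget: dispatch immediately.
    return waited_ms < BATCH_WINDOW_MS.get(traffic_class, 0)
```

The refusal branch is the important one: the scheduler declines batch growth when the marginal waiting cost exceeds the marginal arithmetic gain for that class.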
5. Memory pressure changes the service-time distribution
Queueing discussions often assume service time is mostly compute. In modern inference, that assumption fails precisely where real systems hurt. KV residency, weight streaming, interconnect congestion, and remote fetch all widen the service-time distribution. That widening matters because tails react more violently to variance than to mean alone.
A decode step that is usually fast but occasionally incurs a miss to HBM, host memory, or a remote shard is not just “a little slower.” It is injecting randomness into the system. Randomness inflates queue depth, queue depth increases waiting, waiting worsens batching dynamics, and the user experiences an apparently mysterious latency cliff.
- Hot KV or repeated weights leave the fastest tier, so each decode step pays extra movement cost.
- Requests land on nodes whose fabric path to state is longer or more congested than the average case.
- Work classes with incompatible service profiles are mixed into the same queues and batches.
- Timeouts and cancellations feed back into the system as new arrivals, which raises load during degradation.
Once you see memory as a source of service-time variance, the correct control priorities change. Protecting residency, reducing remote fetch probability, and making scheduling more topology-aware are queue-stability measures, not just micro-optimizations.
The tail-explosion path
The dangerous pattern is not one isolated miss. It is a miss that widens one request’s service time, which then extends the batch, which then delays the next dispatch opportunity, which then increases waiting for other requests. Under enough concurrency, the queue turns a local memory event into a fleet-level latency event. This is why long-context decode often feels stable until it suddenly does not.
The system-level damage comes from amplification, not just the first miss event.
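The amplification is easy to demonstrate with a toy Monte Carlo model. The numbers are hypothetical: decode steps that usually take 4 ms but pay 40 ms on a 2% miss, batched eight streams at a time, so the batch finishes with its slowest member.

```python
import random

random.seed(0)

def decode_step_ms(miss_prob, hit_ms=4.0, miss_ms=40.0):
    # A step is usually fast but occasionally pays a miss/remote-fetch penalty.
    return miss_ms if random.random() < miss_prob else hit_ms

def batch_completion_times(n_batches=10_000, batch_size=8, miss_prob=0.02):
    # A batched dispatch completes with its slowest member, so one miss
    # stretches the whole batch.
    return sorted(
        max(decode_step_ms(miss_prob) for _ in range(batch_size))
        for _ in range(n_batches)
    )

times = batch_completion_times()
p50 = times[len(times) // 2]
p99 = times[int(0.99 * len(times))]
# A 2% per-stream miss rate means roughly 15% of 8-stream batches contain at
# least one miss (1 - 0.98^8), so p50 stays at the hit cost while p99 sits
# at the full miss cost.
miss_batch_fraction = sum(t == 40.0 for t in times) / len(times)
```

Note that the per-stream event rate barely moved; batching converted a rare local event into a common batch-level one, and the queue then converts batch-level stretch into waiting for everyone behind it.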
6. What a sane inference control plane should optimize
A good inference control plane does not merely maximize average throughput. It manages a queueing network with asymmetric phases, memory-sensitive service times, and human-visible tail constraints. That means the control plane needs policy over arrival shaping, class isolation, queue admission, batch windows, topology placement, and state protection.
What this means in practice
- Separate tenants and request classes early. Admission control should begin before the accelerator boundary.
- Residency policy is queue policy. Stable memory behavior keeps decode service times from widening.
- p95 and p99 need first-class SLO handling. Averages are not a sufficient steering signal.
Metrics that actually deserve attention
The dashboards that matter most are usually not the glamorous ones. Queue depth by class, time spent waiting before batch admission, service-time variance for decode, KV transfer frequency, cancellation rate, and retry amplification tell you far more about serving quality than a single fleetwide utilization number. The point is not to collect more graphs. It is to measure where the queue is being distorted.
A practical control-loop checklist
1. Classify the request, estimate prompt geometry, and decide whether it belongs in an interactive lane, a throughput lane, or a specialized long-context lane.
2. Check queue age, batch window, and state locality together. Do not let a numerically attractive batch hide a bad waiting-time decision.
3. Watch for widening decode variance, transfer spikes, and repeated state misses. Those are leading indicators of queue instability.
4. Shed, isolate, or reshape work early. Waiting too long often means retries and stale queue contents make the recovery problem larger.
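A minimal sketch of the checklist as an admission decision. Every lane name, signal field, and threshold below is hypothetical; a real control plane would derive them from measured SLOs:

```python
def admit(request, signals):
    # Classify by prompt geometry into a lane (thresholds illustrative).
    if request["prompt_tokens"] > 32_000:
        lane = "long_context"
    elif request["interactive"]:
        lane = "interactive"
    else:
        lane = "throughput"

    # Consider queue age against the lane's budget, not batch size alone.
    # Shedding early beats servicing stale requests and their retries later.
    if signals["queue_age_ms"][lane] > signals["age_budget_ms"][lane]:
        return lane, "shed"

    # Widening decode variance is a leading indicator: admit the request
    # but refuse further batch growth until variance settles.
    if signals["decode_var_ratio"] > 2.0:
        return lane, "admit_no_batch_growth"

    return lane, "admit"

# Example signals snapshot (made-up values).
signals = {
    "queue_age_ms": {"interactive": 5, "throughput": 10, "long_context": 10},
    "age_budget_ms": {"interactive": 20, "throughput": 200, "long_context": 100},
    "decode_var_ratio": 1.2,  # current vs. baseline decode service-time variance
}
lane, decision = admit({"prompt_tokens": 400, "interactive": True}, signals)
```

The structure matters more than the thresholds: classification happens before admission, and the variance check can override batching ambition.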
This also explains why the most interesting serving companies and internal infrastructure teams are becoming scheduling companies. Once the math path is “good enough,” operational excellence shifts to controlling the geometry of work as it enters, waits, batches, moves, and exits.
Large-model inference will keep looking like a compute story from a distance. Up close, it is a queueing story shaped by memory, transport, and policy. The winning systems will not simply own faster chips. They will own more stable queues.
References and further reading
- John D. C. Little, “A Proof for the Queueing Formula: L = λW.”
- Mor Harchol-Balter, Performance Modeling and Design of Computer Systems.
- Leonard Kleinrock, Queueing Systems, Vols. 1 and 2.
- vLLM and PagedAttention papers and engineering materials for LLM serving behavior.
- NVIDIA TensorRT-LLM, Triton Inference Server, and CUDA performance guidance for batching and serving.
- Public engineering writing on prefill/decode disaggregation, KV transfer, and long-context inference serving.
- Classical work on tail latency, datacenter queueing, and service-time variability in distributed systems.
- Dean and Barroso, “The Tail at Scale,” for why small variance matters so much in large interactive systems.
- Production guidance from hyperscaler and inference-runtime teams on admission control, retry storms, and latency SLO design.