Batch size is a number. Batch geometry is a distribution.
When a serving system reports "batch size = 32," it communicates one scalar. But the memory system experiences something far richer: 32 requests, each with a sequence length, a generation phase, a KV-cache quantization format, and an attention-head configuration. The combination of these properties — the geometry of the batch — determines almost everything about the batch's actual resource cost.
Consider two concrete batch-32 configurations on an H100 serving a 7B parameter model: one holds 32 uniformly short decode requests; the other holds a single long-context whale plus 31 short requests.
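A back-of-envelope version of that comparison can be computed directly. The batch shapes below (32 uniform 512-token requests vs. one 128K-token whale plus 31 short ones) are illustrative assumptions; the per-token KV cost follows from the 7B model shape used throughout this piece (32 layers, 8 KV heads, head_dim = 128, fp16):

```python
# Illustrative comparison of two hypothetical batch-32 geometries.
# Model shape assumed: 32 layers, 8 KV heads, head_dim=128, fp16 KV cache.

KV_BYTES_PER_TOKEN = 32 * 8 * 128 * 2 * 2  # layers * kv_heads * head_dim * (K+V) * fp16 bytes

def kv_footprint_gib(lengths):
    """Total KV-cache footprint of a batch, in GiB."""
    return sum(lengths) * KV_BYTES_PER_TOKEN / 2**30

uniform = [512] * 32                   # 32 short decode requests
skewed = [128 * 1024] + [512] * 31     # one 128K "whale" + 31 short requests

print(kv_footprint_gib(uniform))  # 2.0 GiB
print(kv_footprint_gib(skewed))   # ~17.9 GiB: same "batch size = 32", ~9x the memory
```

Both batches report the same scalar to the scheduler; the allocator sees nearly an order of magnitude of difference.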
Batch size is a number. Batch geometry is a distribution. The memory system doesn't see the number — it sees the distribution, and the two can disagree by 10×.
## The four dimensions of batch geometry
A batch's memory cost is determined by four geometric properties that compose multiplicatively:
| Dimension | What It Controls | Memory Impact |
|---|---|---|
| Length distribution | Sequence lengths across the batch | KV-cache footprint scales linearly with total tokens. A few long sequences dominate the sum. |
| Phase composition | Fraction of requests in prefill vs. decode | Prefill processes the full prompt in one compute-heavy pass and writes its KV. Decode reads all existing KV pages per generated token. Mixed-phase batches create contention. |
| Quantization heterogeneity | KV-cache format per request | fp16 KV pages are 4× larger than int4. A batch mixing both formats creates fragmentation in the page allocator. |
| Head occupancy | GQA/MQA head sharing across requests | Grouped-query attention means KV pages serve multiple query heads. Head occupancy affects the ratio of KV memory to compute — more sharing = less memory per FLOP. |
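The four dimensions above can be carried as one structured record instead of a lone scalar. A minimal sketch; the field names and shape are hypothetical, not any serving stack's actual API:

```python
from dataclasses import dataclass

@dataclass
class BatchGeometry:
    """Hypothetical structured summary of a batch's geometry."""
    lengths: list[int]              # per-request sequence lengths (length distribution)
    prefill_fraction: float         # share of requests in prefill (phase composition)
    kv_formats: list[str]           # per-request KV-cache dtype, e.g. "fp16", "int4"
    kv_heads_per_query_head: float  # GQA sharing ratio (head occupancy)

    @property
    def total_tokens(self) -> int:
        return sum(self.lengths)

    @property
    def batch_size(self) -> int:
        return len(self.lengths)  # the one scalar most systems report today
```

Everything a scheduler exports today (`batch_size`, `total_tokens`) is derivable from this record; the reverse is not true.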
## Length distribution: the dominant dimension
Length distribution alone accounts for most of the variance in batch memory cost. The total KV-cache footprint of a batch is proportional to the sum of sequence lengths, not the count. For a batch of n requests with lengths L1, …, Ln, the footprint is:
Σ Li × n_layers × n_kv_heads × head_dim × 2 × bytes_per_element

For a 7B model with 32 layers, 8 KV heads, and head_dim = 128 at fp16, each token costs ~128 KB of KV memory. A single 128K-token request alone consumes ~16 GB.
This means the variance and skewness of the length distribution matter far more than the mean. Because the footprint tracks the sum of lengths, two batches of the same size can differ enormously in cost: a batch with high length variance (a few whales + many short requests) carries far more total tokens, and therefore far more KV memory, than a same-sized batch of uniformly short requests. The coefficient of variation of the length distribution is a better predictor of memory pressure than batch size.
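A small sketch makes this concrete: two batches of identical size, one uniform and one skewed. The batch shapes and the ~128 KB/token figure are illustrative assumptions from the 7B model shape above:

```python
import statistics

KV_BYTES_PER_TOKEN = 128 * 1024  # ~128 KB/token for the 7B shape discussed above

def geometry_stats(lengths):
    """Coefficient of variation of the length distribution and KV footprint (GiB)."""
    cv = statistics.pstdev(lengths) / statistics.mean(lengths)
    gib = sum(lengths) * KV_BYTES_PER_TOKEN / 2**30
    return cv, gib

# Two batches with identical batch size (32); only the length distribution differs.
uniform = [2048] * 32
skewed = [65536] * 2 + [512] * 30  # two whales, thirty short requests

print(geometry_stats(uniform))  # CV = 0.0, 8 GiB
print(geometry_stats(skewed))   # high CV, more than twice the footprint
```

Batch size alone cannot distinguish these two workloads; the CV separates them immediately.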
## Phase composition: compute vs. bandwidth
Prefill and decode have opposite resource profiles. Prefill is compute-bound: it processes the full prompt in one pass, limited by FLOPS. Decode is memory-bound: it generates one token at a time, reading the entire KV cache per step, limited by HBM bandwidth.
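A back-of-envelope arithmetic-intensity calculation shows why the two phases land on opposite sides of the roofline. The numbers are rule-of-thumb assumptions (fp16 weights read once per pass, ~2 FLOPs per parameter per token, weight traffic dominating), not measurements:

```python
PARAMS = 7e9
WEIGHT_BYTES = PARAMS * 2     # fp16 weights, read once per forward pass
FLOPS_PER_TOKEN = 2 * PARAMS  # ~2 FLOPs per parameter per token (matmul rule of thumb)

def arithmetic_intensity(tokens_per_pass):
    """FLOPs per byte of weight traffic for one pass over `tokens_per_pass` tokens."""
    return (FLOPS_PER_TOKEN * tokens_per_pass) / WEIGHT_BYTES

print(arithmetic_intensity(1))     # decode: ~1 FLOP/byte, deep in memory-bound territory
print(arithmetic_intensity(4096))  # prefill over a 4K prompt: thousands of FLOPs/byte
```

Decode's intensity of roughly one FLOP per byte sits far below any modern accelerator's compute/bandwidth ratio, while a multi-thousand-token prefill sits far above it; the same model, in the same batch, stresses opposite resources.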
When both phases coexist in the same batch — a common occurrence in continuous batching — they compete for different resources simultaneously. The prefill requests saturate the compute units while the decode requests saturate HBM bandwidth. Neither can run at full efficiency because the other is contending for the shared memory controller.
This is the fundamental argument for prefill-decode disaggregation. But even without disaggregation, phase-aware batch formation — grouping prefill requests and decode requests into separate micro-batches — can reduce contention. The key insight is that a batch's phase composition is a geometric property that determines its resource contention profile, not just its throughput.
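Phase-aware batch formation can be as simple as partitioning the step's requests before launch. A minimal sketch, assuming each request is a dict with a `phase` key (a hypothetical shape, not any scheduler's real data model):

```python
def split_by_phase(requests):
    """Split a continuous-batching step into prefill and decode micro-batches.

    Assumes each request is a dict with a "phase" key; this shape is
    hypothetical, not any particular scheduler's data model.
    """
    prefill = [r for r in requests if r["phase"] == "prefill"]
    decode = [r for r in requests if r["phase"] == "decode"]
    # The compute-bound group and the bandwidth-bound group now run as
    # separate micro-batches, so neither contends with the other for the
    # shared memory controller.
    return prefill, decode
```

Real schedulers add ordering and deadline logic on top, but the core move is just this partition.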
## Quantization heterogeneity: the fragmentation tax
Modern serving systems support multiple KV-cache quantization formats: fp16, fp8, int4, even int2 for aggressive compression. When different requests within the same batch use different formats — because the serving system supports per-request quality tiers, or because long-context requests are quantized more aggressively to fit — the page allocator must manage heterogeneous-sized pages.
A paged attention system with a fixed page size encounters fragmentation when KV data sizes vary per request. An fp16 KV page holds the same number of tokens but consumes 4× the storage of an int4 page. If the allocator uses a single page pool with a fixed page size (sized for the largest format), int4 requests waste 75% of each page. If it uses one pool per format, memory stranded in an underused pool cannot serve requests waiting on another: the fragmentation moves across pools instead of disappearing.
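The 75% figure falls straight out of the page arithmetic. A sketch, assuming a single pool with pages sized for fp16; the page size and per-token byte counts are illustrative, taken from the 7B shape used earlier:

```python
# Bytes per token of KV data under each format, for the 7B shape used earlier
# (32 layers, 8 KV heads, head_dim=128); values are illustrative.
BYTES_PER_TOKEN = {"fp16": 131072, "fp8": 65536, "int4": 32768}

PAGE_TOKENS = 16                                    # tokens per page (fixed)
PAGE_BYTES = PAGE_TOKENS * BYTES_PER_TOKEN["fp16"]  # pool sized for the largest format

def wasted_fraction(fmt):
    """Fraction of a fixed-size page left unused when it stores `fmt` KV data."""
    used = PAGE_TOKENS * BYTES_PER_TOKEN[fmt]
    return 1 - used / PAGE_BYTES

print(wasted_fraction("int4"))  # 0.75: the 75% waste described above
print(wasted_fraction("fp8"))   # 0.5
```

Every format below the pool's sizing format pays this tax on every page it holds.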
## Why this matters for memory-system design
If batch geometry determines memory cost, then the memory system should reason about geometry — not just batch count. The practical implications split into three design areas:
### Geometry-aware admission control
Admission control should not just check "is there a scheduling slot?" — it should estimate the incoming batch's geometric properties and admit requests that maintain a target geometry. Specifically: capping the length variance within a batch, limiting the number of concurrent prefill requests, and ensuring quantization-format homogeneity when possible.
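One of those checks, capping the length variance, can be sketched as an admission predicate. The `max_cv` threshold is an assumed tuning knob; a real policy would also enforce the prefill cap and format homogeneity:

```python
import statistics

def admit(batch_lengths, new_length, max_cv=2.0):
    """Admit a request only if the batch's length CV stays under a target.

    A minimal sketch of one check from the policy described above; `max_cv`
    is an assumed tuning knob, not a recommended value.
    """
    candidate = batch_lengths + [new_length]
    cv = statistics.pstdev(candidate) / statistics.mean(candidate)
    return cv <= max_cv

batch = [512] * 16
print(admit(batch, 1024))        # True: barely moves the distribution
print(admit(batch, 128 * 1024))  # False: a whale would blow past the variance cap
```

The rejected whale is not dropped; it waits for a batch whose geometry can absorb it, or seeds a new one.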
### Geometry-aware eviction policy
When the KV pool is full, the eviction policy should reason about batch geometry. Evicting a page from a long-context request has a different geometric impact than evicting a page from a short request: the long request can absorb the miss (it has thousands of remaining pages) while the short request may lose a significant fraction of its working set. Geometry-aware eviction would weight the eviction cost by the request's length fraction of the batch.
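A minimal sketch of that weighting: score each request by the fraction of its resident working set that one page represents, and evict from the request best able to absorb the miss. The interface is hypothetical:

```python
def pick_victim(requests):
    """Geometry-aware eviction sketch: evict a page from the request for which
    one page is the smallest fraction of its resident working set.

    `requests` maps request id -> number of resident KV pages; this is a
    hypothetical interface, not a real allocator's API.
    """
    # Cost of losing one page = the fraction of the working set it represents.
    cost = {rid: 1 / pages for rid, pages in requests.items()}
    return min(cost, key=cost.get)

print(pick_victim({"whale": 8192, "short": 4}))  # "whale": it can absorb the miss
```

A production policy would blend this geometric cost with recency and SLO signals, but the length-aware term is what plain LRU is missing.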
### Geometry metrics in telemetry
Serving stacks should export batch geometry as a first-class telemetry signal — not just "batch size" and "total tokens," but length distribution (coefficient of variation, max/min ratio), phase composition (prefill fraction), and quantization mix. These signals predict memory pressure far better than scalar batch size.
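A sketch of such a snapshot, emitting the signals listed above as a flat metrics dict; the field names are illustrative, not any stack's schema:

```python
import statistics

def geometry_snapshot(lengths, phases, kv_formats):
    """Export batch geometry signals as a flat metrics dict.

    Field names are illustrative; `phases` and `kv_formats` are per-request
    lists aligned with `lengths`.
    """
    return {
        "batch_size": len(lengths),
        "total_tokens": sum(lengths),
        "length_cv": statistics.pstdev(lengths) / statistics.mean(lengths),
        "length_max_min_ratio": max(lengths) / min(lengths),
        "prefill_fraction": phases.count("prefill") / len(phases),
        "kv_format_mix": {f: kv_formats.count(f) / len(kv_formats)
                          for f in set(kv_formats)},
    }
```

Emitted per scheduling step, these few numbers let an operator see a whale-heavy or prefill-heavy batch forming before the allocator feels it.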
## Where this leads
Batch geometry is not just an observation — it is a design lever. Once the serving system treats batch geometry as a structured, measurable, and controllable property, several new optimizations become possible: geometry-constrained scheduling that maintains target variance bounds, predictive memory reservation based on estimated geometry, and geometry-aware SLO enforcement that attributes latency to geometric causes rather than scalar overload.
The broader implication is that the transition from "batch size" to "batch geometry" mirrors a deeper shift in inference systems engineering: from treating memory as a passive resource to treating it as an active design surface with structural properties that compound through the system.
The question is not "how big is the batch?" The question is "what shape is the batch?" — because the memory system answers the second question whether you ask it or not.