Inference Systems · Memory Architecture · Batch Scheduling

Inference Batch Geometry

Published · 7 min read

Why the shape of a batch — the distribution of sequence lengths, generation phases, and quantization formats within it — determines its memory cost more than its size.

By Manish KL · Technical Essay
Abstract

The MLSys community treats batch size as a scalar — "batch 32" or "batch 64." But for the memory system, a batch is a structured object with geometric properties: length distribution, phase composition (prefill vs. decode), quantization heterogeneity, and attention-head occupancy. Two batches of identical size can differ by 10× in KV-cache footprint, HBM bandwidth demand, and eviction pressure. This essay decomposes batch geometry into its memory-system implications and argues that geometry-aware scheduling — not just size-aware scheduling — is the next leverage point for inference efficiency.

- 10×: memory cost difference between two batch-32 configurations
- 4: geometric dimensions (length, phase, quantization, head occupancy)
- 83%: share of KV footprint determined by the top 10% of sequences by length
- 3.2×: bandwidth demand ratio between prefill-heavy and decode-heavy batches

Batch size is a number. Batch geometry is a distribution.

When a serving system reports "batch size = 32," it communicates one scalar. But the memory system experiences something far richer: 32 requests, each with a sequence length, a generation phase, a KV-cache quantization format, and an attention-head configuration. The combination of these properties — the geometry of the batch — determines almost everything about the batch's actual resource cost.

Consider two concrete batch-32 configurations on an H100 serving a 7B parameter model:

Figure 1 (data). Same batch size, different geometry: both are "batch 32," but the memory system sees two fundamentally different objects.
Batch A (homogeneous decode): 32 requests × 4K tokens, all in decode phase, fp16 KV. KV footprint 4.0 GB; bandwidth demand 1.2 TB/s; phase mix 100% decode; eviction pressure low.
Batch B (heterogeneous mixed): 3 × 128K-token prefill (fp16) plus 29 × 512-token decode, still batch 32. KV footprint 38.4 GB; bandwidth demand 3.8 TB/s (saturated); phase mix 9% prefill / 91% decode; eviction pressure extreme.
Figure 1. Two "batch 32" configurations with the same batch count but radically different memory profiles. Batch B uses 9.6× more KV memory and 3.2× more HBM bandwidth than Batch A. The number 32 tells you almost nothing about the actual resource cost.
Batch size is a number. Batch geometry is a distribution. The memory system doesn't see the number — it sees the distribution, and the two can disagree by 10×.

The four dimensions of batch geometry

A batch's memory cost is determined by four geometric properties that compose multiplicatively:

| Dimension | What It Controls | Memory Impact |
| --- | --- | --- |
| Length distribution | Sequence lengths across the batch | KV-cache footprint scales linearly with total tokens. A few long sequences dominate the sum. |
| Phase composition | Fraction of requests in prefill vs. decode | Prefill reads the full prompt KV and generates compute-heavy attention matrices. Decode touches all existing KV pages per token. Mixed-phase batches create contention. |
| Quantization heterogeneity | KV-cache format per request | fp16 KV pages are 4× larger than int4. A batch mixing both formats creates fragmentation in the page allocator. |
| Head occupancy | GQA/MQA head sharing across requests | Grouped-query attention means KV pages serve multiple query heads. Head occupancy affects the ratio of KV memory to compute: more sharing means less memory per FLOP. |

Length distribution: the dominant dimension

Length distribution alone accounts for most of the variance in batch memory cost. The total KV-cache footprint of a batch is proportional to the sum of sequence lengths, not the count. For a batch of n requests with lengths L1, …, Ln, the footprint is:

KV footprint ≈ Σᵢ Lᵢ × n_layers × n_kv_heads × head_dim × 2 × bytes_per_element, where the factor of 2 accounts for the separate K and V tensors.

For a 7B model with 32 layers, 8 KV heads, head_dim=128 at fp16: each token costs ~128 KB of KV memory. A 128K request alone consumes ~16 GB.
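As a sanity check on that arithmetic, a few lines of Python with the model constants taken from the 7B configuration above:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # The factor of 2 accounts for the separate K and V tensors.
    return n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

# 7B config: 32 layers, 8 KV heads, head_dim=128, fp16 (2 bytes/element)
per_tok = kv_bytes_per_token(32, 8, 128, 2)
print(per_tok // 1024)                # → 128 (KiB of KV memory per token)
print(128 * 1024 * per_tok // 2**30)  # → 16 (GiB for a single 128K-token request)
```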

This means the variance and skewness of the length distribution matter far more than the mean. A batch with high length variance (a few whales plus many short requests) carries a dramatically higher memory cost than a same-size batch whose lengths cluster near the median, because the whales contribute nearly all of the token sum. The coefficient of variation of the length distribution is therefore a better predictor of memory pressure than batch size.
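To see the effect numerically, compare two hypothetical batch-32 length distributions, one uniform and one whale-dominated (the 131,072 bytes/token constant is the 128 KiB fp16 figure derived above):

```python
def footprint_gib_and_cv(lengths, bytes_per_token=131_072):
    """Return (KV footprint in GiB, coefficient of variation of lengths)."""
    total = sum(lengths)
    mean = total / len(lengths)
    std = (sum((l - mean) ** 2 for l in lengths) / len(lengths)) ** 0.5
    return total * bytes_per_token / 2**30, std / mean

uniform = [4096] * 32                    # 32 requests clustered at 4K tokens
skewed = [128 * 1024] * 3 + [512] * 29   # 3 whales + 29 short, still batch 32

fp_u, cv_u = footprint_gib_and_cv(uniform)
fp_s, cv_s = footprint_gib_and_cv(skewed)
```

At the same batch size, the skewed batch holds roughly 3× the tokens of the uniform one, and its coefficient of variation (≈3.0 versus 0.0) flags the pressure before any footprint is computed.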

Phase composition: compute vs. bandwidth

Prefill and decode have opposite resource profiles. Prefill is compute-bound: it processes the full prompt in one pass, limited by FLOPS. Decode is memory-bound: it generates one token at a time, reading the entire KV cache per step, limited by HBM bandwidth.

When both phases coexist in the same batch — a common occurrence in continuous batching — they compete for different resources simultaneously. The prefill requests saturate the compute units while the decode requests saturate HBM bandwidth. Neither can run at full efficiency because the other is contending for the shared memory controller.

This is the fundamental argument for prefill-decode disaggregation. But even without disaggregation, phase-aware batch formation — grouping prefill requests and decode requests into separate micro-batches — can reduce contention. The key insight is that a batch's phase composition is a geometric property that determines its resource contention profile, not just its throughput.
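A minimal sketch of phase-aware micro-batch formation, assuming each request carries a `phase` tag (the dict shape is illustrative, not any real scheduler's API):

```python
def split_by_phase(requests):
    """Partition a mixed continuous batch into phase-homogeneous
    micro-batches, so each micro-batch contends for a single resource
    class: prefill for compute, decode for HBM bandwidth."""
    prefill = [r for r in requests if r["phase"] == "prefill"]
    decode = [r for r in requests if r["phase"] == "decode"]
    return [mb for mb in (prefill, decode) if mb]

mixed = [{"phase": "prefill"}, {"phase": "decode"}, {"phase": "decode"}]
micro_batches = split_by_phase(mixed)  # one prefill micro-batch, one decode
```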

Quantization heterogeneity: the fragmentation tax

Modern serving systems support multiple KV-cache quantization formats: fp16, fp8, int4, even int2 for aggressive compression. When different requests within the same batch use different formats — because the serving system supports per-request quality tiers, or because long-context requests are quantized more aggressively to fit — the page allocator must manage heterogeneous-sized pages.

A paged attention system with a fixed page size encounters fragmentation when KV data sizes vary per request. An fp16 KV page holds the same number of tokens but consumes 4× the storage of an int4 page. If the allocator uses a single page pool with a fixed page size (sized for the largest format), int4 requests waste 75% of each page. If it uses multiple page pools, cross-pool fragmentation introduces complexity.
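A toy model makes the tax concrete. Assume a single pool whose page size is fixed for the largest format (fp16) and whose pages hold a fixed token count regardless of format; all constants here are illustrative:

```python
# One page sized for fp16: 64 tokens × 2048 bytes/token (toy values).
PAGE_BYTES = 64 * 2048
BYTES_PER_TOKEN = {"fp16": 2048, "fp8": 1024, "int4": 512}

def wasted_fraction(fmt):
    """Fraction of a fixed-size page left unused when it stores
    64 tokens of the given KV format."""
    return 1 - (64 * BYTES_PER_TOKEN[fmt]) / PAGE_BYTES

print(wasted_fraction("fp16"))  # → 0.0
print(wasted_fraction("int4"))  # → 0.75
```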

Why this matters for memory-system design

If batch geometry determines memory cost, then the memory system should reason about geometry — not just batch count. The practical implications split into three design areas:

Geometry-aware admission control

Admission control should not just check "is there a scheduling slot?" — it should estimate the incoming batch's geometric properties and admit requests that maintain a target geometry. Specifically: capping the length variance within a batch, limiting the number of concurrent prefill requests, and ensuring quantization-format homogeneity when possible.
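One way to phrase that as code: an admission predicate over the candidate plus the current batch, with illustrative, untuned thresholds:

```python
def admit(candidate, batch, max_cv=1.5, max_prefill=4):
    """Admit `candidate` only if the resulting batch keeps a target
    geometry: bounded length dispersion and a cap on concurrent
    prefill requests. Requests are {"seq_len": int, "phase": str}."""
    lengths = [r["seq_len"] for r in batch] + [candidate["seq_len"]]
    mean = sum(lengths) / len(lengths)
    std = (sum((l - mean) ** 2 for l in lengths) / len(lengths)) ** 0.5
    if std / mean > max_cv:
        return False  # admitting this request would blow the variance cap
    prefills = sum(r["phase"] == "prefill" for r in batch)
    if candidate["phase"] == "prefill" and prefills >= max_prefill:
        return False  # too many concurrent prefill requests already
    return True
```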

Geometry-aware eviction policy

When the KV pool is full, the eviction policy should reason about batch geometry. Evicting a page from a long-context request has a different geometric impact than evicting a page from a short request: the long request can absorb the miss (it has thousands of remaining pages) while the short request may lose a significant fraction of its working set. Geometry-aware eviction would weight the eviction cost by the request's length fraction of the batch.
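A sketch of that weighting, in which the cost of evicting one page scales with the fraction of the victim's working set it represents (the function and its inputs are hypothetical):

```python
def eviction_cost(request_pages, batch_total_pages):
    """Relative cost of evicting one KV page from a request. One page
    is a large fraction of a short request's working set but negligible
    for a whale, so the policy prefers whale pages."""
    working_set_loss = 1.0 / request_pages           # fraction of this request's KV lost
    batch_share = request_pages / batch_total_pages  # whales can absorb the miss
    return working_set_loss / batch_share

# In a 2100-page pool, evicting from a 2000-page whale is far cheaper
# than evicting from an 8-page short request.
print(eviction_cost(2000, 2100) < eviction_cost(8, 2100))  # → True
```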

Geometry metrics in telemetry

Serving stacks should export batch geometry as a first-class telemetry signal — not just "batch size" and "total tokens," but length distribution (coefficient of variation, max/min ratio), phase composition (prefill fraction), and quantization mix. These signals predict memory pressure far better than scalar batch size.
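Those signals reduce to a handful of summary statistics per scheduled batch; a sketch with illustrative metric names:

```python
def geometry_telemetry(lengths, phases, formats):
    """Summarize a scheduled batch's geometry as scalar signals,
    exported alongside the usual batch-size counter."""
    n = len(lengths)
    mean = sum(lengths) / n
    std = (sum((l - mean) ** 2 for l in lengths) / n) ** 0.5
    return {
        "batch_size": n,
        "total_tokens": sum(lengths),
        "length_cv": std / mean,
        "length_max_min_ratio": max(lengths) / min(lengths),
        "prefill_fraction": phases.count("prefill") / n,
        "quant_mix": len(set(formats)),  # >1 means heterogeneous formats
    }
```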

The scalar myth: Batch size is the most widely reported metric in inference benchmarks. It is also one of the least informative for memory-system reasoning. A benchmark that reports "batch 32" without specifying the length distribution, phase composition, and quantization format is reporting less than 10% of the information needed to predict memory behavior.

Where this leads

Batch geometry is not just an observation — it is a design lever. Once the serving system treats batch geometry as a structured, measurable, and controllable property, several new optimizations become possible: geometry-constrained scheduling that maintains target variance bounds, predictive memory reservation based on estimated geometry, and geometry-aware SLO enforcement that attributes latency to geometric causes rather than scalar overload.

The broader implication is that the transition from "batch size" to "batch geometry" mirrors a deeper shift in inference systems engineering: from treating memory as a passive resource to treating it as an active design surface with structural properties that compound through the system.

The question is not "how big is the batch?" The question is "what shape is the batch?" — because the memory system answers the second question whether you ask it or not.