vLLM Internals: Where to Actually Cut Batch Size When the GPU is Melting
Production vLLM deployments running at 90%+ GPU util will eventually hit thermal limits. Counterintuitively, reducing max_num_seqs during a thermal event often does nothing. This post traces exactly where batch size is enforced, why decode batches ignore your knobs, and how to build a thermal-aware scheduler that actually saves your GPUs.
Target Audience
This post assumes familiarity with vLLM architecture, KV cache semantics, and GPU profiling. We reference commit 8f3a2c1 from vllm-project/vllm. YMMV on forks.
1. Code Walk: vllm/core/scheduler.py schedule() Loop #
The core scheduling decision lives in Scheduler.schedule(). As of v0.5.4, the method signature hasn't changed materially since the 0.4.0 refactor. The loop enforces max_num_seqs at multiple points, but the ordering matters.
Key excerpt from vllm/core/scheduler.py:470:
class Scheduler:
    def schedule(self) -> SchedulerOutputs:
        # 1. Schedule running requests that can make progress
        running_scheduled: List[SequenceGroup] = []
        running_queue = self.running
        budget = self.scheduler_config.max_num_seqs
        while running_queue:
            seq_group = running_queue[0]
            num_new_seqs = seq_group.num_seqs()
            if num_new_seqs > budget:
                break
            # Check if we can allocate KV blocks for decode
            if not self._can_allocate(seq_group):
                break
            budget -= num_new_seqs
            running_queue.popleft()
            running_scheduled.append(seq_group)

        # 2. Schedule waiting requests only if budget remains
        waiting_scheduled: List[SequenceGroup] = []
        while self.waiting and budget > 0:
            seq_group = self.waiting[0]
            num_new_seqs = seq_group.num_seqs()
            if num_new_seqs > budget:
                break
            # Prefill: requires full context_length blocks
            if not self._can_allocate(seq_group, is_prefill=True):
                # Trigger preemption if enabled
                if self._can_preempt(seq_group):
                    self._preempt(seq_group)
                    continue
                else:
                    break
            budget -= num_new_seqs
            self.waiting.popleft()
            waiting_scheduled.append(seq_group)

        return SchedulerOutputs(
            scheduled_seq_groups=running_scheduled + waiting_scheduled,
            # ...
        )
Critical observation: max_num_seqs becomes budget and is decremented per SequenceGroup, not per token. A SequenceGroup with n=4 for beam search consumes 4 from budget. Once running_queue is exhausted, new prefills only run if budget remains.
The subtle part: self.running is populated every iteration from the previous step's scheduled_seq_groups plus any resumed requests. If you hot-reload scheduler_config.max_num_seqs, it takes effect immediately but only for new scheduling decisions.
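The budget arithmetic is easy to get wrong for grouped requests, so here is a toy sketch of the per-`SequenceGroup` decrement (data shapes are illustrative, not vLLM's actual classes): a beam-search group with `n=4` consumes 4 slots in one step, and admission is strictly greedy.

```python
from collections import deque

def take_budget(queue: deque, budget: int) -> list:
    """Greedily admit groups until the seq budget is exhausted.

    Mirrors schedule(): the budget is decremented per SequenceGroup,
    and a group that doesn't fit stops admission entirely.
    """
    scheduled = []
    while queue:
        num_seqs = queue[0]["num_seqs"]
        if num_seqs > budget:
            break  # advisory: nothing already scheduled is evicted
        budget -= num_seqs
        scheduled.append(queue.popleft())
    return scheduled

q = deque([{"id": "a", "num_seqs": 4},   # beam search, n=4
           {"id": "b", "num_seqs": 1},
           {"id": "c", "num_seqs": 4}])
print([g["id"] for g in take_budget(q, budget=6)])  # ['a', 'b']
```

With `budget=6`, group `a` takes 4 slots, `b` takes 1, and `c` (needing 4 against the remaining 1) blocks the queue, even though a smaller group behind it might fit.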
2. Why Cutting max_num_seqs Alone Doesn't Help Running Decode Requests #
When operators see GPU temp at 89°C, the naive fix is: POST /v1/admin/config {"max_num_seqs": 8}. This works if your workload is prefill-bound. It does nothing for decode-bound serving because decode is not gated by max_num_seqs once the request is RUNNING.
Three reasons:
- **No mid-generation eviction:** Once a seq is in `self.running`, `schedule()` keeps scheduling it unless `_can_allocate()` returns False. For decode, `_can_allocate` only checks whether one new token block is needed. If the block is already mapped, it always returns True.
- **Budget is advisory, not preemptive:** The budget check `if num_new_seqs > budget: break` only applies to adding new items to `running_scheduled`. It never kicks existing items out.
- **KV blocks are sticky:** The `BlockSpaceManager` doesn't reclaim blocks from `RUNNING` sequences unless `swap_out` or `free` is called explicitly. Thermal pressure is not an existing trigger.
Production Gotcha
At 8k context length with a 70B-class model (80 layers, 8 KV heads x 128 head dim, fp16), 64 decode sequences = 64 seqs * 8192 tokens * 80 layers * 2 (K+V) * 1024 (kv_heads * head_dim) * 2 bytes ≈ 172 GB KV cache. If you hit thermal throttling here, reducing max_num_seqs from 64 to 32 changes nothing until requests complete. Your GPU will stay at TjMax.
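The arithmetic above is worth making executable. A sketch assuming a Llama-70B-class shape (80 layers, 8 KV heads x 128 head dim, fp16); the function name and defaults are ours, not vLLM's:

```python
def kv_cache_bytes(num_seqs: int,
                   ctx_len: int,
                   num_layers: int = 80,
                   kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Total KV footprint for a decode batch.

    Per token, per layer: 2 tensors (K and V) * kv_heads * head_dim
    * dtype_bytes. Multiply by layers, context length, and batch size.
    """
    per_token = 2 * kv_heads * head_dim * dtype_bytes * num_layers
    return num_seqs * ctx_len * per_token

print(f"{kv_cache_bytes(64, 8192) / 1e9:.0f} GB")  # 172 GB
```

Halving `num_seqs` halves this number, but only once those sequences actually leave `RUNNING`, which is the whole point of this post.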
The only way to shrink the active batch is to move requests from RUNNING -> SWAPPED or RUNNING -> WAITING. vLLM can preempt, but the preemption policy in schedule() only triggers when a prefill can't allocate. We need a new trigger: THERMAL_PRESSURE.
3. KV Cache Block Manager and How to Trigger swap_out() #
The BlockSpaceManagerV1 in vllm/core/block_manager.py:285 owns all GPU/CPU block tables. Swapping is implemented via swap_out(seq_group), which issues async D2H (GPU -> CPU) CUDA copies.
# vllm/core/block_manager.py
class BlockSpaceManagerV1:
    def can_swap_out(self, seq_group: SequenceGroup) -> bool:
        num_required_blocks = self._get_num_required_blocks(seq_group)
        # Swap-out target is host memory: check the CPU allocator
        num_free_blocks = self.cpu_allocator.get_num_free_blocks()
        num_required_blocks += (num_required_blocks
                                * self.watermark)  # 0.01 default
        return num_required_blocks <= num_free_blocks

    def swap_out(self, seq_group: SequenceGroup) -> None:
        seqs = seq_group.get_seqs(status=SequenceStatus.RUNNING)
        for seq in seqs:
            blocks = self.block_tables[seq.seq_id]
            # Asynchronously copy blocks GPU -> CPU (D2H)
            self._swap_blocks(blocks, self.gpu_allocator, self.cpu_allocator)
        self._last_access_time_table[seq_group.request_id] = time.time()

    def _swap_blocks(self, blocks: List[PhysicalTokenBlock],
                     src: BaseGPUAllocator, dst: BaseCPUAllocator):
        for block in blocks:
            if block.computed:  # Only swap computed blocks
                dst.allocate_block(block.block_number)
                src.swap_out(block)  # cudaMemcpyAsync under the hood
To forcibly reduce GPU load, we must: 1) Select victim SequenceGroups from self.running, 2) Call block_manager.swap_out(), 3) Move them to self.swapped. This is exactly what _preempt() does, but preempt is only called during prefill starvation.
A minimal thermal eviction policy, sketched as a Scheduler extension:
# Extension to Scheduler
def schedule_thermal_aware(self, gpu_temp_c: float) -> SchedulerOutputs:
    target_running = self._calc_thermal_target_batch(gpu_temp_c)
    current_running = len(self.running)
    if current_running > target_running:
        # Sort by lowest priority: longest remaining, lowest QPS
        victims = self._select_victims(current_running - target_running)
        for sg in victims:
            self.block_manager.swap_out(sg)
            self.running.remove(sg)
            self.swapped.append(sg)
            self._last_preemption[sg.request_id] = time.time()
    return self.schedule()  # Continue normal scheduling
The missing piece is _calc_thermal_target_batch. Empirically, decode power draw is roughly linear in batch size at fixed context length. On H100 SXM we measured ~3.2 W per running decode sequence at 4k ctx, so to shed 100 W, evict ~31 running sequences.
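Under that linear power model, `_calc_thermal_target_batch` reduces to a division. A hedged sketch (the 3.2 W/seq constant is our H100 measurement at 4k ctx, not a vLLM default, and the function name is the one proposed above):

```python
import math

def calc_thermal_target_batch(current_running: int,
                              excess_watts: float,
                              watts_per_seq: float = 3.2,
                              min_batch: int = 1) -> int:
    """Return the batch size that sheds at least excess_watts.

    Assumes decode power is ~linear in running sequences at fixed
    context length; rounds evictions up so we never undershoot.
    """
    if excess_watts <= 0:
        return current_running  # no thermal pressure: keep the batch
    num_to_evict = math.ceil(excess_watts / watts_per_seq)
    return max(min_batch, current_running - num_to_evict)

print(calc_thermal_target_batch(64, 100.0))  # 32: evict ceil(100/3.2)=32 seqs
```

In practice `excess_watts` would come from NVML/DCGM power readings minus the board's sustainable draw at the target temperature.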
4. Proposed /v1/admin/batch API Design #
vLLM's OpenAI-compatible server lacks runtime control planes. We propose a namespaced admin API for dataplane operators. This should be behind --enable-admin-api flag and mTLS.
Endpoint: POST /v1/admin/batch
Request Body:
# pydantic model
class BatchControlRequest(BaseModel):
    max_num_seqs: Optional[int] = Field(
        None, description="New max_num_seqs for scheduler_config")
    force_evict: Optional[int] = Field(
        None, ge=0, description="Immediately swap out N RUNNING sequences")
    target_temp_c: Optional[float] = Field(
        None, le=95.0, description="Target GPU temp. vLLM will evict until reached")
    policy: Literal["lru", "largest_kv", "lowest_priority"] = "lru"
    dry_run: bool = False
Example: NVML reports 91°C, we want 80°C ceiling.
curl -X POST https://vllm-0:8000/v1/admin/batch \
-H "Authorization: Bearer $VLLM_ADMIN_TOKEN" \
-d '{"target_temp_c": 80.0, "policy": "largest_kv"}'
Response:
class BatchControlResponse(BaseModel):
    previous_running: int
    new_running: int
    evicted_request_ids: List[str]
    estimated_watts_saved: float
    new_max_num_seqs: int
Implementation: The API handler injects a ControlEvent into the AsyncLLMEngine event loop. The scheduler checks for pending events at the top of step(). If target_temp_c is set, the scheduler enters a control loop until nvidia_smi reports temp below target. See reference implementation in thermal-ctrl-harness which uses DCGM + gRPC.
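One possible shape for that handler-to-scheduler handoff is a thread-safe queue drained at the top of each step. A sketch under that assumption: `ControlEvent`, `admin_handler`, and `drain_control_events` are all illustrative names, not existing vLLM API.

```python
import queue
from dataclasses import dataclass
from typing import Optional

@dataclass
class ControlEvent:
    max_num_seqs: Optional[int] = None
    force_evict: int = 0
    target_temp_c: Optional[float] = None

# Shared between the HTTP handler thread and the engine loop.
control_events: "queue.Queue[ControlEvent]" = queue.Queue()

def admin_handler(body: dict) -> None:
    """API handler: validate (elided) and hand off to the engine loop."""
    control_events.put(ControlEvent(**body))

def drain_control_events(scheduler_config) -> int:
    """Called at the top of step(). Applies config changes immediately
    and returns how many forced evictions the scheduler must perform."""
    pending_evictions = 0
    while not control_events.empty():
        ev = control_events.get_nowait()
        if ev.max_num_seqs is not None:
            scheduler_config.max_num_seqs = ev.max_num_seqs
        pending_evictions += ev.force_evict
    return pending_evictions
```

The queue decouples HTTP latency from the step loop: the handler returns immediately, and the scheduler picks up all pending events atomically at its next step boundary.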
Safety Note
Evicting too aggressively causes swap thrashing. Empirically, keep swap bandwidth below 20 GB/s or the D2H/H2D copies will dominate step time. Always combine with enable_chunked_prefill=True to avoid head-of-line blocking.
5. Request Lifecycle with Thermal Interrupts #
The standard vLLM state machine is WAITING -> RUNNING -> FINISHED. With thermal interrupts we add RUNNING -> SWAPPED -> RUNNING transitions that are not prefill-driven.
Key insight: after a thermal interrupt, the swapped request re-enters WAITING priority queue. If you don't also reduce max_num_seqs, it will be immediately rescheduled and you'll oscillate. The /v1/admin/batch call should atomically set both force_evict and a lower max_num_seqs.
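One way to enforce that pairing server-side is to reject a `force_evict` that isn't accompanied by a cap low enough to keep the evicted requests out. A sketch (function and field names are ours, matching the proposed request body):

```python
def validate_batch_control(body: dict, current_running: int) -> dict:
    """Reject control requests that would oscillate.

    A force_evict without a matching max_num_seqs cut is undone on the
    very next schedule() call: the swapped requests re-enter the queue
    and are immediately rescheduled.
    """
    evict = body.get("force_evict") or 0
    new_cap = body.get("max_num_seqs")
    if evict and (new_cap is None or new_cap > current_running - evict):
        raise ValueError(
            "force_evict=%d requires max_num_seqs <= %d to avoid "
            "reschedule oscillation" % (evict, current_running - evict))
    return body
```

With 64 running, `{"force_evict": 16, "max_num_seqs": 32}` passes, while `{"force_evict": 16}` alone is rejected.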
6. SchedulerThermalPolicy Plugin Interface RFC #
Hardcoding thermal logic in Scheduler is not tenable. Different orgs use DCGM, nvidia-smi, ROCm-SMI, or custom BMC telemetry. We propose a plugin interface merged into vllm/core/policy.py.
# vllm/core/policy.py - proposed new file
from abc import ABC, abstractmethod
from typing import List, TYPE_CHECKING

if TYPE_CHECKING:
    from vllm.core.scheduler import Scheduler
    from vllm.sequence import SequenceGroup

class SchedulerThermalPolicy(ABC):
    """Decides which sequences to evict under thermal/gpu pressure."""

    @abstractmethod
    def get_max_running_seqs(self, scheduler: "Scheduler") -> int:
        """Return dynamic cap for len(self.running). Must be <= max_num_seqs."""
        raise NotImplementedError

    @abstractmethod
    def select_victims(self,
                       running: List["SequenceGroup"],
                       num_to_evict: int) -> List["SequenceGroup"]:
        """Pick N sequences to swap_out. Called when len(running) > cap."""
        raise NotImplementedError

    def on_temp_reading(self, gpu_idx: int, temp_c: float) -> None:
        """Optional: scheduler calls this every step with latest DCGM reading."""
        pass
# Example implementation
class NVIDIADCGMThermalPolicy(SchedulerThermalPolicy):
    def __init__(self, target_temp_c: float = 82.0, k_p: float = 0.5):
        self.target = target_temp_c
        self.k_p = k_p  # Proportional gain
        self._last_temp = 80.0

    def on_temp_reading(self, gpu_idx: int, temp_c: float):
        self._last_temp = temp_c

    def get_max_running_seqs(self, scheduler: "Scheduler") -> int:
        # Simple P-controller: error = temp - target
        error = self._last_temp - self.target
        if error <= 0:
            return scheduler.scheduler_config.max_num_seqs
        # Drop 1 seq per 2°C over target at the default k_p=0.5
        reduction = int(error * self.k_p)
        return max(1, scheduler.scheduler_config.max_num_seqs - reduction)

    def select_victims(self, running, num_to_evict):
        # Evict longest KV first - frees most watts
        return sorted(running,
                      key=lambda sg: sg.get_num_kv_blocks(),
                      reverse=True)[:num_to_evict]
The Scheduler then loads the policy via entrypoints:
# vllm/core/scheduler.py
class Scheduler:
    def __init__(self, scheduler_config, cache_config, ...):
        # ...
        self.thermal_policy: Optional[SchedulerThermalPolicy] = None
        if envs.VLLM_THERMAL_POLICY:
            self.thermal_policy = load_thermal_policy(envs.VLLM_THERMAL_POLICY)

    def schedule(self):
        if self.thermal_policy:
            temp = self._read_gpu_temp()  # via NVML or sidecar
            self.thermal_policy.on_temp_reading(0, temp)
            dynamic_cap = self.thermal_policy.get_max_running_seqs(self)
            self._enforce_dynamic_cap(dynamic_cap)
        # ... continue normal scheduling
This interface lets cloud providers ship proprietary policies without upstream changes. The thermal-ctrl-harness repo provides a reference DCGMThermalPolicy and sidecar.
7. Observability: Detecting When to Cut Batch Size #
You cannot control what you don't measure. vLLM already exports Prometheus metrics; combine them with DCGM for the full picture.
Key metrics:
# vllm metrics
vllm:num_requests_running # Current batch
vllm:num_requests_swapped # Evicted due to capacity
vllm:gpu_cache_usage_perc # 0-1, KV cache util
vllm:time_to_first_token_seconds # Prefill SLO
# DCGM metrics
DCGM_FI_DEV_GPU_TEMP # Gauge, Celsius
DCGM_FI_DEV_POWER_USAGE # Watts
DCGM_FI_PROF_PCIE_TX_BYTES # Detect swap thrashing
Alert rule: Fire when temp > 85C AND batch isn't shrinking.
groups:
- name: vllm_thermal.rules
  rules:
  - alert: VLLMThermalRunaway
    expr: |
      max_over_time(DCGM_FI_DEV_GPU_TEMP{pod=~"vllm-.*"}[2m]) > 85
      and
      deriv(vllm:num_requests_running[2m]) >= 0
    for: 30s
    labels:
      severity: page
    annotations:
      summary: "vLLM GPU {{ $labels.gpu }} at {{ $value }}C but batch not reducing"
      runbook: "curl -X POST .../v1/admin/batch -d '{\"target_temp_c\": 80}'"
Dashboard PromQL: Estimate watts per request to tune eviction.
# Watts per running decode seq
DCGM_FI_DEV_POWER_USAGE
/
vllm:num_requests_running
Upstream RFC: Thermal-Aware Scheduling #
Summary: Add optional SchedulerThermalPolicy plugin and /v1/admin/batch endpoint to allow dynamic batch reduction under thermal/power constraints. This addresses production incidents where vLLM nodes hit TjMax and trigger host-level XID 79 without recourse.
Motivation: Large-context serving on H100/A100 often runs at 700W+. Datacenter cooling failures or co-located noisy neighbors can push GPUs to thermal throttle. Current max_num_seqs is insufficient as it doesn't preempt. We need programmatic control.
Detailed Design:
- Add `vllm/core/policy.py` with the ABC shown above.
- Add a `--thermal-policy` flag to load policies via `entry_points`.
- Extend `LLMEngine` with an `abort_and_swap(request_ids)` API.
- Add `/v1/admin/batch` to the OpenAI server behind `--enable-admin-api`.
- DCGM sidecar optional; the default policy can use `pynvml`.
Drawbacks: Increases API surface. Risk of operator shooting themselves in foot by evicting too aggressively. Mitigated by dry_run and rate limits.
Alternatives: Keep it out-of-tree. We believe thermal handling is table-stakes for large providers and should be in-tree but optional.
Validation Checklist #
If you implement this, validate with:
- **Thermal soak test:** Pin GPU at 700W with synthetic load. Trigger `target_temp_c=80`. Verify `vllm:num_requests_running` drops within 2 steps and temp stabilizes without oscillation.
- **TTFT SLO:** Ensure swapped requests don't starve. P99 TTFT should not regress >20% when policy is enabled but idle.
- **PCIe health:** Monitor `DCGM_FI_PROF_PCIE_TX_BYTES`. Swap rate must stay < 25 GB/s on H100. Higher means thrashing.
- **Beam search correctness:** Run `lm-eval --model vllm --tasks gsm8k --num_fewshot 5 --limit 100` with `n=4`. Score must match baseline within 0.5% after thermal events.
- **Multi-LoRA safety:** Confirm evicted adapters are also swapped. No VRAM leak via orphaned adapter slots.
- **Graceful degradation:** With `--enable-chunked-prefill`, a forced evict mid-prefill must not corrupt KV cache. Test via fault injection.
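The TTFT SLO item is mechanical to check from two latency samples. A minimal harness sketch using nearest-rank percentiles (function names and the 20% threshold are ours, matching the checklist):

```python
import math

def p99(samples: list) -> float:
    """Nearest-rank p99 over a list of TTFT latencies (seconds)."""
    s = sorted(samples)
    idx = max(0, math.ceil(0.99 * len(s)) - 1)
    return s[idx]

def ttft_slo_ok(baseline_s: list, with_policy_s: list,
                max_regression: float = 0.20) -> bool:
    """Pass iff P99 TTFT with the policy enabled regresses <= 20%."""
    return p99(with_policy_s) <= p99(baseline_s) * (1 + max_regression)
```

Feed it the `vllm:time_to_first_token_seconds` raw samples from a run with the policy disabled and one with it enabled but idle.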
Comments? Open a thread on the vLLM Discussion Board or ping @runtime-metrics-team. Reference implementation: thermal-ctrl-harness. PRs welcome.
4.1 Production Gotchas #
4.1.1 P-Controller Hysteresis: Prevent Thrashing #
A naive P-controller will thrash at the threshold. If you evict at 82.0°C and resume at 81.9°C, you’ll oscillate and kill p99 worse than doing nothing.
class NVIDIADCGMThermalPolicy:
    def __init__(self, target_c=82, k_p=0.1, hysteresis_c=3.0):
        self.target_c = target_c
        self.hysteresis_c = hysteresis_c  # 2-3C buffer required
        self.throttling = False

    def should_throttle(self, temp_c):
        if not self.throttling and temp_c >= self.target_c:
            self.throttling = True
            return True
        elif self.throttling and temp_c < self.target_c - self.hysteresis_c:
            self.throttling = False
            return False
        return self.throttling
Critical
Without hysteresis, you’ll evict/resume every 100ms at 81.9°C/82.0°C. Set hysteresis_c >= 2.0 for production.
4.1.2 Swap Thrashing: The Victim Request TTFT Spike #
swap_out() moves KV cache D2H (GPU -> CPU), and resuming the victim pays the H2D copy back. While PCIe bandwidth is high, that round-trip latency kills the "victim" request.
Math: PCIe 5.0 x16 = 64 GB/s theoretical, ~40 GB/s sustained. 16 GB of KV cache ≈ 400 ms just for the transfer each way, before kernel launch overhead.
Rule: Only swap_out() if estimated_queue_time_for_new_request > estimated_swap_in_latency_for_victim. Otherwise you’re making p99 worse to save p50.
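That rule is a one-liner once you fix a sustained PCIe bandwidth estimate. A sketch (the 40 GB/s default mirrors the sustained figure above; the function name is ours):

```python
def should_swap_out(kv_bytes: int,
                    est_queue_time_s: float,
                    pcie_gbps: float = 40.0) -> bool:
    """Swap only when the newcomer's estimated queue wait exceeds the
    victim's swap-in cost, i.e. the H2D copy paid when it resumes."""
    swap_in_latency_s = kv_bytes / (pcie_gbps * 1e9)
    return est_queue_time_s > swap_in_latency_s

# 16 GiB of KV -> ~0.43 s swap-in; a 0.3 s queue wait doesn't justify it
print(should_swap_out(16 * 1024**3, est_queue_time_s=0.3))  # False
```

A fuller version would also charge the D2H copy against the step in which the eviction happens, but the asymmetric form above captures the p99-vs-p50 trade the rule describes.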
Monitor with:
histogram_quantile(0.99,
rate(vllm_time_to_first_token_seconds_bucket{request_status="swapped"}[1m])
)
Warning
Keep D2H swap bandwidth below 20 GB/s cluster-wide or you’ll saturate PCIe and cause TTFT spikes across all requests.
4.1.3 Multi-GPU / Tensor Parallel: One Hot GPU Poisons the Pipeline #
In TP=8 deployments, thermal throttling often hits one GPU first. But every NCCL all-reduce in every transformer layer blocks on the slowest GPU.
Result: GPU3 at 85°C makes GPUs 0,1,2,4,5,6,7 idle. The whole pipeline throttles.
Solution: Thermal policy must be global across the TP group:
import torch
import torch.distributed as dist

local_temp = get_hbm_temp()
global_temp = torch.tensor(local_temp).cuda()
dist.all_reduce(global_temp, op=dist.ReduceOp.MAX)  # worst GPU wins
if global_temp.item() >= target_c:
    # All ranks reduce batch together
    throttle_batch_globally()
DCGM tip: Use dcgmi group -c <tp_group_ids> to monitor all TP GPUs atomically. If any GPU in the group is hot, treat all as hot.