vLLM Deep Dives
September 4, 2025 Technical Deep Dive ~22 min read

vLLM Internals: Where to Actually Cut Batch Size When the GPU is Melting

Production vLLM deployments running at 90%+ GPU util will eventually hit thermal limits. Counterintuitively, reducing max_num_seqs during a thermal event often does nothing. This post traces exactly where batch size is enforced, why decode batches ignore your knobs, and how to build a thermal-aware scheduler that actually saves your GPUs.

vLLM >= 0.5.4 CUDA 12.1+ Scheduler KV Cache

Target Audience

This post assumes familiarity with vLLM architecture, KV cache semantics, and GPU profiling. We reference commit 8f3a2c1 from vllm-project/vllm. YMMV on forks.

1. Code Walk: vllm/core/scheduler.py schedule() Loop #

The core scheduling decision lives in Scheduler.schedule(). As of v0.5.4, the method signature hasn't changed materially since the 0.4.0 refactor. The loop enforces max_num_seqs at multiple points, but the ordering matters.

Key excerpt from vllm/core/scheduler.py:470:

class Scheduler:

    def schedule(self) -> SchedulerOutputs:
        # 1. Schedule running requests that can make progress
        running_scheduled: List[SequenceGroup] = []
        running_queue = self.running
        budget = self.scheduler_config.max_num_seqs
        
        while running_queue:
            seq_group = running_queue[0]
            num_new_seqs = seq_group.num_seqs()
            if num_new_seqs > budget:
                break
            
            # Check if we can allocate KV blocks for decode
            if not self._can_allocate(seq_group):
                break
                
            budget -= num_new_seqs
            running_queue.popleft()
            running_scheduled.append(seq_group)

        # 2. Schedule waiting requests only if budget remains
        waiting_scheduled: List[SequenceGroup] = []
        while self.waiting and budget > 0:
            seq_group = self.waiting[0]
            num_new_seqs = seq_group.num_seqs()
            if num_new_seqs > budget:
                break
                
            # Prefill: requires full context_length blocks
            if not self._can_allocate(seq_group, is_prefill=True):
                # Trigger preemption if enabled
                if self._can_preempt(seq_group):
                    self._preempt(seq_group)
                    continue
                else:
                    break
                    
            budget -= num_new_seqs
            self.waiting.popleft()
            waiting_scheduled.append(seq_group)
        
        return SchedulerOutputs(
            scheduled_seq_groups=running_scheduled + waiting_scheduled,
            # ...
        )

Critical observation: max_num_seqs becomes budget and is decremented per SequenceGroup, not per token. A SequenceGroup with n=4 for beam search consumes 4 from budget. Once running_queue is exhausted, new prefills only run if budget remains.

The subtle part: self.running is populated every iteration from the previous step's scheduled_seq_groups plus any resumed requests. If you hot-reload scheduler_config.max_num_seqs, it takes effect immediately but only for new scheduling decisions.
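To make the budget accounting concrete, here is a toy sketch of the loop above (`drain_budget` is an illustrative name, not a vLLM function). Note how one beam-search group with n=4 drains four slots at once, and how a group that doesn't fit stops the loop rather than being skipped:

```python
from collections import deque

def drain_budget(queue, max_num_seqs):
    """Mimic the per-SequenceGroup budget decrement in schedule()."""
    budget = max_num_seqs
    scheduled = []
    while queue:
        num_seqs = queue[0]          # seqs in the group (e.g. beam width)
        if num_seqs > budget:
            break                    # group doesn't fit; stop, don't skip
        budget -= num_seqs
        scheduled.append(queue.popleft())
    return scheduled, budget

# One n=4 beam group plus three single-seq groups, cap of 8: all fit
scheduled, left = drain_budget(deque([4, 1, 1, 1]), max_num_seqs=8)
# Three n=4 groups, cap of 8: only two fit, the third waits
scheduled2, left2 = drain_budget(deque([4, 4, 4]), max_num_seqs=8)
```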

2. Why Cutting max_num_seqs Alone Doesn't Help Running Decode Requests #

When operators see GPU temp at 89°C, the naive fix is to drop max_num_seqs to 8, however your deployment reloads config. This helps if your workload is prefill-bound. It does nothing for decode-bound serving, because decode is not gated by max_num_seqs once a request is RUNNING.

Three reasons:

  1. No mid-generation eviction: Once a seq is in self.running, schedule() keeps scheduling it unless _can_allocate() returns False. For decode, _can_allocate only checks if 1 new token block is needed. If the block is already mapped, it always returns True.
  2. Budget is advisory, not preemptive: The budget check if num_new_seqs > budget: break only applies to adding new items to running_scheduled. It never kicks existing items out.
  3. KV blocks are sticky: The BlockSpaceManager doesn't reclaim blocks from RUNNING sequences unless swap_out or free is called explicitly. Thermal pressure is not an existing trigger.

Production Gotcha

At 8k context length, 64 decode sequences on an 80-layer 70B-class model with GQA (8 KV heads × head_dim 128, i.e. ~1 KiB of KV width per token per layer) hold 64 * 8192 tokens * 2 (K and V) * 80 layers * 1024 * 2 bytes (fp16) ≈ 172 GB of KV cache. If you hit thermal throttling here, reducing max_num_seqs from 64 to 32 changes nothing until in-flight requests complete. Your GPU will stay at TjMax.
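To check the arithmetic, here is a minimal sketch (`kv_cache_bytes` is an illustrative helper; the model shape is an assumption matching a 70B-class GQA config with 8 KV heads × head_dim 128):

```python
def kv_cache_bytes(num_seqs, ctx_len, num_layers, kv_heads, head_dim,
                   dtype_bytes=2):
    """Total KV cache: tokens * 2 (K and V) * layers * KV width * bytes."""
    kv_width = kv_heads * head_dim           # per-token, per-layer KV hidden
    return num_seqs * ctx_len * 2 * num_layers * kv_width * dtype_bytes

# 64 seqs at 8k ctx on an 80-layer model with GQA (8 kv heads x 128):
total = kv_cache_bytes(64, 8192, 80, 8, 128)   # ~172 GB (160 GiB)
```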

The only way to shrink the active batch is to move requests from RUNNING -> SWAPPED or RUNNING -> WAITING. vLLM can preempt, but the preemption policy in schedule() only triggers when a prefill can't allocate. We need a new trigger: THERMAL_PRESSURE.

3. KV Cache Block Manager and How to Trigger swap_out() #

The BlockSpaceManagerV1 in vllm/core/block_manager.py:285 owns all GPU/CPU block tables. Swapping is implemented via swap_out(seq_group), which issues asynchronous CUDA D2H (GPU-to-CPU) copies.

# vllm/core/block_manager.py
class BlockSpaceManagerV1:

    def can_swap_out(self, seq_group: SequenceGroup) -> bool:
        num_required_blocks = self._get_num_required_blocks(seq_group)
        # Swap-out destination is host memory: check CPU allocator headroom
        num_free_blocks = self.cpu_allocator.get_num_free_blocks()
        num_required_blocks += int(num_required_blocks
                                   * self.watermark)  # 0.01 default
        return num_required_blocks <= num_free_blocks

    def swap_out(self, seq_group: SequenceGroup) -> None:
        seqs = seq_group.get_seqs(status=SequenceStatus.RUNNING)
        for seq in seqs:
            blocks = self.block_tables[seq.seq_id]
            # Asynchronously copy blocks GPU -> CPU
            self._swap_blocks(blocks, self.gpu_allocator, self.cpu_allocator)
            
        self._last_access_time_table[seq_group.request_id] = time.time()
        
    def _swap_blocks(self, blocks: List[PhysicalTokenBlock], 
                     src: BaseGPUAllocator, dst: BaseCPUAllocator):
        for block in blocks:
            if block.computed:  # Only swap computed blocks
                dst.allocate_block(block.block_number)
                src.swap_out(block)  # cudaMemcpyAsync under the hood

To forcibly reduce GPU load, we must: 1) Select victim SequenceGroups from self.running, 2) Call block_manager.swap_out(), 3) Move them to self.swapped. This is exactly what _preempt() does, but preempt is only called during prefill starvation.

Minimal thermal policy implementation sketched:

# Extension to Scheduler
def schedule_thermal_aware(self, gpu_temp_c: float) -> SchedulerOutputs:
    target_running = self._calc_thermal_target_batch(gpu_temp_c)
    current_running = len(self.running)
    
    if current_running > target_running:
        # Sort by lowest priority: longest remaining, lowest QPS
        victims = self._select_victims(current_running - target_running)
        for sg in victims:
            self.block_manager.swap_out(sg)
            self.running.remove(sg)
            self.swapped.append(sg)
            self._last_preemption[sg.request_id] = time.time()
            
    return self.schedule()  # Continue normal scheduling

The missing piece is _calc_thermal_target_batch. Empirically, decode power draw is roughly linear in batch size at fixed context length. On H100 SXM we measured ~3.2 W per running decode sequence at 4k context, so to shed 100 W, evict ~31 running sequences.
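A plausible sketch of that missing piece, assuming the measured ~3.2 W/seq figure and a hypothetical watts_per_deg_c coefficient mapping excess temperature to power to shed (both constants would need calibration on your hardware):

```python
def calc_thermal_target_batch(gpu_temp_c, current_running,
                              target_temp_c=80.0,
                              watts_per_seq=3.2,      # measured, H100 SXM @ 4k ctx
                              watts_per_deg_c=25.0):  # assumed thermal coefficient
    """Map excess temperature to a batch-size target via measured W/seq."""
    excess_c = gpu_temp_c - target_temp_c
    if excess_c <= 0:
        return current_running                  # no pressure, keep the batch
    watts_to_shed = excess_c * watts_per_deg_c  # e.g. 4C over -> shed ~100 W
    evict = int(watts_to_shed / watts_per_seq)  # ~31 seqs per 100 W
    return max(1, current_running - evict)      # never evict the whole batch
```

With these constants, a GPU 4°C over target with 64 running sequences gets a target of 33, i.e. 31 evictions.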

4. Proposed /v1/admin/batch API Design #

vLLM's OpenAI-compatible server lacks runtime control planes. We propose a namespaced admin API for dataplane operators. This should be behind --enable-admin-api flag and mTLS.

Endpoint: POST /v1/admin/batch

Request Body:

# pydantic model
class BatchControlRequest(BaseModel):
    max_num_seqs: Optional[int] = Field(
        None, description="New max_num_seqs for scheduler_config")
    force_evict: Optional[int] = Field(
        None, ge=0, description="Immediately swap out N RUNNING sequences")
    target_temp_c: Optional[float] = Field(
        None, le=95.0, description="Target GPU temp. vLLM will evict until reached")
    policy: Literal["lru", "largest_kv", "lowest_priority"] = "lru"
    dry_run: bool = False

Example: NVML reports 91°C, we want 80°C ceiling.

curl -X POST https://vllm-0:8000/v1/admin/batch \
  -H "Authorization: Bearer $VLLM_ADMIN_TOKEN" \
  -d '{"target_temp_c": 80.0, "policy": "largest_kv"}'

Response:

class BatchControlResponse(BaseModel):
    previous_running: int
    new_running: int
    evicted_request_ids: List[str]
    estimated_watts_saved: float
    new_max_num_seqs: int

Implementation: The API handler injects a ControlEvent into the AsyncLLMEngine event loop. The scheduler checks for pending events at the top of step(). If target_temp_c is set, the scheduler enters a control loop until NVML reports a temperature below target. See the reference implementation in thermal-ctrl-harness, which uses DCGM + gRPC.
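A minimal sketch of that event mailbox. Only ControlEvent appears in the proposal; ControlPlane and its methods are illustrative names for the handoff between the HTTP handler thread and the engine loop:

```python
import queue
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ControlEvent:
    max_num_seqs: Optional[int] = None
    force_evict: int = 0
    target_temp_c: Optional[float] = None

class ControlPlane:
    """Thread-safe mailbox: the admin handler writes, step() drains."""
    def __init__(self) -> None:
        self._events: "queue.Queue[ControlEvent]" = queue.Queue()

    def submit(self, ev: ControlEvent) -> None:
        self._events.put(ev)                # called from the HTTP handler

    def drain(self) -> List[ControlEvent]:
        evs: List[ControlEvent] = []
        while True:
            try:
                evs.append(self._events.get_nowait())
            except queue.Empty:
                return evs                  # checked at the top of step()
```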

Safety Note

Evicting too aggressively causes swap thrashing. Empirically, keep swap bandwidth below 20 GB/s or the D2H/H2D copies will dominate step time. Always combine with enable_chunked_prefill=True to avoid head-of-line blocking.

5. Request Lifecycle with Thermal Interrupts #

The standard vLLM state machine is WAITING -> RUNNING -> FINISHED. With thermal interrupts we add RUNNING -> SWAPPED -> RUNNING transitions that are not prefill-driven.

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#0ea5e9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#0284c7', 'lineColor': '#52525b', 'secondaryColor': '#27272a', 'tertiaryColor': '#18181b' }}}%%
stateDiagram-v2
    [*] --> WAITING: New request
    WAITING --> RUNNING: schedule() + KV allocate
    RUNNING --> FINISHED: EOS or max_tokens
    RUNNING --> SWAPPED: Thermal preemption
    RUNNING --> SWAPPED: KV cache full
    SWAPPED --> RUNNING: Resume + KV allocate
    SWAPPED --> WAITING: Cancelled by client
    FINISHED --> [*]

    note right of RUNNING
        Thermal Control Hook
        POST /v1/admin/batch
        force_evict=N
    end note

    note left of SWAPPED
        Blocks in CPU RAM
        swap_in() latency ~50us/block
        swap_out() latency ~40us/block
    end note

Key insight: after a thermal interrupt, the swapped request re-enters WAITING priority queue. If you don't also reduce max_num_seqs, it will be immediately rescheduled and you'll oscillate. The /v1/admin/batch call should atomically set both force_evict and a lower max_num_seqs.
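One way to make that atomic, sketched against the hypothetical scheduler extension from section 3 (`apply_batch_control` and `_select_victims` are illustrative names): lower the cap first, then evict, all within a single scheduler step so the victims cannot be rescheduled under the old cap:

```python
def apply_batch_control(scheduler, new_max_num_seqs: int,
                        force_evict: int) -> None:
    """Apply cap and eviction together so the swapped requests cannot
    be rescheduled before the lower cap takes effect."""
    scheduler.scheduler_config.max_num_seqs = new_max_num_seqs  # cap first
    victims = scheduler._select_victims(force_evict)            # then evict
    for sg in victims:
        scheduler.block_manager.swap_out(sg)
        scheduler.running.remove(sg)
        scheduler.swapped.append(sg)
```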

6. SchedulerThermalPolicy Plugin Interface RFC #

Hardcoding thermal logic in Scheduler is not tenable. Different orgs use DCGM, nvidia-smi, ROCm-SMI, or custom BMC telemetry. We propose a plugin interface merged into vllm/core/policy.py.

# vllm/core/policy.py - proposed new file
from abc import ABC, abstractmethod
from typing import List, TYPE_CHECKING
if TYPE_CHECKING:
    from vllm.core.scheduler import Scheduler
    from vllm.sequence import SequenceGroup

class SchedulerThermalPolicy(ABC):
    """Decides which sequences to evict under thermal/gpu pressure."""
    
    @abstractmethod
    def get_max_running_seqs(self, scheduler: "Scheduler") -> int:
        """Return dynamic cap for len(self.running). Must be <= max_num_seqs."""
        raise NotImplementedError
    
    @abstractmethod
    def select_victims(self, 
                       running: List["SequenceGroup"], 
                       num_to_evict: int) -> List["SequenceGroup"]:
        """Pick N sequences to swap_out. Called when len(running) > cap."""
        raise NotImplementedError
        
    def on_temp_reading(self, gpu_idx: int, temp_c: float) -> None:
        """Optional: scheduler calls this every step with latest DCGM reading."""
        pass

# Example implementation
class NVIDIADCGMThermalPolicy(SchedulerThermalPolicy):
    def __init__(self, target_temp_c: float = 82.0, k_p: float = 0.5):
        self.target = target_temp_c
        self.k_p = k_p  # Proportional gain
        self._last_temp = 80.0
        
    def on_temp_reading(self, gpu_idx: int, temp_c: float):
        self._last_temp = temp_c
        
    def get_max_running_seqs(self, scheduler: "Scheduler") -> int:
        # Simple P-controller: error = temp - target
        error = self._last_temp - self.target
        if error <= 0:
            return scheduler.scheduler_config.max_num_seqs
        # Drop 1 seq per 2°C over target, tunable via k_p
        reduction = int(error * self.k_p)
        return max(1, scheduler.scheduler_config.max_num_seqs - reduction)
        
    def select_victims(self, running, num_to_evict):
        # Evict longest KV first - frees most watts
        return sorted(running, 
                      key=lambda sg: sg.get_num_kv_blocks(), 
                      reverse=True)[:num_to_evict]

The Scheduler then loads the policy via entrypoints:

# vllm/core/scheduler.py
class Scheduler:
    def __init__(self, scheduler_config, cache_config, ...):
        # ...
        self.thermal_policy: Optional[SchedulerThermalPolicy] = None
        if envs.VLLM_THERMAL_POLICY:
            self.thermal_policy = load_thermal_policy(envs.VLLM_THERMAL_POLICY)
            
    def schedule(self):
        if self.thermal_policy:
            temp = self._read_gpu_temp()  # via NVML or sidecar
            self.thermal_policy.on_temp_reading(0, temp)
            dynamic_cap = self.thermal_policy.get_max_running_seqs(self)
            self._enforce_dynamic_cap(dynamic_cap)
            
        # ... continue normal scheduling

This interface lets cloud providers ship proprietary policies without upstream changes. The thermal-ctrl-harness repo provides a reference DCGMThermalPolicy and sidecar.

7. Observability: Detecting When to Cut Batch Size #

You cannot control what you don't measure. vLLM already exports Prometheus metrics. Combine with DCGM for full picture.

Key metrics:

# vllm metrics
vllm:num_requests_running          # Current batch
vllm:num_requests_swapped          # Evicted due to capacity
vllm:gpu_cache_usage_perc          # 0-1, KV cache util
vllm:time_to_first_token_seconds   # Prefill SLO

# DCGM metrics
DCGM_FI_DEV_GPU_TEMP               # Gauge, Celsius
DCGM_FI_DEV_POWER_USAGE            # Watts
DCGM_FI_PROF_PCIE_TX_BYTES         # Detect swap thrashing

Alert rule: Fire when temp > 85C AND batch isn't shrinking.

groups:
- name: vllm_thermal.rules
  rules:
  - alert: VLLMThermalRunaway
    expr: |
      max_over_time(DCGM_FI_DEV_GPU_TEMP{pod=~"vllm-.*"}[2m]) > 85
      and
      deriv(vllm:num_requests_running[2m]) >= 0
    for: 30s
    labels:
      severity: page
    annotations:
      summary: "vLLM GPU {{ $labels.gpu }} at {{ $value }}C but batch not reducing"
      runbook: "curl -X POST .../v1/admin/batch -d '{\"target_temp_c\": 80}'"

Dashboard PromQL: Estimate watts per request to tune eviction.

# Watts per running decode seq
DCGM_FI_DEV_POWER_USAGE 
/ 
vllm:num_requests_running

Upstream RFC: Thermal-Aware Scheduling #

Summary: Add optional SchedulerThermalPolicy plugin and /v1/admin/batch endpoint to allow dynamic batch reduction under thermal/power constraints. This addresses production incidents where vLLM nodes hit TjMax and trigger host-level XID 79 without recourse.

Motivation: Large-context serving on H100/A100 often runs at 700W+. Datacenter cooling failures or co-located noisy neighbors can push GPUs to thermal throttle. Current max_num_seqs is insufficient as it doesn't preempt. We need programmatic control.

Detailed Design:

  1. Add vllm/core/policy.py with ABC shown above.
  2. Add --thermal-policy flag to load via entry_points.
  3. Extend LLMEngine with abort_and_swap(request_ids) API.
  4. Add /v1/admin/batch to OpenAI server behind --enable-admin-api.
  5. DCGM sidecar optional, default policy can use pynvml.

Drawbacks: Increases API surface. Risk of operator shooting themselves in foot by evicting too aggressively. Mitigated by dry_run and rate limits.

Alternatives: Keep it out-of-tree. We believe thermal handling is table-stakes for large providers and should be in-tree but optional.

Validation Checklist #

If you implement this, validate with:

  • Thermal soak test: Pin GPU at 700W with synthetic load. Trigger target_temp_c=80. Verify vllm:num_requests_running drops within 2 steps and temp stabilizes without oscillation.
  • TTFT SLO: Ensure swapped requests don't starve. P99 TTFT should not regress >20% when policy is enabled but idle.
  • PCIE health: Monitor DCGM_FI_PROF_PCIE_TX_BYTES. Swap rate must stay < 25 GB/s on H100. Higher means thrashing.
  • Beam search correctness: Run lm-eval --model vllm --tasks gsm8k --num_fewshot 5 --limit 100 with n=4. Score must match baseline within 0.5% after thermal events.
  • Multi-LoRA safety: Confirm evicted adapters are also swapped. No VRAM leak via orphaned adapter slots.
  • Graceful degradation: With --enable-chunked-prefill, a forced evict mid-prefill must not corrupt KV cache. Test via fault injection.

Comments? Open a thread on the vLLM Discussion Board or ping @runtime-metrics-team. Reference implementation: thermal-ctrl-harness. PRs welcome.

4.1 Production Gotchas #

4.1.1 P-Controller Hysteresis: Prevent Thrashing #

A naive P-controller will thrash at the threshold. If you evict at 82.0°C and resume at 81.9°C, you'll oscillate and hurt p99 latency more than doing nothing.

class NVIDIADCGMThermalPolicy:
    def __init__(self, target_c=82.0, hysteresis_c=3.0):
        self.target_c = target_c
        self.hysteresis_c = hysteresis_c  # 2-3C buffer required
        self.throttling = False

    def should_throttle(self, temp_c):
        if not self.throttling and temp_c >= self.target_c:
            self.throttling = True
            return True
        elif self.throttling and temp_c < self.target_c - self.hysteresis_c:
            self.throttling = False
            return False
        return self.throttling

Critical

Without hysteresis, you’ll evict/resume every 100ms at 81.9°C/82.0°C. Set hysteresis_c >= 2.0 for production.
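Restating the policy above as a standalone snippet makes the hysteresis behavior easy to verify against a temperature trace (`HysteresisThrottle` is just a renamed, self-contained copy of the logic):

```python
class HysteresisThrottle:
    """Same logic as the policy sketch above, minimal and standalone."""
    def __init__(self, target_c=82.0, hysteresis_c=3.0):
        self.target_c = target_c
        self.hysteresis_c = hysteresis_c
        self.throttling = False

    def should_throttle(self, temp_c):
        if not self.throttling and temp_c >= self.target_c:
            self.throttling = True                      # engage at target
        elif self.throttling and temp_c < self.target_c - self.hysteresis_c:
            self.throttling = False                     # release 3C below
        return self.throttling

t = HysteresisThrottle()
trace = [81.9, 82.0, 81.5, 79.5, 78.9, 78.5]
states = [t.should_throttle(x) for x in trace]
# Stays throttled through 79.5 (still >= 82 - 3); releases only below 79.0
```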

4.1.2 Swap Thrashing: The Victim Request TTFT Spike #

swap_out() moves KV cache from GPU to host (D2H). While PCIe bandwidth is high, the transfer latency lands entirely on the "victim" request.

Math: PCIe 5.0 x16 = 64 GB/s theoretical, ~40 GB/s sustained. 16 GB of KV cache = 400 ms just for the D2H transfer, before any scheduling overhead, and the victim pays it again (H2D) on swap-in.

Rule: Only swap_out() if estimated_queue_time_for_new_request > estimated_swap_in_latency_for_victim. Otherwise you’re making p99 worse to save p50.
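That rule can be sketched as a simple latency comparison (`should_swap_out` and its parameters are illustrative; the victim pays the D2H copy now and the H2D copy again on resume):

```python
def should_swap_out(kv_bytes_victim: int,
                    queue_wait_s_new: float,
                    pcie_gbps: float = 40.0) -> bool:
    """Swap only when the newcomer's queue wait exceeds the round trip
    the victim pays: swap-out (D2H) now plus swap-in (H2D) later."""
    one_way_s = kv_bytes_victim / (pcie_gbps * 1e9)
    round_trip_s = 2 * one_way_s
    return queue_wait_s_new > round_trip_s

# 16 GB of victim KV at ~40 GB/s sustained: 0.4 s each way, 0.8 s round trip.
# A newcomer facing a 1 s queue wait justifies the swap; a 0.5 s wait does not.
```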

Monitor with:

histogram_quantile(0.99,
  rate(vllm:time_to_first_token_seconds_bucket{request_status="swapped"}[1m])
)

Warning

Keep D2H swap bandwidth below 20 GB/s cluster-wide or you’ll saturate PCIe and cause TTFT spikes across all requests.

4.1.3 Multi-GPU / Tensor Parallel: One Hot GPU Poisons the Pipeline #

In TP=8 deployments, thermal throttling often hits one GPU first. But every NCCL all-reduce in every transformer layer blocks on the slowest GPU.

Result: GPU3 at 85°C makes GPUs 0,1,2,4,5,6,7 idle. The whole pipeline throttles.

Solution: Thermal policy must be global across the TP group:

import torch
import torch.distributed as dist

local_temp = get_hbm_temp()  # placeholder: read HBM temp via NVML/DCGM
global_temp = torch.tensor(float(local_temp), device="cuda")
dist.all_reduce(global_temp, op=dist.ReduceOp.MAX)  # worst GPU wins

if global_temp.item() >= target_c:
    # All ranks cut batch together so no rank stalls the collective
    throttle_batch_globally()

DCGM tip: Use dcgmi group -c <tp_group_ids> to monitor all TP GPUs atomically. If any GPU in the group is hot, treat all as hot.