# Overlap Scheduler

To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating responses, scheduling the next batch) with GPU computation.

## How It Works

At step n, the system launches GPU computation for step n+1 without waiting for CPU tasks (e.g., stop criteria checks) from step n to complete. This allows:

  • CPU work (step n) and GPU computation (step n+1) to run concurrently.

  • Better GPU occupancy by reducing idle time.
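The pipelining described above can be modeled with a toy Python sketch. This is not the actual `PyExecutor` implementation; the function and variable names are illustrative, and a background thread stands in for an asynchronous GPU kernel launch. The point is the ordering: step n's "GPU" work is launched first, and only then is step n-1's CPU-side result processed.

```python
import threading
import time


def overlap_scheduler_sketch(num_steps=4):
    """Toy model of overlap scheduling: step n-1's CPU post-processing
    runs while step n's 'GPU' work executes on a worker thread."""
    timeline = []
    previous = None  # (thread, result holder, step index) of the in-flight step

    def gpu_forward(step):
        # Stand-in for an asynchronously launched GPU kernel.
        time.sleep(0.01)
        return f"logits_{step}"

    for step in range(num_steps):
        # Launch 'GPU' work for the current step without waiting on it.
        result = {}
        t = threading.Thread(
            target=lambda r=result, s=step: r.update(out=gpu_forward(s)))
        t.start()

        # While the GPU is busy, process CPU-bound results of the previous step.
        if previous is not None:
            prev_thread, prev_result, prev_step = previous
            prev_thread.join()  # in the real system: sync on the sample state
            timeline.append(("cpu_done", prev_step, prev_result["out"]))

        previous = (t, result, step)

    # Drain the final in-flight step.
    prev_thread, prev_result, prev_step = previous
    prev_thread.join()
    timeline.append(("cpu_done", prev_step, prev_result["out"]))
    return timeline
```

Running the sketch shows that every step's CPU work completes, just one step behind its GPU launch.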

This concurrent execution pipeline is illustrated in the `PyExecutor`'s logic:

```python
# Schedule and launch GPU work for the current step (n)
scheduled_batch, _, _ = self._schedule()
batch_outputs = self._forward_step(scheduled_batch, previous_tensors_device)
sample_state = self._sample_async(scheduled_batch, batch_outputs)

# While the GPU is busy, process the CPU-bound results from the previous step (n-1)
if self.previous_batch is not None:
    self._process_previous_batch()
```

## Tradeoff

The optimization introduces one extra decoding step per sequence: because step n's stop criteria are only checked after step n+1 has already been launched, generation runs one step past the point where it would otherwise stop. In exchange, throughput improves significantly.
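The one-extra-step cost can be made concrete with a small counting model. This is an illustrative sketch, not code from the scheduler: `stop_at` is the step index at which the stop criterion is met, and with overlap enabled the check for step n can only happen while step n+1 is running.

```python
def decode_steps(stop_at, overlap):
    """Count forward steps until the stop criterion is observed.

    Without overlap, step n's stop check runs right after step n.
    With overlap, only step n-1's result is available when step n
    is launched, so the stop is observed one step late.
    """
    steps = 0
    n = 0
    while True:
        steps += 1  # launch forward step n
        # Which step's stop criteria can be evaluated right now?
        checkable = n - 1 if overlap else n
        if checkable >= stop_at:
            break
        n += 1
    return steps
```

For any `stop_at`, the overlapped schedule runs exactly one more forward step than the non-overlapped one.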

## Usage

Enabled by default. To disable it, set `disable_overlap_scheduler=True` in the configuration.
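If the configuration is supplied as a YAML options file, disabling the scheduler might look like the fragment below. Only the `disable_overlap_scheduler` field comes from the text above; the file name and any surrounding keys depend on your deployment.

```yaml
# example options file (illustrative); only this key is documented above
disable_overlap_scheduler: true
```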
