Continuous Batching and the Path to Real Throughput Gains

The Orca paper from 2022 introduced continuous batching, also called iteration-level scheduling, and reported the headline gains: up to roughly 36x higher throughput than a request-level (static) batching baseline for LLM serving workloads. Those numbers are measurements, and they are real under the specific experimental conditions described. They are not what you will see by enabling continuous batching in a production system without understanding the conditions under which the gains hold.

This piece is about closing the gap between the benchmark case and the operational reality. Teams building on top of vLLM, TGI, or any other serving framework that supports continuous batching should understand that gap before they anchor on throughput projections that won't survive contact with a real request distribution.

Why static batching is as bad as it is

The baseline problem: in static batching, a batch of requests is dispatched to the GPU together, and the batch completes only when every request in it has generated its final token. A request that generates 20 tokens needs a fraction of the time of a request that generates 200 tokens, but both hold their batch slot and KV cache memory until the longest request in the batch completes. In the meantime the GPU keeps running forward passes for the finished requests, which only emit padding tokens while the long-running requests finish.

This is waste at two levels: memory waste (KV cache for completed sequences occupies GPU memory that can't be used for new requests) and compute waste (kernel calls for sequences that have nothing left to compute). With realistic request length distributions, where output lengths vary widely within a batch, the waste compounds, and the hardware delivers far less than its achievable throughput.
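To put a number on the compute side of that waste, here is a minimal sketch that counts useful versus padded decode steps for one static batch. The function name and the example batch are illustrative assumptions, and the model deliberately ignores prefill cost and the fact that per-iteration time varies with batch size.

```python
def static_batch_waste(output_lengths):
    """Return (useful_steps, wasted_steps, wasted_fraction) for one static batch.

    Every request occupies its slot for max(output_lengths) decode iterations,
    but only the first `length` of those iterations produce a real token.
    """
    if not output_lengths:
        raise ValueError("batch must contain at least one request")
    longest = max(output_lengths)
    total_slot_steps = longest * len(output_lengths)
    useful = sum(output_lengths)
    wasted = total_slot_steps - useful
    return useful, wasted, wasted / total_slot_steps


# Hypothetical batch: three short answers and one long one.
useful, wasted, fraction = static_batch_waste([20, 20, 20, 200])
print(f"useful={useful} wasted={wasted} wasted_fraction={fraction:.1%}")
```

In this made-up batch, a single long request turns roughly two thirds of the slot-steps into padding, which is the compounding effect described above.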

What continuous batching fixes — and what it doesn't

Continuous batching solves the padding waste problem by allowing new requests to join the active batch at iteration (forward pass) boundaries. When a request finishes generating its output, its memory is released immediately and a queued request can be promoted into the active batch at the next iteration. The GPU never runs forward passes on completed sequences, so utilization on the decode path stays high as long as requests are waiting in the queue.
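The difference between the two policies is easy to see in a toy simulation. The sketch below is not how any particular framework implements its scheduler; it assumes a fixed number of slots, one token per active request per iteration, positive output lengths, and no prefill cost, and it only counts decode iterations.

```python
from collections import deque


def static_iterations(lengths, slots):
    """Decode iterations when each batch of `slots` requests runs to completion."""
    total = 0
    for start in range(0, len(lengths), slots):
        # The whole batch is held until its longest request finishes.
        total += max(lengths[start:start + slots])
    return total


def continuous_iterations(lengths, slots):
    """Decode iterations when freed slots are refilled at every iteration boundary."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    iterations = 0
    while queue or active:
        # Admit queued requests into any free slots before the next forward pass.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One forward pass produces one token for every active request.
        active = [remaining - 1 for remaining in active]
        # Finished requests leave immediately, freeing their slots.
        active = [remaining for remaining in active if remaining > 0]
        iterations += 1
    return iterations


# Hypothetical queue: one long request followed by thirty short ones.
workload = [200] + [20] * 30
print(static_iterations(workload, slots=4), continuous_iterations(workload, slots=4))
```

With this workload the short requests flow through three slots while the long request occupies the fourth, instead of waiting behind it batch by batch.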

What continuous batching doesn't automatically solve: memory pressure from the prefill phase. When a new request joins the batch, it has to run prefill, processing the full prompt to build its initial KV cache. Prefill is compute-intensive and memory-hungry. Admitting a long-prompt request while the scheduler is already running a large active batch of decode-phase requests can exhaust KV cache memory and force preemption. This is the most common cause of the throughput gap between benchmark conditions and production conditions.
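One way to reason about that pressure is an admission check that reserves growth room for requests already decoding before it lets a new prompt in. This is a hedged sketch, not any framework's actual policy: `can_admit`, the token-denominated budget, and the `headroom_per_decode` value are all assumptions made for illustration.

```python
def can_admit(prompt_tokens, free_kv_tokens, active_decodes, headroom_per_decode=64):
    """Decide whether a new request's prefill fits without squeezing active decodes.

    prompt_tokens:       KV entries the new request needs right after prefill
    free_kv_tokens:      KV capacity currently unallocated, in tokens
    active_decodes:      number of requests already generating tokens
    headroom_per_decode: hypothetical growth reserve kept for each active request
    """
    reserved = active_decodes * headroom_per_decode
    return prompt_tokens + reserved <= free_kv_tokens


# Hypothetical state: 10,000 free KV tokens and 100 requests mid-decode.
print(can_admit(512, free_kv_tokens=10_000, active_decodes=100))    # short prompt
print(can_admit(8_000, free_kv_tokens=10_000, active_decodes=100))  # RAG-sized prompt
```

The short prompt fits inside the remaining budget and the long one does not, which is the situation where a scheduler either delays the new request or preempts something already running.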

The Orca experiments were run on workloads with relatively controlled request distributions. Production workloads are mixed: some short prompts with long outputs, some long prompts with short outputs, some with very long prompts (RAG retrieval context, code context windows). The scheduler has to handle prefill-decode interleaving across this distribution without causing memory pressure that forces preemption of in-progress decode sequences.

KV cache as the central scheduling resource

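The key insight for understanding continuous batching at production depth: KV cache memory is the primary scheduling resource, more than GPU compute. The KV cache for an active request grows with each generated token, and its per-token size scales with the number of layers, the number of KV heads, the head dimension, and the numeric precision. For a 70B parameter model with 80 layers of attention, a single request with a 2,048-token context can hold anywhere from several hundred megabytes to several gigabytes of HBM for KV state alone, depending on precision and on whether the model uses grouped-query attention.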
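The arithmetic behind that footprint is short enough to write down. The sketch below assumes 16-bit values, a head dimension of 128, and two illustrative head layouts for an 80-layer model (64 KV heads for full multi-head attention, 8 for grouped-query attention); none of these specifics come from the text above, so substitute your model's real configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one request, in bytes.

    The leading factor of 2 accounts for storing both a key vector and a
    value vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


MIB = 1024 ** 2

# Assumed configurations for an 80-layer model at a 2,048-token context.
full_mha = kv_cache_bytes(2048, num_layers=80, num_kv_heads=64, head_dim=128)
grouped = kv_cache_bytes(2048, num_layers=80, num_kv_heads=8, head_dim=128)

print(f"multi-head attention:    {full_mha / MIB:.0f} MiB")
print(f"grouped-query attention: {grouped / MIB:.0f} MiB")
```

Dividing the memory left over after model weights by the per-request figure gives a rough ceiling on concurrent sequences, which is why the cache rather than compute tends to cap batch size.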

PagedAttention — the memory management innovation introduced by vLLM — addresses KV cache fragmentation by treating KV cache memory like virtual memory pages. Instead of pre-allocating a contiguous block of memory per request (which leads to fragmentation and wasted capacity when requests have variable output lengths), PagedAttention allocates fixed-size KV cache blocks and maps them dynamically. This allows higher memory utilization and enables copy-on-write for beam search and parallel sampling scenarios.
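The bookkeeping side of that idea can be sketched in a few lines. This is a toy allocator written for illustration, not vLLM's implementation: the class name, the block size of 16 tokens, and the methods are assumptions, and it tracks only which blocks a request owns, not the tensors themselves.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.token_counts = {}  # request id -> tokens stored so far

    def append_token(self, request_id):
        """Record one more token, allocating a block only when the last one is full."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the scheduler must preempt")
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


allocator = BlockAllocator(num_blocks=4)
for _ in range(17):  # 17 tokens span two 16-token blocks
    allocator.append_token("req-a")
print(len(allocator.block_tables["req-a"]), len(allocator.free_blocks))
```

Because blocks are allocated only when the previous one fills, the unused capacity per request is bounded by one partially filled block rather than by the gap between a reserved maximum length and the actual output length.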

In practice, PagedAttention gets teams closer to theoretical memory utilization but doesn't eliminate the scheduling complexity. The scheduler still has to manage eviction when memory pressure requires preempting active requests: swapping their KV cache to CPU memory or disk (or discarding it and recomputing it later), continuing with higher-priority requests, and restoring the preempted requests when memory is available. The quality of the preemption policy has a direct effect on the P99 latency distribution, which is frequently the metric that matters most for production SLAs.
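As an illustration of what a preemption policy decides, the sketch below picks victims newest-first until enough blocks are reclaimed. Newest-first is one plausible policy, chosen here because it protects requests that have already consumed the most work; the function name and the tuple layout are assumptions, not any framework's API.

```python
def choose_victims(active, blocks_required):
    """Pick requests to preempt, most recently arrived first.

    active: list of (arrival_time, request_id, blocks_held) tuples
    Returns (victim_ids, blocks_reclaimed). The reclaimed total can fall short
    of blocks_required if preempting everything is still not enough.
    """
    victims = []
    reclaimed = 0
    for _arrival, request_id, blocks in sorted(active, key=lambda r: r[0], reverse=True):
        if reclaimed >= blocks_required:
            break
        victims.append(request_id)
        reclaimed += blocks
    return victims, reclaimed


# Hypothetical active set: three requests holding 10, 4, and 6 blocks.
running = [(1.0, "a", 10), (2.0, "b", 4), (3.0, "c", 6)]
print(choose_victims(running, blocks_required=8))
```

Every preempted request pays for its eviction later, either in swap-in time or in recomputation, and that cost is what shows up in the tail of the latency distribution.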

Prefill-decode disaggregation: the next layer

The research direction that has gained significant traction in the past year: disaggregating prefill and decode into separate worker pools. The insight is that prefill (compute-bound, run once per request) and decode (memory-bandwidth-bound, run once per generated token) have different hardware utilization profiles. Running them on the same GPU, which is the standard configuration, means the two phases compete for one device and neither runs at its best operating point.

Prefill-decode disaggregation sends long-prompt requests to a prefill worker pool sized and configured for compute-intensive batch prefill. That pool generates the initial KV cache and transfers it to a decode worker pool for token generation. The transfer adds latency, but the hardware utilization improvement can more than compensate in throughput-focused workloads.

The practical constraint: KV cache transfer between workers adds network overhead that can dominate for moderate context lengths. The disaggregation wins are clearest for very long contexts (8K+ tokens) where the prefill compute dominates and the one-time transfer cost is small relative to the execution savings. For typical short-to-medium context workloads, disaggregation adds complexity without proportional gain.
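A back-of-envelope estimate of the transfer cost helps decide whether disaggregation is worth evaluating for a given workload. The sketch below is a rough model under stated assumptions: the 320 KiB per-token KV size, the 25 GB/s link, and the 2 ms fixed setup overhead are placeholder values, not measurements, and should be replaced with numbers from your own model and network.

```python
def kv_transfer_seconds(num_tokens, kv_bytes_per_token, link_bytes_per_second,
                        fixed_overhead_s=0.0):
    """Estimated time to ship one request's KV cache from prefill to decode worker."""
    payload_bytes = num_tokens * kv_bytes_per_token
    return fixed_overhead_s + payload_bytes / link_bytes_per_second


KV_BYTES_PER_TOKEN = 320 * 1024  # assumed per-token KV size
LINK_BYTES_PER_SECOND = 25e9     # assumed effective link bandwidth
SETUP_OVERHEAD_S = 0.002         # assumed fixed cost per transfer

for context_tokens in (512, 2048, 8192):
    seconds = kv_transfer_seconds(context_tokens, KV_BYTES_PER_TOKEN,
                                  LINK_BYTES_PER_SECOND, SETUP_OVERHEAD_S)
    print(f"{context_tokens:>5} tokens: {seconds * 1000:.1f} ms")
```

Comparing these estimates against measured prefill time for the same context lengths shows where the transfer becomes a small fraction of the work it is attached to.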

The measurement infrastructure you need to actually tune this

Running a continuous batching serving system without good telemetry is operating blind. The metrics that matter beyond standard latency and error rate: batch size distribution over time (are you frequently running single-request batches due to low load, indicating under-utilization?), KV cache utilization percentage (are you near capacity, indicating memory pressure?), preemption rate (how often is the scheduler evicting active sequences?), prefill queue depth (are long-prompt requests starving decode throughput?), and token generation rate per GPU (normalized throughput that lets you compare configurations).
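If your serving framework does not expose these directly, they can be derived from per-iteration samples. The following is a minimal sketch with a hypothetical `IterationSample` record; the field names are assumptions, and in practice the samples would come from whatever scheduler hooks or logs your framework provides.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class IterationSample:
    """One scheduler iteration on one GPU worker (hypothetical record layout)."""
    batch_size: int
    kv_blocks_used: int
    kv_blocks_total: int
    preemptions: int
    waiting_prefills: int
    tokens_generated: int
    duration_s: float


def summarize(samples):
    """Aggregate per-iteration samples into the tuning metrics discussed above."""
    if not samples:
        raise ValueError("need at least one sample")
    total_time = sum(s.duration_s for s in samples)
    return {
        "mean_batch_size": mean(s.batch_size for s in samples),
        "single_request_fraction": sum(s.batch_size == 1 for s in samples) / len(samples),
        "peak_kv_utilization": max(s.kv_blocks_used / s.kv_blocks_total for s in samples),
        "preemptions_per_iteration": sum(s.preemptions for s in samples) / len(samples),
        "max_prefill_queue_depth": max(s.waiting_prefills for s in samples),
        "tokens_per_second": sum(s.tokens_generated for s in samples) / total_time,
    }


# Two made-up iterations, purely to show the shape of the output.
report = summarize([
    IterationSample(1, 50, 100, 0, 2, 1, 0.02),
    IterationSample(3, 90, 100, 1, 5, 3, 0.03),
])
print(report)
```

Keeping the raw samples rather than only the summary makes it possible to line up a latency spike with the specific iterations where KV utilization peaked or a preemption fired.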

Teams that tune these systems well have instrumented serving at the batch and request level, not just the aggregate level. The aggregate metrics — P50 latency, requests per second — don't tell you whether you're leaving throughput on the floor due to scheduler misconfiguration or whether your P99 spikes are caused by occasional large prefill requests starving the decode path. Those diagnoses require request-level telemetry, which is not always on by default in serving frameworks.

We're not saying continuous batching is complicated to get running — it isn't, and the defaults in mature serving frameworks like vLLM are reasonable starting points. The point is that matching the throughput projections from benchmarks to actual production workloads requires understanding the scheduler mechanics well enough to tune them for your request distribution. That's an engineering investment, not a configuration change, and teams that make it have a durable cost advantage over those that don't.