Every inference system eventually forces a concrete choice: do you optimize for the time a single request takes end-to-end, or do you optimize for how many requests per second you can serve? In practice you're always doing both, but they pull in opposite directions, and the team that fails to understand which one matters more for their use case will over-engineer the wrong thing.

This is not a product decision. It's a systems design decision that gets baked into architecture long before you have the load numbers to validate it. I've seen teams at the platform engineering level — including a few we work closely with — make expensive mistakes here because they conflated "low latency" with "fast inference" and "high throughput" with "efficient inference." Those are related but not the same, and the gap between them matters enormously at scale.

Why the trade-off is structural, not tunable

The core tension sits at the batching layer. GPUs are fundamentally throughput-optimized hardware. Their compute efficiency (the fraction of peak FLOP/s spent on useful work) increases with batch size, because the cost of reading model weights from memory is amortized across every request in the batch. A single inference request sent to an A100 in isolation uses a small fraction of the card's theoretical throughput. Pack 32 requests into a batch and you start to approach meaningful utilization.

But batching adds latency. You either wait for a batch to fill (static batching, now largely replaced), or you implement continuous batching where new requests join in-flight batches at iteration boundaries. Even with continuous batching, the queueing dynamics mean that high-throughput operation necessarily involves non-zero wait time for individual requests. This is not a bug in the implementation — it's a consequence of how the hardware works.

The inverse is equally true. If you optimize purely for P50 or P99 latency — dispatch requests immediately, minimize queue depth, run small or no batches — you leave utilization on the floor. For a real-time application where someone is waiting for a response, that might be the right call. For an async processing workload where you're synthesizing documents in the background, paying 3x the compute cost per token to shave 200ms off each request is irrational.
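A toy cost model makes the tension concrete. The sketch below assumes each decode iteration costs a fixed amount (weight reads, kernel launches) plus a small per-request increment. The function names and the `fixed_ms` and `per_request_ms` values are illustrative assumptions, not measurements from any real system.

```python
def decode_step_ms(batch_size, fixed_ms=20.0, per_request_ms=0.5):
    """Hypothetical cost of one decode iteration in milliseconds.

    Assumption: a fixed cost that does not depend on batch size, plus a
    linear per-request cost. Real hardware is not this tidy.
    """
    return fixed_ms + per_request_ms * batch_size


def tokens_per_second(batch_size, **cost):
    # Each iteration emits one token for every request in the batch.
    return batch_size * 1000.0 / decode_step_ms(batch_size, **cost)


def per_token_latency_ms(batch_size, **cost):
    # Every request in the batch waits for the whole iteration to finish.
    return decode_step_ms(batch_size, **cost)


if __name__ == "__main__":
    for b in (1, 8, 32, 128):
        print(f"batch={b:4d}  tokens/s={tokens_per_second(b):8.1f}  "
              f"ms/token={per_token_latency_ms(b):6.1f}")
```

Under these assumptions aggregate throughput rises steeply with batch size while per-token latency for each individual request also rises, which is the shape of the trade-off described above. Queueing delay before a request joins a batch is not modeled here and would add to the latency side.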

The request profile changes everything

The second-order insight that teams miss: the optimal point on the latency-throughput curve is not fixed; it shifts with request characteristics. Short prompts with long outputs (e.g., code generation) have a different profile than long prompts with short outputs (e.g., document classification). Streaming responses — where you're sending tokens to the client as they're generated — have different latency semantics than wait-for-complete responses.

A team running a coding assistant has to care about time-to-first-token (TTFT) because perceived responsiveness starts the moment the cursor blinks. A team running batch document processing at night cares almost exclusively about tokens per second per dollar. These are not "different product requirements" — they require fundamentally different serving configurations on the same underlying hardware.

One of the companies we've spent time with on this problem, in the real-time recommendation space, discovered after months of tuning their serving layer that the latency variance (P99 vs. P50 spread) mattered more for their use case than absolute P50. Their application had a hard SLA: responses needed to arrive within a consistent window. Occasional spikes to 400ms from a 60ms baseline were causing application-layer failures even though median performance looked fine. The fix was not faster hardware or a better model; it was rethinking how they bounded queue depth and shed load under contention.
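To illustrate what bounding queue depth and shedding load can look like, here is a minimal sketch of an admission queue. It is not that team's implementation; the class name, `max_depth`, and `max_wait_s` are hypothetical, and a production version would need locking and metrics.

```python
import collections
import time


class BoundedAdmissionQueue:
    """Hypothetical admission queue that rejects work instead of letting
    wait time grow without bound."""

    def __init__(self, max_depth, max_wait_s):
        self.max_depth = max_depth
        self.max_wait_s = max_wait_s
        self._items = collections.deque()  # entries are (enqueue_time, request)

    def offer(self, request, now=None):
        """Return True if admitted, False if shed because the queue is full."""
        now = time.monotonic() if now is None else now
        if len(self._items) >= self.max_depth:
            return False
        self._items.append((now, request))
        return True

    def take(self, now=None):
        """Return the oldest request still within max_wait_s, else None.

        Requests that have waited longer than max_wait_s are dropped, on the
        assumption that the caller has already timed out.
        """
        now = time.monotonic() if now is None else now
        while self._items:
            enqueued_at, request = self._items.popleft()
            if now - enqueued_at <= self.max_wait_s:
                return request
        return None
```

The design choice is to fail fast: a rejected request costs the caller a retry or a fallback, while an admitted request gets a bounded wait. That trades a small amount of goodput for a tighter P99.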

Speculative decoding as a partial escape hatch

Speculative decoding, where a smaller draft model generates candidate tokens that the larger target model then verifies, has emerged as one of the more practical techniques for improving latency without proportionally sacrificing throughput. The intuition: many tokens in a sequence are predictable enough that a cheap draft model gets them right 70-80% of the time. The expensive model validates the candidates in a single parallel pass rather than generating them one at a time, producing a speedup that grows with the acceptance rate.

The catch is that speculative decoding adds implementation complexity and its gains are uneven. For tasks where the output is highly predictable (simple Q&A, template filling), the acceptance rate is high and gains are real — 2-3x wall-clock speedup on latency-sensitive paths. For creative generation or tasks with high output entropy, acceptance rates drop and the overhead of running a draft model continuously can reduce throughput relative to baseline. It's not a universal win; it's a workload-specific optimization that requires profiling to apply correctly.
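A back-of-the-envelope estimate shows why the gains are uneven. The sketch below uses a common simplification: each drafted token is accepted independently with probability `alpha`, the draft proposes `k` tokens per step, and one draft forward pass costs `draft_cost_ratio` times one target forward pass. All three parameters and both function names are assumptions for illustration, and real acceptance behavior is not independent across positions.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target-model verification pass.

    Under the independence assumption this is the geometric sum
    1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, draft_cost_ratio):
    """Rough single-sequence latency speedup over plain decoding.

    Each step costs one target pass plus k draft passes, measured in
    units of one target pass.
    """
    step_cost = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 0.8):
        print(f"alpha={alpha:.1f}  speedup={estimated_speedup(alpha, 4, 0.1):.2f}")
```

With a low acceptance rate or an expensive draft model the estimate falls below 1.0, meaning speculation is slower than plain decoding. That is the case the profiling step is meant to catch before the technique is enabled.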

Where most teams get the measurement wrong

The measurement question is underappreciated. Teams benchmark their inference systems by averaging request latency across a load test, look at mean throughput, and call it done. This hides the dynamics that actually matter in production.

What you need to measure: TTFT separately from inter-token latency (ITL), because they have different causes and different fixes. P99 separately from P50, because the tail matters for user-facing applications. Throughput under sustained load versus burst load, because queue behavior under steady state is different from behavior when load spikes. KV cache hit rates, because cache pressure changes latency characteristics non-linearly as context lengths grow.
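As a minimal sketch of the first two measurements, the helpers below split a request's token timestamps into TTFT and ITL and compute nearest-rank percentiles. The function names and the input format (a send time plus a list of token arrival times) are assumptions; a real load test would pull these from its own tracing.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_and_itl(request_sent_at, token_times):
    """Split one request's timeline into TTFT and inter-token gaps.

    token_times holds the arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("token_times must be non-empty")
    ttft = token_times[0] - request_sent_at
    itls = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itls


if __name__ == "__main__":
    # Illustrative latencies in ms: mostly fast, with a few slow outliers.
    latencies = [60] * 98 + [400] * 2
    print("mean:", sum(latencies) / len(latencies))
    print("p50: ", percentile(latencies, 50))
    print("p99: ", percentile(latencies, 99))
```

In the illustrative data the mean and P50 both sit close to the fast path, while P99 lands on the slow outliers. A report containing only the first two numbers would hide the spikes entirely.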

We're not saying average latency is a meaningless metric; it's fine as a starting point. But teams that ship to production with only mean or P50 data are going to get surprised by tail behavior at scale.

The hardware-software co-design implication

One thing that's become clear watching inference infrastructure companies operate: the latency-throughput trade-off is increasingly a hardware selection problem as much as a software optimization problem. H100s offer higher memory bandwidth than A100s, which reduces the memory-bound portion of the trade-off for large models. AMD's MI300X has a significantly larger HBM pool, which enables longer context lengths without memory pressure artifacts that drive latency spikes.
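The link between HBM capacity and context length can be estimated with the standard KV cache sizing arithmetic: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below is a rough estimate only. The model shape in the usage example is a hypothetical configuration, and it ignores weights, activations, and allocator overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache footprint in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit storage (fp16 or bf16).
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)
    for seq_len in (4096, 32768, 131072):
        gib = kv_cache_bytes(seq_len=seq_len, **shape) / 2**30
        print(f"seq_len={seq_len:7d}  kv_cache={gib:6.2f} GiB per sequence")
```

Because the footprint scales linearly with both sequence length and batch size, a card with more memory can hold either longer contexts or more concurrent sequences before the scheduler has to evict or preempt, which is where the latency spikes come from.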

But hardware alone doesn't solve the problem — the serving layer has to be tuned to the hardware's characteristics. A continuous batching implementation that worked well on A100 may leave throughput on the floor on H100 if the batch scheduling logic wasn't updated to account for different memory bandwidth profiles. This is why we see infra teams spending significant engineering time on hardware-specific optimizations even after shipping a working system on previous-generation hardware.

The teams that understand this loop — application latency requirement → serving configuration → hardware selection → KV cache management → measurement feedback → back to serving configuration — are the ones that avoid replatforming at every scale milestone. Getting these decisions right early is genuinely difficult, and the cost of getting them wrong compounds as request volume grows.