Inference Optimization: State of the Art and Where It Goes Next

Three years ago, "inference optimization" meant quantizing your model to INT8 and enabling batching. Today the stack has grown into a multi-layer discipline with techniques operating at every level from kernel design to serving topology. This piece maps the current state — what's production-ready, what's still research, and where the next meaningful gains are likely to come from — as of mid-2025.

This is not a comprehensive survey. It's a practitioner's view from watching infrastructure companies in our portfolio and the broader ecosystem build and deploy these techniques. The emphasis is on what works at production scale and under what conditions, not on what produces the best benchmark numbers in controlled experiments.

Layer 1: Model-level optimization (now table stakes)

Post-training quantization to INT8 and FP8 is effectively standard practice for production LLM deployments. The quality tradeoff for typical text generation tasks is minimal (typically 1-3% benchmark degradation), and the memory and compute savings are substantial. The open-source tooling (GPTQ, AWQ, LLM.int8()) is mature enough that teams without specialized optimization expertise can apply it reliably.
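
To make the basic operation concrete, here is a minimal sketch of symmetric round-to-nearest INT8 quantization in plain Python. It is a simplification for illustration, not what GPTQ or AWQ do: those methods add calibration data and finer-grained scaling on top of this idea. The function names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor (assumption: real tools usually scale per channel or per group).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to show the rounding behaviour.
weights = [0.82, -1.27, 0.003, 0.45, -0.61]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The worst-case rounding error is half the scale, so a single large outlier weight inflates the error for every other value in the tensor. That is one reason production methods scale per channel or per group rather than per tensor.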

FP8 deserves specific mention: Hopper-class GPUs (H100) have native FP8 GEMM support, which delivers roughly 2x the peak throughput of FP16 GEMM at the hardware level. Paired with FP8 quantization, this lets teams on H100 infrastructure achieve significantly higher token generation rates than the same model on A100, without code changes at the application layer. This hardware-software co-optimization wasn't available two years ago and is now mainstream.

INT4 via GPTQ and AWQ remains in the "use with caution" category for most production workloads. The quality degradation is workload-specific and hard to predict without domain-specific evaluation. For non-interactive batch processing where latency requirements are relaxed, INT4 can be the right tradeoff. For interactive applications with quality-sensitive outputs, the risk of tail quality degradation is usually not worth the memory savings over FP8.

Layer 2: Serving infrastructure (the current battleground)

Continuous batching with PagedAttention is now the baseline for any serious LLM serving deployment. The improvement over static batching is unambiguous and well-documented. The competitive differentiation in serving infrastructure has moved up the stack: scheduler design, preemption policies, and tensor parallelism configurations are where meaningful performance differences emerge between well-tuned and poorly-tuned deployments.
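
A toy simulation shows where the improvement over static batching comes from. The sketch below counts decode steps only and assumes every request fits in memory, so it ignores prefill and the block-level memory management that PagedAttention provides. The function names and the request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes; slots idle in the meantime."""
    total = 0
    for i in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[i:i + batch_size])
    return total


def continuous_batching_steps(output_lengths, batch_size):
    """Finished requests free their slot immediately and waiting requests join mid-flight."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every active request.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens): mostly short, with two long requests.
lengths = [2, 10, 3, 2, 9, 2, 3, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With these lengths the static schedule needs 19 decode steps and the continuous schedule needs 11. The gap grows with the variance in output length, which is why the benefit is largest on mixed workloads.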

Speculative decoding has moved from research paper to production deployment across several teams. The production numbers align reasonably with the research claims for appropriate workloads: 1.5-3x wall-clock latency improvement on latency-sensitive paths for tasks with predictable output structure. The implementation complexity is real and non-trivial: keeping draft and target models in memory simultaneously, monitoring draft acceptance rates, and handling multi-GPU configurations where draft and target models may be sharded differently. Teams that have shipped it in production generally report that the engineering cost was higher than expected but that the user experience improvement was worth it for interactive applications where per-token decode latency matters (speculative decoding speeds up the decode phase, not prefill, so it helps inter-token latency rather than TTFT).
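
The dependence on "predictable output structure" can be made quantitative with the standard analysis from the speculative decoding literature. The sketch below assumes each drafted token is accepted independently with probability alpha, that the draft model proposes gamma tokens per round, and that one draft forward pass costs a fixed fraction of a target pass. All three are modelling assumptions, and the parameter values are illustrative rather than measured.

```python
def expected_tokens_per_target_pass(alpha, gamma):
    """Expected tokens emitted per target-model pass, given acceptance rate alpha
    and gamma drafted tokens (assumes independent, identically distributed acceptance)."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, draft_cost_ratio):
    """Speedup over plain autoregressive decoding.
    draft_cost_ratio: cost of one draft pass relative to one target pass (assumed constant)."""
    cost_per_round = gamma * draft_cost_ratio + 1
    return expected_tokens_per_target_pass(alpha, gamma) / cost_per_round


# Illustrative values: a predictable workload versus a less predictable one.
predictable = expected_speedup(alpha=0.8, gamma=4, draft_cost_ratio=0.05)
unpredictable = expected_speedup(alpha=0.5, gamma=4, draft_cost_ratio=0.05)
print(round(predictable, 2), round(unpredictable, 2))
```

Under these assumptions the speedup falls from about 2.8x at an 80% acceptance rate to about 1.6x at 50%, which is why acceptance rate monitoring matters in production.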

Tensor parallelism and pipeline parallelism across multiple GPUs are now standard for models that don't fit in a single GPU's HBM. The sweet-spot configurations (how many GPUs for a given model size, tensor vs. pipeline vs. expert parallelism) are better understood now than they were 18 months ago, with clearer community knowledge about the trade-offs at different model sizes and request volumes.

Layer 3: KV cache management (the emerging frontier)

As context windows have grown — 128K token contexts are now routine in production, with some deployments targeting 1M+ — KV cache management has become the dominant challenge in serving architecture. The memory footprint of KV caches for long-context requests can easily exceed the model weights themselves, and the scheduling implications are significant.
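
A back-of-envelope calculation shows how quickly the cache grows. The sketch below uses the usual per-token formula (keys plus values, for every layer and KV head) with a hypothetical 8B-parameter model using grouped-query attention. The layer count, head count, head dimension, and FP16 storage are assumptions chosen for illustration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Keys and values (factor of 2) stored for every layer and every KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, FP16 values.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2
)

context_tokens = 128 * 1024             # one 128K-token request
kv_gib = per_token * context_tokens / 2**30
weights_gib = 8e9 * 2 / 2**30           # 8B parameters stored in FP16

print(per_token, kv_gib, round(weights_gib, 1))
```

Under these assumptions each token costs 128 KiB of cache, so a single 128K-token request holds 16 GiB of KV state against roughly 14.9 GiB of weights. A handful of concurrent long-context requests is enough to make the cache, not the model, the binding memory constraint.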

Prefix caching is the current production-ready technique: when multiple requests share a common prefix (system prompt, document context), the KV cache for that prefix is computed once and reused across requests. The hit rate depends heavily on workload structure. For applications with a fixed system prompt and variable user turns, prefix caching can reduce prefill compute by 40-60%, a substantial operational cost reduction. For workloads with highly varied prompts and minimal shared context, it provides little benefit.
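
The mechanism can be sketched as a block-level lookup keyed on the token prefix. The toy class below only counts how many tokens would need prefill. It stores no actual KV tensors and never evicts, and the class name, block size, and token IDs are all hypothetical.

```python
class PrefixCache:
    """Toy prefix cache: remembers which block-aligned token prefixes have been computed."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached = set()  # each entry is a full token prefix ending on a block boundary

    def prefill(self, tokens):
        """Return the number of tokens whose KV entries must be computed for this request."""
        reused = 0
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self.cached:
                reused = end          # this whole prefix is already available
            else:
                self.cached.add(key)  # computed now, reusable by later requests
        return len(tokens) - reused


# Hypothetical token IDs: a 12-token system prompt followed by different user turns.
system_prompt = list(range(100, 112))
cache = PrefixCache(block_size=4)
first = cache.prefill(system_prompt + [1, 2, 3, 4, 5])
second = cache.prefill(system_prompt + [6, 7, 8])
print(first, second)
```

The first request computes all 17 tokens. The second reuses the three cached system-prompt blocks and computes only its 3 new tokens. Only block-aligned prefixes are reusable in this sketch, so a prompt that differs in its first block gets no benefit at all.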

KV cache offloading — moving inactive KV cache from GPU HBM to CPU DRAM or NVMe storage — extends the effective context capacity but introduces I/O latency that has to be managed carefully. The operational design decisions around offloading (threshold for eviction, prefetch strategy on cache restore, bandwidth allocation between compute and I/O) are not yet well-standardized, and teams building long-context serving are largely doing original engineering rather than applying established patterns.
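
As one possible starting point for those design decisions, the sketch below models a two-tier store with least-recently-used eviction from GPU to CPU memory. It is an illustration under simplifying assumptions: capacity is counted in whole requests rather than bytes, transfers are counted rather than timed, and there is no prefetching. The class and its methods are hypothetical, not the API of any serving framework.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier KV store: a small 'GPU' tier backed by a larger 'CPU' tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity  # assumed to be at least 1
        self.gpu = OrderedDict()          # request_id -> payload, least recently used first
        self.cpu = {}
        self.transfers = 0                # number of copies between tiers

    def _make_room(self):
        # Evict least recently used entries to the CPU tier until one slot is free.
        while len(self.gpu) >= self.gpu_capacity:
            victim, payload = self.gpu.popitem(last=False)
            self.cpu[victim] = payload
            self.transfers += 1

    def put(self, request_id, payload):
        if request_id in self.gpu:
            self.gpu[request_id] = payload
            self.gpu.move_to_end(request_id)
            return
        self.cpu.pop(request_id, None)
        self._make_room()
        self.gpu[request_id] = payload

    def get(self, request_id):
        if request_id in self.gpu:
            self.gpu.move_to_end(request_id)  # mark as recently used
            return self.gpu[request_id]
        payload = self.cpu.pop(request_id)    # raises KeyError for unknown requests
        self._make_room()
        self.gpu[request_id] = payload
        self.transfers += 1                   # restore copy from CPU back to GPU
        return payload


store = TieredKVStore(gpu_capacity=2)
for request_id in ["a", "b", "c"]:
    store.put(request_id, payload=f"kv-{request_id}")
restored = store.get("a")  # "a" was evicted when "c" arrived, so it must be restored
print(list(store.gpu), list(store.cpu), store.transfers)
```

Restoring one request here costs two transfers, the eviction of another entry plus the copy back. That is the kind of I/O amplification that eviction thresholds and prefetch strategies are meant to control.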

Layer 4: Kernel and compiler optimization (high ceiling, specialized access)

FlashAttention — and its successors FlashAttention-2 and FlashAttention-3 — have become standard components in the serving stacks of every serious LLM deployment. The memory-efficient attention computation they enable is not optional at production scale for large models. They're open-source, well-maintained, and integrated into the major serving frameworks.

Beyond FlashAttention, custom CUDA kernel development for specific model operations (fused softmax, custom activation functions, quantized GEMM configurations) is where teams with dedicated kernel engineers differentiate. This is genuinely specialized work — good CUDA kernel development requires deep understanding of GPU architecture and memory hierarchy — and the gap between a model served with well-optimized kernels versus one served with stock implementations can be 1.5-2x in throughput for memory-bandwidth-bound operations.

Compiler approaches — using TVM, XLA, or torch.compile to automatically optimize model graphs for specific hardware targets — have matured significantly. The automatic optimizations are meaningful (10-30% improvement in many cases) without requiring kernel expertise. The ceiling is lower than hand-optimized kernels, but the development cost is dramatically lower, making compiler-based optimization the right tradeoff for most teams.

Where the next decade of gains comes from

The honest answer is: hardware and systems co-evolution, not any single software technique. GB200 NVLink systems, with their dramatically higher chip-to-chip bandwidth, will unlock serving configurations that aren't viable on current hardware — particularly for large-scale multi-GPU inference where current interconnect bandwidth limits parallelism efficiency.

On the software side, the most interesting frontier is serving architectures designed for compound AI systems — multi-step pipelines with retrieval, tool use, and generation stages that need to be scheduled efficiently as a unit rather than as independent requests. The current generation of serving infrastructure is optimized for single-model request-response workloads. As production AI systems become more complex, the serving layer has to evolve to match that complexity.

We're not saying the single-model serving problem is solved — there are still meaningful efficiency gains available in scheduling, KV cache management, and kernel optimization. But the next major category of infrastructure companies in inference will likely be built around serving complexity rather than serving efficiency for simple request-response workloads. That's where we're focused for the second half of Fund II deployment.