The Inference Cost Inflection Point

The cost to run one million tokens of LLM inference has fallen by roughly two orders of magnitude since GPT-3's release in 2020. That curve is not slowing down — it is accelerating. Understanding the drivers of this deflation, and which ones have diminishing returns versus which ones are still early, matters enormously for how you think about the next generation of AI applications and the infrastructure they will require.

We have been tracking this closely because it changes the investment landscape in inference infrastructure in non-obvious ways. Falling inference cost is not a headwind for inference infrastructure companies — it is a tailwind. Here is why, and here is where we think the curve goes next.

The three drivers of inference cost reduction

Inference cost has fallen through three largely independent mechanisms:

Hardware improvements. The A100 to H100 transition delivered a roughly 3x performance improvement on transformer workloads. The H100's HBM3 memory and NVLink interconnects eased the memory bandwidth bottleneck that dominates large model inference. AMD's MI300X provides competitive performance at lower acquisition cost for certain workloads. The hardware curve will continue: the Blackwell generation promises another 2–4x improvement in inference throughput at FP8 precision. These gains compound with software optimizations rather than replacing them.

Software optimization: batching and scheduling. A naive LLM serving implementation serves one request at a time, leaving most of the GPU idle during memory reads and generation steps. Continuous batching — the technique of dynamically adding new requests to in-flight batches as previous requests complete, rather than waiting for the entire batch to finish — can improve throughput by 4x to 8x without any hardware change. PagedAttention, introduced with the vLLM system, solved the memory management problem that made continuous batching impractical at scale by treating the KV cache as paged virtual memory rather than pre-allocated contiguous buffers.
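To make the batching argument concrete, here is a toy simulation that counts decode steps under both policies. It is a sketch under simplifying assumptions, not a model of any real serving system: every decode step is assumed to cost the same regardless of how many slots are occupied (a rough stand-in for memory-bound decoding), prefill is ignored, and the function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when every batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(output_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        # Requests with one token left finish now and release their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [4, 1, 1, 1]  # hypothetical output lengths, in tokens
print(static_batching_steps(lengths, batch_size=2))      # 5
print(continuous_batching_steps(lengths, batch_size=2))  # 4
```

The gap widens as output lengths become more uneven, because static batching holds every slot hostage to the longest request in the batch.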

Model compression: quantization and distillation. Running models in 8-bit or 4-bit integer precision instead of 16-bit float reduces memory footprint by 2x to 4x, allowing more model parameters to fit in GPU memory and improving memory bandwidth utilization. For many workloads, the quality degradation from INT8 quantization is below measurable thresholds. INT4 quantization has larger quality impacts but is becoming acceptable for latency-insensitive workloads or cases where smaller fine-tuned models can recover the quality gap.
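The memory arithmetic, and the basic idea behind integer quantization, can be sketched in a few lines. This is a minimal illustration assuming symmetric per-tensor INT8 quantization; production methods such as GPTQ and AWQ are considerably more sophisticated. The 7-billion-parameter model size and the sample weights are hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to hold the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


params = 7e9  # hypothetical model size
for bits in (16, 8, 4):
    print(bits, "bits:", weight_memory_gb(params, bits), "GB")

weights = [0.5, -1.27, 0.01, 1.0]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding error per weight is bounded by half a quantization step (scale / 2).
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-12)
```

The footprint shrinks in direct proportion to the bit width, which is where the 2x to 4x figure comes from; the quality question is whether the rounding error stays small relative to what the model can tolerate.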

Where we are on each curve

Hardware improvements are on a reliable multi-year cycle that is unlikely to stall in the near term. The memory bandwidth bottleneck that limits transformer inference throughput is being addressed by HBM technology advances, and that roadmap is reasonably predictable.
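A back-of-envelope sketch shows why memory bandwidth sets the ceiling. During decoding, each generated token requires reading the model weights from GPU memory roughly once, so bandwidth divided by weight size gives an upper bound on single-sequence decode speed. The bound deliberately ignores KV cache reads, compute time, and batching, and the bandwidth and model-size figures below are illustrative round numbers, not the specifications of any particular GPU.

```python
def decode_tokens_per_sec_upper_bound(weight_gb, bandwidth_gb_per_s):
    """Upper bound on single-sequence decode speed when weight reads dominate."""
    return bandwidth_gb_per_s / weight_gb


# Hypothetical: 14 GB of 16-bit weights on a GPU with 2800 GB/s of bandwidth.
print(decode_tokens_per_sec_upper_bound(14, 2800))  # 200.0

# Halving the weight size (for example, 8-bit weights) doubles the bound.
print(decode_tokens_per_sec_upper_bound(7, 2800))   # 400.0
```

Under this simplified model, a faster HBM generation and a smaller weight footprint raise the same ceiling, which is one way to see why hardware and compression gains compound.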

Software optimization for the current generation of transformer architectures is more mature than it was two years ago but far from exhausted. Speculative decoding, in which a smaller draft model proposes several tokens ahead and the full model then verifies them in a single parallel pass, is still deployed in relatively few production systems. RadixAttention and prefix caching, which help systems that serve many requests with shared prefixes (system prompts, RAG contexts), are still being implemented by most serving frameworks. These gains are available but not yet captured by the majority of production deployments.
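The control flow of speculative decoding is easier to see in code. The sketch below covers only the greedy variant, with toy stand-in "models" that map a token list to the next token; all function names are hypothetical. A real system verifies every drafted position in one batched forward pass of the full model, which this toy simulates with a counter rather than actual parallelism, and sampling-based variants use a probabilistic acceptance rule not shown here.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one full-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, simulated full-model passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # The cheap draft model proposes k tokens, one after another.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # The full model checks all k positions; counted as a single pass.
        target_passes += 1
        accepted = []
        for guess in proposal:
            expected = target_next(tokens + accepted)
            if guess != expected:
                accepted.append(expected)  # keep the full model's correction
                break
            accepted.append(guess)
        else:
            # Every guess matched, so the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def toy_target(tokens):
    """Stand-in for the full model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    """Stand-in for the draft model: agrees with the target except after a 2."""
    return 9 if tokens[-1] == 2 else (tokens[-1] + 1) % 10


baseline = greedy_decode(toy_target, [0], 12)
output, passes = speculative_decode(toy_target, toy_draft, [0], 12)
print(output == baseline)  # True: the output matches plain greedy decoding
print(passes)              # fewer full-model passes than the 12 the baseline needs
```

The property that matters is that the output is identical to what the full model would have produced on its own; the draft model only changes how many full-model passes are needed to get there.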

Model compression is earlier in its practical deployment curve than most people realize. GPTQ and AWQ for weight-only quantization are reasonably mature. Activation-quantization methods (SmoothQuant, FP8 quantization) are being deployed at scale by only a handful of teams. Running speculative decoding with quantized draft models in production remains a largely unsolved deployment problem. There is meaningful headroom here.

The inflection point dynamic

When inference cost falls, two things happen that look like a contradiction but are not: the per-unit cost goes down, but total inference spend goes up because new applications become economically feasible. This is Jevons' paradox applied to compute: as steam engines burned coal more efficiently, total coal consumption rose, because efficiency made new uses of coal economically viable.
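One way to see why the two effects are compatible is a constant-elasticity demand model, which is our simplifying assumption here rather than a measured property of the inference market. If demand grows faster than price falls (elasticity above 1), total spend rises as unit cost drops. The elasticity values and prices below are hypothetical.

```python
import math


def total_spend(price, base_price, base_demand, elasticity):
    """Spend when demand scales as (base_price / price) ** elasticity."""
    demand = base_demand * (base_price / price) ** elasticity
    return price * demand


base = total_spend(20.0, 20.0, 1.0, 1.5)  # spend at the starting price
after = total_spend(0.2, 20.0, 1.0, 1.5)  # price falls 100x

# With elasticity 1.5, a 100x price drop yields 1000x demand and 10x spend.
print(math.isclose(after / base, 10.0))

# With elasticity exactly 1, spend stays flat no matter how far price falls.
print(math.isclose(total_spend(0.2, 20.0, 1.0, 1.0), total_spend(20.0, 20.0, 1.0, 1.0)))
```

The Jevons dynamic corresponds to the first case: efficiency gains expand usage by more than they cut the unit price.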

We are at the early stage of this dynamic with LLM inference. At $20 per million tokens (rough 2023 pricing for GPT-4-class models), many latency-sensitive or cost-sensitive applications are economically impossible. At $1–2 per million tokens, which is achievable now with optimized open-weight serving, many of those applications become viable. At $0.05–0.10 per million tokens, which is plausibly achievable in 18–24 months through compounded hardware and software improvements, the set of economically viable applications expands dramatically to include things like real-time document processing, inline code review, voice interfaces with high turn frequency, and AI-assisted backend services that currently use traditional rule engines.
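The arithmetic behind that projection is simple to check. The sketch below computes how large a combined improvement is needed to move between two price points, and what price results from compounding several independent gains. The individual factors are purely illustrative assumptions, not forecasts, and real gains rarely multiply this cleanly.

```python
import math


def required_improvement(current_cost, target_cost):
    """Combined improvement factor needed to move from current to target cost."""
    return current_cost / target_cost


def projected_cost(current_cost, improvement_factors):
    """Cost after applying independent multiplicative improvements."""
    return current_cost / math.prod(improvement_factors)


# Moving from $1.50 to $0.075 per million tokens needs roughly a 20x gain.
print(round(required_improvement(1.50, 0.075), 2))  # 20.0

# Hypothetical decomposition of a 20x gain into four separate factors.
factors = {
    "hardware generation": 2.5,
    "lower-precision serving": 2.0,
    "speculative decoding": 2.0,
    "prefix caching": 2.0,
}
print(round(projected_cost(1.50, factors.values()), 3))  # 0.075
```

No single factor in this illustration exceeds 2.5x, which is the point: the target depends on stacking several moderate gains rather than on one breakthrough.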

Each order-of-magnitude reduction in inference cost creates roughly an order of magnitude of new application surface. The infrastructure companies serving that expanded surface are operating in a growing market, not a shrinking one.

What falling cost means for infrastructure companies

The objection we hear is: "if inference cost keeps falling, won't the need for inference optimization companies also fall?" The answer is no, for a structural reason. The optimization gap — the difference between what a naive deployment achieves and what a fully optimized deployment achieves — is not shrinking as absolute costs fall. If anything, it is growing, because each new hardware generation and new model architecture introduces new optimization opportunities that take time to fully exploit.

The teams at companies building inference infrastructure are not solving the same optimization problem repeatedly; they are solving a continuously advancing set of optimization problems as the hardware substrate and model architectures evolve. That continuous advancement is a feature, not a bug — it is what makes the field technically defensible. The organizations with the deepest operational knowledge of production inference systems will continue to create value even as absolute cost floors move down.