Model Serving Cost Economics in Early 2026

We want to do a ground-level accounting of where model serving costs stand in early 2026. The two high-level narratives, "inference is getting much cheaper" and "GPU capacity is constrained," are both true, but they describe different parts of the problem, and conflating them obscures what is happening.

The short version: throughput-optimized serving for batch workloads has gotten dramatically cheaper. Latency-sensitive interactive serving has gotten cheaper more slowly, and in some configurations has gotten more expensive as model sizes at the frontier tier have grown. The cost trajectory depends entirely on what you're serving and with what latency requirement, and any single statement about "inference costs" that doesn't specify those two dimensions is not a useful claim.

Where costs have fallen dramatically

Throughput-optimized batch inference — the workload profile where you have many requests to process and latency can range from seconds to minutes — has benefited from three reinforcing cost improvements over the past 18 months. Hardware throughput has increased sharply: H100 to H200 to Blackwell is roughly a 3–4x improvement in raw compute throughput per dollar of hardware cost over that period, depending on how you measure and which specific configuration you benchmark. Software-layer optimizations — continuous batching, FlashAttention variants, speculative decoding — have added another 1.5–2x throughput improvement on top of hardware gains for many common model architectures. And competitive GPU cloud pricing has compressed provider margins enough that the per-token cost passed through to customers has fallen faster than hardware cost alone would predict.

For batch processing workloads — document summarization, classification pipelines, data extraction, embedding generation — the per-token cost of running a capable open-source model on current-generation hardware is roughly one-tenth of what it was at the beginning of 2023. This is a genuine and significant cost improvement. It has made entire categories of AI-augmented workflows economically viable that weren't before. Applications that were running cost-benefit analyses in 2023 and concluding "not yet" are re-running those analyses now and concluding "now."
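As a rough sanity check on how the three improvements described above compound, the sketch below multiplies the hardware and software ranges quoted in the text and backs out how much additional reduction pricing compression would need to contribute to reach a tenfold drop. The function names are illustrative, and treating the factors as independent multipliers is an assumption. The quoted ranges also cover somewhat different time windows, so this is an order-of-magnitude exercise rather than a measurement.

```python
def compound_range(*ranges):
    """Multiply (low, high) improvement ranges, assuming they are independent."""
    low, high = 1.0, 1.0
    for range_low, range_high in ranges:
        low *= range_low
        high *= range_high
    return low, high


def residual_factor(target, combined):
    """Extra improvement needed to reach `target`, given a combined (low, high) range."""
    low, high = combined
    # The optimistic end of the combined range leaves the smallest gap.
    return target / high, target / low


hardware = (3.0, 4.0)  # throughput per hardware dollar, range quoted in the text
software = (1.5, 2.0)  # serving-software optimizations, range quoted in the text

combined = compound_range(hardware, software)
pricing_gap = residual_factor(10.0, combined)

print(f"hardware x software: {combined[0]:.1f}x to {combined[1]:.1f}x")
print(f"left for pricing compression: {pricing_gap[0]:.2f}x to {pricing_gap[1]:.2f}x")
```

Under these assumptions, hardware and software gains alone account for roughly 4.5x to 8x, which leaves somewhere between 1.25x and about 2.2x to come from pricing.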

Where costs have not fallen as dramatically

Interactive serving, the workload where a human is waiting for a response and both time-to-first-token and per-token generation latency directly affect user experience, has a different cost structure. The problem is that optimizing for low latency and optimizing for throughput are fundamentally in tension. Continuous batching improves throughput by processing many requests together in each forward pass, but every request in the batch then shares the GPU with the others, so per-token latency for an individual request rises as the batch grows. Speculative decoding reduces effective generation latency for many request types, but it adds complexity and can increase cost per token for requests where the draft model's proposals are frequently rejected.
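To make the speculative decoding trade-off concrete, here is a minimal sketch of a commonly used simplified model in which each drafted token is accepted independently with probability `alpha`. The parameter names (`alpha`, `gamma`, `draft_cost_ratio`) and the example values are assumptions chosen for illustration, not measurements from any production system. The sketch also assumes that verifying the drafted tokens costs about the same as one ordinary target-model pass.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass when `gamma` tokens are drafted.

    Assumes each drafted token is accepted independently with probability `alpha`.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def relative_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding, measured in units of one target-model pass.

    One speculative step costs `gamma` draft passes plus one target pass.
    `draft_cost_ratio` is the (assumed) cost of a draft pass relative to a target pass.
    """
    step_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Hypothetical settings: 4 drafted tokens, draft model at 10% of target cost.
for alpha in (0.8, 0.5, 0.2):
    speedup = relative_speedup(alpha, gamma=4, draft_cost_ratio=0.1)
    print(f"acceptance {alpha:.1f}: speedup {speedup:.2f}x")
```

A value below 1.0 means the speculative path does more work per emitted token than plain decoding, which is the high-mismatch case described above.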

For frontier-tier models at the sizes that have become standard for quality-sensitive interactive applications — 70B parameter models and above, or multimodal models with similar compute footprints — the per-token cost of serving with p95 time-to-first-token below 500ms on current hardware is meaningfully higher than the batch throughput cost numbers that get cited in "inference is cheap now" discussions. The hardware provisioned for low-latency interactive serving runs at lower utilization than hardware provisioned for batch workloads because you can't fully load a GPU with interactive requests without violating latency commitments during demand spikes. That utilization gap is a direct cost multiplier.
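The utilization argument can be written down as simple arithmetic. The sketch below is illustrative only: the hourly price, peak throughput, and utilization figures are hypothetical placeholders, and the function name is ours. It isolates the utilization effect and ignores the fact that latency-constrained deployments often also run at lower peak throughput.

```python
def cost_per_million_tokens(gpu_hourly_cost, peak_tokens_per_second, utilization):
    """Cost per million tokens when the GPU is busy only `utilization` of the time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical inputs: same GPU and same peak throughput, different utilization.
batch_cost = cost_per_million_tokens(4.0, 2000, utilization=0.9)
interactive_cost = cost_per_million_tokens(4.0, 2000, utilization=0.4)
multiplier = interactive_cost / batch_cost

print(f"batch:       ${batch_cost:.2f} per million tokens")
print(f"interactive: ${interactive_cost:.2f} per million tokens")
print(f"utilization multiplier: {multiplier:.2f}x")
```

With everything else held equal, the multiplier is just the ratio of the two utilization figures, here 0.9 / 0.4.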

We are not saying interactive serving economics are broken. We are saying the cost improvement curve for interactive serving has been slower than for batch serving, and the gap between them is wider than most people outside the infrastructure layer realize. This matters for investment thesis evaluation: a company building infrastructure primarily for batch workloads is operating in a market where cost advantages are harder to sustain because the hardware and software baseline is improving so fast. A company building infrastructure specifically for low-latency interactive serving is addressing a problem where the efficiency gap is larger and the optimization surface is more durable.

The model size problem

The other dimension that makes "inference costs are falling" an incomplete claim is what's happening to frontier model sizes. From 2020 to 2023, the most capable models were growing in parameter count as each new training run tried to push benchmark performance higher. From 2023 onward, the pattern has partially reversed: the published benchmarks for smaller models have improved dramatically as training data quality and training efficiency have improved, and many production use cases that previously required a 70B or 175B parameter model can now be served adequately by a 7B or 13B parameter model that was fine-tuned on domain-specific data.

This smaller-model shift is the primary driver of the cost improvement story in interactive serving. A 7B parameter fine-tuned model serving at low latency is genuinely cheap by 2023 standards. But not all applications have completed this transition. The applications where frontier-tier model quality is still required — complex multi-step reasoning, sophisticated code generation, nuanced summarization of long and specialized documents — are still running on large models that have not gotten proportionally cheaper to serve interactively. The cost improvement story is real; it applies fully to applications that have been able to migrate to smaller fine-tuned models, and partially to applications that haven't.

What this means for the infrastructure investment thesis

The implication for where we invest is that the most durable infrastructure opportunities are those where the cost and performance optimization problem is still genuinely hard — not in the "nobody has tried" sense, but in the "the problem keeps getting harder as the use cases advance" sense.

Fine-tuning infrastructure is one such area. As more applications shift to smaller fine-tuned models, the infrastructure for producing, managing, and serving those models at production scale becomes more critical. The fine-tuning workflow — dataset curation, training run orchestration, evaluation, versioning, deployment — is not handled well by the general-purpose MLOps tooling that was built primarily for training large models from scratch. Companies building first-class tooling specifically for the fine-tuning-to-deployment workflow are solving a problem that is getting more important as the application layer shifts toward model customization.

Low-latency serving optimization is another. The gap between what's theoretically achievable in time-to-first-token for large models and what most production deployments actually achieve is still substantial. The gap exists because production systems make conservative trade-offs — over-provisioning hardware to handle demand spikes, using batch sizes that leave some latency headroom, running older kernel implementations that are easier to maintain than cutting-edge optimized ones. Companies that can close this gap for specific model architectures and hardware configurations are building something that continues to matter even as the hardware baseline improves.

The infrastructure layer in AI is not a static market. The problem set advances as the application layer advances. That continuous advancement is exactly the property that makes it a place to invest over a decade-long fund horizon, rather than a place to make a point-in-time bet that the current generation of tools will age well.

What we watch for in the next twelve months

Two technical developments will have outsized impact on serving economics through 2026. The first is how Blackwell-class hardware deploys at scale: specifically, whether the per-token cost improvements on inference workloads match the theoretical compute throughput improvements, or whether memory bandwidth limitations constrain real-world gains for large model serving. The second is how speculative decoding matures for production deployment. The theoretical speedups are compelling, but production reliability concerns and the overhead of maintaining and updating draft models have slowed adoption. If those reliability and maintenance problems are solved in the next 12 months, low-latency interactive serving costs could fall faster than the current trajectory suggests.
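The first question, compute gains versus memory bandwidth limits, can be illustrated with a simplified roofline-style estimate of one decoding step. This is a sketch under stated assumptions: the hardware figures are hypothetical rather than vendor specifications, weights are assumed to be 2 bytes per parameter, and KV-cache traffic, multi-GPU sharding, and communication overhead are all ignored.

```python
from typing import NamedTuple


class Hardware(NamedTuple):
    flops_per_second: float   # usable compute throughput
    bytes_per_second: float   # memory bandwidth


def decode_step_seconds(params, batch_size, hw, bytes_per_param=2.0):
    """Lower bound on one decode step: the slower of compute and weight reads."""
    compute_time = 2.0 * params * batch_size / hw.flops_per_second
    memory_time = params * bytes_per_param / hw.bytes_per_second
    return max(compute_time, memory_time)


def generation_gain(params, batch_size, old_hw, new_hw):
    """How much faster a decode step gets when moving from old_hw to new_hw."""
    old_time = decode_step_seconds(params, batch_size, old_hw)
    new_time = decode_step_seconds(params, batch_size, new_hw)
    return old_time / new_time


# Hypothetical generations: 3x more compute, but only 2x more memory bandwidth.
old_gen = Hardware(flops_per_second=1e15, bytes_per_second=3e12)
new_gen = Hardware(flops_per_second=3e15, bytes_per_second=6e12)
model_params = 70e9

for batch_size in (1, 1024):
    gain = generation_gain(model_params, batch_size, old_gen, new_gen)
    print(f"batch size {batch_size}: {gain:.1f}x faster per step")
```

In this toy model the small-batch, latency-sensitive case improves only by the bandwidth ratio, while the large-batch case improves by the compute ratio, which is the scenario the first question is asking about.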

Neither of these developments would fundamentally change the investment thesis — they would shift the rate at which specific optimization approaches become commoditized and clarify which remaining efficiency gaps are worth building companies around. Staying close to the hardware evolution and the optimization research is essential for making those calls correctly. It's one of the things that makes running an infrastructure-focused fund technically demanding in a way that a pure SaaS fund isn't — and it's one of the reasons we spend time in the lab and with engineering teams rather than primarily in board rooms.