Inference vs. Training: Where Value Accrues Over Time

The capital allocation story in AI infrastructure has been overwhelmingly skewed toward training. OpenAI, Google DeepMind, Anthropic, and Meta have collectively spent billions of dollars on training compute. The narrative that drove that investment was essentially correct: training breakthrough-scale models required a concentration of capital that most organizations couldn't assemble, and the organizations that could would have a capability advantage.

But training is a one-time event. Inference is every event after that. And the economics of running AI applications at scale are almost entirely determined by what happens in inference — not training. This distinction matters for how you think about where infrastructure value will accumulate over the next decade.

The asymmetry in operational cost

Consider the cost structure of any production AI application. Training a foundation model might cost $5M to $100M, depending on scale: a large number, but a fixed cost paid once. Inference for that same model at production scale might cost $2M to $20M per month, depending on traffic and workload characteristics. At matched ends of those ranges ($2M per month against a $5M training run, or $20M per month against $100M), cumulative inference spend passes the training cost within the first six months of production serving. After two years, it is roughly five to ten times larger.
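The break-even arithmetic behind those ranges can be sketched in a few lines. This is an illustrative calculation only: the function names are hypothetical, the dollar figures are the range endpoints quoted above, and it assumes a flat monthly inference bill, which real traffic rarely produces.

```python
def breakeven_month(training_cost: int, monthly_inference_cost: int) -> int:
    """First whole month in which cumulative inference spend exceeds training cost.

    Assumes a constant monthly inference bill (a simplification).
    """
    return training_cost // monthly_inference_cost + 1


def spend_ratio(training_cost: int, monthly_inference_cost: int, months: int) -> float:
    """Cumulative inference spend after `months`, as a multiple of training cost."""
    return monthly_inference_cost * months / training_cost


# Matched ends of the ranges quoted in the text.
low = (5_000_000, 2_000_000)       # $5M training, $2M/month inference
high = (100_000_000, 20_000_000)   # $100M training, $20M/month inference

for training, monthly in (low, high):
    print(
        f"training ${training:,}: inference passes it in month "
        f"{breakeven_month(training, monthly)}, "
        f"{spend_ratio(training, monthly, 24):.1f}x after 24 months"
    )
```

Pairing a low inference bill with an expensive training run pushes the crossover out considerably, so the result is sensitive to which ends of the ranges a given deployment sits at.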

This is not a new dynamic in software infrastructure — it is the same pattern that played out with databases, message queues, and search indexes. The one-time setup cost is dwarfed by the operational cost of serving requests at scale. What is different with AI inference is the degree to which the operational cost can be reduced through engineering — and how much of that reduction potential remains untapped.

A well-optimized inference stack for a production LLM deployment can deliver anywhere from 3x to 15x more throughput per dollar of compute than a naive deployment. That spread is the market opportunity for inference infrastructure companies. The companies that close that gap through continuous batching, quantization, speculative decoding, efficient memory management, and hardware-aware kernel optimization are creating real economic value for every AI application that uses them.
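To make the spread concrete, a throughput-per-dollar multiplier translates directly into a lower bill for the same traffic. The sketch below is a back-of-the-envelope illustration with hypothetical function names; it assumes the gain applies uniformly to the whole workload, and it uses the $2M monthly figure from earlier purely as an example input.

```python
def savings_fraction(throughput_gain: float) -> float:
    """Share of compute spend removed when throughput per dollar rises by `throughput_gain`.

    Serving the same traffic at N times the throughput per dollar costs 1/N as much.
    """
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    return 1.0 - 1.0 / throughput_gain


def optimized_monthly_cost(naive_monthly_cost: float, throughput_gain: float) -> float:
    """Monthly bill for unchanged traffic after the throughput improvement."""
    return naive_monthly_cost * (1.0 - savings_fraction(throughput_gain))


naive_bill = 2_000_000  # example input: the low end of the monthly range above
for gain in (3, 15):
    print(
        f"{gain}x throughput per dollar: "
        f"{savings_fraction(gain):.0%} lower spend, "
        f"${optimized_monthly_cost(naive_bill, gain):,.0f} per month"
    )
```

Under these assumptions, the 3x to 15x range corresponds to cutting compute spend by roughly two-thirds at the low end and by more than nine-tenths at the high end.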

Why training captured the narrative and most of the capital

Training got attention for a structural reason: training results are visible and benchmarkable. You can demonstrate a capability improvement by publishing a benchmark number. Inference improvements are operationally visible to the team running the system (you see lower costs per request and faster p95 latency), but they do not create the dramatic "this model can now do X that no previous model could" demonstrations that generate press coverage and LP interest.

This is a market inefficiency. The investment community is optimized to fund whatever produces headline capabilities, even when the real bottleneck to deploying those capabilities at scale is the serving infrastructure. We think this misallocation created the opportunity for Firntal to invest early in inference infrastructure at valuations that reflected the narrative discount.

The compounding dynamics of inference infrastructure

Training infrastructure has a particular compounding dynamic: the more you train, the better you understand how to configure the next training run, and the better your data infrastructure becomes. The learning is real but it is primarily internal to the model developer.

Inference infrastructure competes on a different learning curve. Every production workload teaches the serving system something about request distribution, batching efficiency, model memory behavior, and hardware utilization patterns. The companies that serve inference at scale accumulate operational knowledge that is genuinely hard to replicate from scratch. A team that has been running production inference for two years has observed workload patterns, failure modes, and optimization opportunities that a team starting fresh cannot simulate.

This operational learning compounds in several ways. It informs kernel optimization — the low-level GPU compute code that determines how efficiently matrix multiplications are executed for a given model architecture and batch size. It informs scheduling decisions — how to multiplex multiple models across a shared GPU fleet to maximize utilization without violating latency SLAs. It informs hardware selection — when to use A100s versus H100s versus AMD MI300X versus custom accelerators, given specific cost-per-token targets and latency requirements.
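The hardware-selection decision described here reduces, in its simplest form, to picking the cheapest option per token among those that meet the latency requirement. The sketch below assumes that simplification. The `HardwareOption` type, the accelerator names, and every number in `OPTIONS` are made-up placeholders rather than measurements of any real GPU; in practice the throughput and latency inputs would come from benchmarking a specific model and batch configuration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HardwareOption:
    name: str
    hourly_cost_usd: float    # cost of one accelerator-hour
    tokens_per_second: float  # sustained throughput for the target model
    p95_latency_ms: float     # observed p95 latency at that throughput


def cost_per_million_tokens(option: HardwareOption) -> float:
    """Dollar cost to generate one million tokens on this option."""
    tokens_per_hour = option.tokens_per_second * 3600
    return option.hourly_cost_usd / tokens_per_hour * 1_000_000


def pick_hardware(
    options: Sequence[HardwareOption], max_p95_latency_ms: float
) -> Optional[HardwareOption]:
    """Cheapest option per token that still meets the latency SLA, or None."""
    eligible = [o for o in options if o.p95_latency_ms <= max_p95_latency_ms]
    if not eligible:
        return None
    return min(eligible, key=cost_per_million_tokens)


# Placeholder figures for illustration only; not benchmarks of real hardware.
OPTIONS = [
    HardwareOption("accel_a", hourly_cost_usd=2.0, tokens_per_second=1000, p95_latency_ms=400),
    HardwareOption("accel_b", hourly_cost_usd=4.0, tokens_per_second=2500, p95_latency_ms=200),
    HardwareOption("accel_c", hourly_cost_usd=1.0, tokens_per_second=1200, p95_latency_ms=900),
]

for sla_ms in (1000, 500, 100):
    choice = pick_hardware(OPTIONS, sla_ms)
    print(f"p95 SLA {sla_ms} ms -> {choice.name if choice else 'no eligible hardware'}")
```

With these placeholder inputs, a loose SLA admits the cheapest-per-token option, a tighter one forces a more expensive accelerator, and a very tight one leaves nothing eligible. That is the kind of trade-off operational data from real workloads helps resolve.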

The model commoditization accelerant

There is a specific dynamic that strengthens the inference infrastructure thesis: model commoditization increases the value of inference infrastructure. As capable open-weight models become available, the barriers to running a high-quality model drop dramatically. A growing engineering team can now run a frontier-quality model on its own infrastructure. But running it well (efficiently, cheaply, and at the latency its application requires) still requires inference infrastructure that most teams cannot build from scratch.

This is the transition we are watching play out with LLaMA and its derivatives. The question for application developers is no longer "can we afford to run a GPT-4-quality model?" but "can we run this open-weight model cost-effectively in production?" That second question is an inference infrastructure question, not a model quality question. The more capable open-weight models become, the larger the market for the serving infrastructure that makes them usable at scale.

What this means for investment decisions

We are not saying training infrastructure is a bad investment. There are strong companies building there, and the market is large. We are saying that the economics of production AI heavily favor the inference side over a multi-year time horizon, and that the narrative overhang of "training is where the AI happens" has created a mispricing that benefits disciplined infrastructure investors.

The infrastructure thesis we backed with Fund I was essentially: training will be dominated by a few well-capitalized players who can sustain billion-dollar compute spends, but inference is a distributed, many-company problem that scales with every production AI deployment. Every company that trains a model or uses a foundation model API also has an inference cost problem. The total addressable market for inference infrastructure is every production AI deployment — which, over the next decade, means most of software.

That is a thesis we continue to believe. The scale of production AI deployments we are seeing in Fund II's portfolio companies validates it more concretely than we could model in 2022.