Inference Market Consolidation: A March 2026 Update

The inference infrastructure market in early 2026 looks structurally different from the one we saw when we closed Fund II eighteen months ago. Some of what has changed was predictable from the thesis we wrote at Fund I close. Some of it reflects dynamics that accelerated faster than our models anticipated. I want to write down an honest assessment of both, because our LPs deserve it and because the inference market is complex enough that high-level takes such as "consolidation is happening" or "it's still fragmented" don't actually tell you where the value is going.

The short version: the serving layer below the application tier is consolidating faster than expected, primarily through hyperscaler absorption of inference-as-a-service capabilities. The optimization and tooling layer immediately above the raw serving infrastructure is staying more fragmented than expected, because the problem set keeps advancing. These two trends are moving at the same time in different directions, and the investment implications are different for each.

Serving layer consolidation: faster than expected

In our Fund I thesis, we anticipated that the major cloud providers would eventually build competitive managed inference services — the equivalent of what SageMaker was for ML training workloads. We expected this to take four to five years after the LLM wave became commercially obvious. It has happened faster.

AWS Bedrock, Google Vertex AI, and Azure AI Foundry have each reached a level of capability and pricing competitiveness that makes them a credible default choice for organizations that don't have a specific reason to run their own inference infrastructure. The managed services now cover the main open-source model families at multiple quantization levels, with SLA-backed latency and throughput guarantees, integrated monitoring, and pay-per-token pricing that is competitive with self-managed deployments for most workload profiles. The integration overhead of these services is now lower than the engineering overhead of self-managing inference infrastructure for the majority of enterprise use cases.
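The comparison between pay-per-token pricing and self-managed deployment comes down to break-even arithmetic, which can be sketched roughly as follows. Every number here (GPU count, hourly rate, engineering overhead, per-token price) is a hypothetical placeholder for illustration, not a quote from any provider, and the sketch assumes the self-managed fleet has enough capacity to serve the volume in question.

```python
# Rough break-even sketch: managed pay-per-token vs. self-managed GPUs.
# All inputs are hypothetical; substitute real quotes before drawing conclusions.

HOURS_PER_MONTH = 730.0  # average hours in a month


def managed_monthly_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly bill on a pay-per-token managed service."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def self_managed_monthly_cost(
    gpu_count: int,
    usd_per_gpu_hour: float,
    engineering_usd_per_month: float = 0.0,
) -> float:
    """Monthly cost of an always-on GPU fleet plus the team that runs it."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_MONTH + engineering_usd_per_month


def break_even_tokens(self_managed_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which the two options cost the same."""
    return self_managed_usd / usd_per_million_tokens * 1_000_000


# Hypothetical workload: 4 GPUs at $2.50/hour, $20,000/month of engineering
# time, and a managed price of $1.00 per million tokens.
fleet_usd = self_managed_monthly_cost(4, 2.50, engineering_usd_per_month=20_000.0)
threshold = break_even_tokens(fleet_usd, usd_per_million_tokens=1.00)
```

Below the threshold volume the managed service is cheaper under these assumptions; above it, self-managing starts to pay for itself. The engineering line item is usually what moves the threshold most, which is consistent with the integration-overhead point above.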

This has compressed the addressable market for pure-play inference-as-a-service companies, the tier of companies building GPU clouds specifically to serve LLM inference at scale. That compression is real and we've seen it in the fundraising environment: companies raising for generic inference-as-a-service rounds in late 2025 faced significantly more investor skepticism than they would have in 2023. The competitive question ("why won't AWS win this?") is now much harder to answer with confidence for generic serving workloads.

The companies that have held their position in this environment are those that built specific advantages in one or more of: hardware mix optimization for cost-sensitive workloads (using AMD MI300X and custom ASICs alongside Nvidia), geographic coverage in markets where the major hyperscalers have limited data sovereignty options, or vertical-specific SLAs for industries with specific compliance requirements. Those moats are real, but they are narrower than the broad inference-as-a-service opportunity appeared to be in 2022.

Tooling layer fragmentation: more persistent than expected

The layer above raw serving — the tooling that makes inference systems deployable, observable, optimizable, and cost-manageable for engineering teams — is more fragmented in 2026 than our models suggested it would be. We expected that one or two platforms would consolidate the major workflows: fine-tuning orchestration, model evaluation, deployment versioning, cost monitoring, and model routing. That hasn't happened.

The reason it hasn't happened is that the problem set keeps advancing faster than any platform can stabilize. Fine-tuning tooling that was state-of-the-art for GPT-3.5-class models needed significant rearchitecting for the 70B+ instruction-tuned models that became standard in 2024. Model routing that was designed for a world with two or three capable models needed to be rethought as the number of viable candidate models expanded. Evaluation frameworks built before multimodal capabilities became common couldn't handle the new evaluation surface. Each of these transitions fragmented the market because the teams that had built deep expertise in the prior generation weren't always the ones who moved fastest to the new problem set.

We are not saying the tooling market will remain fragmented forever. We are saying that the consolidation thesis — one platform wins model lifecycle management the way GitHub won code version control — depends on the problem set stabilizing enough that one platform can build durable depth across the major workflows. That stabilization hasn't happened yet, and based on the current pace of model architecture and capability changes, we don't expect it in the next 18 months.

Where value is accruing in the current environment

Given this structural picture, the companies in our portfolio that have the most durable value positioning are those building at the intersection of the two trends: tools that help engineering teams operate efficiently on top of the managed serving infrastructure that has commoditized, rather than replacing the serving infrastructure itself.

Model routing is the clearest example. As managed inference services have proliferated and as the number of viable model options has expanded, the decision of which model to route a given request to has become more consequential and more complex. Routing decisions that optimize simultaneously for cost, latency, and quality on a per-request basis — based on the specific characteristics of the request, the current availability and pricing of different model tiers, and historical performance data for comparable requests — represent a non-trivial optimization problem that the hyperscalers have not solved well in their default offerings. The companies building sophisticated routing infrastructure on top of the commoditized serving layer are creating value that compounds with the proliferation of models rather than being undermined by it.
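To make the shape of that optimization problem concrete, here is a minimal sketch of per-request routing as constrained cost minimization: keep only the models that are available and meet the request's latency budget and quality floor, then pick the cheapest. The model names, prices, latencies, and quality scores are hypothetical, and production routers will likely use richer signals (live pricing, learned quality predictors) than this static catalog.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str                   # hypothetical model tier name
    usd_per_1k_tokens: float    # current price for this tier
    expected_latency_ms: float  # from historical data on comparable requests
    expected_quality: float     # 0..1 score from historical evaluations
    available: bool = True


@dataclass(frozen=True)
class Request:
    est_tokens: int
    latency_budget_ms: float
    quality_floor: float


def request_cost(option: ModelOption, request: Request) -> float:
    """Estimated spend for serving this request on this model."""
    return option.usd_per_1k_tokens * request.est_tokens / 1000.0


def route(request: Request, options: Sequence[ModelOption]) -> Optional[ModelOption]:
    """Return the cheapest available model meeting latency and quality limits."""
    feasible = [
        o for o in options
        if o.available
        and o.expected_latency_ms <= request.latency_budget_ms
        and o.expected_quality >= request.quality_floor
    ]
    if not feasible:
        return None  # caller decides whether to relax constraints or fail
    # Cheapest first; ties broken by higher quality, then lower latency.
    return min(
        feasible,
        key=lambda o: (request_cost(o, request), -o.expected_quality, o.expected_latency_ms),
    )


# Hypothetical catalog spanning three model tiers.
CATALOG = [
    ModelOption("small", usd_per_1k_tokens=0.0002, expected_latency_ms=300, expected_quality=0.70),
    ModelOption("medium", usd_per_1k_tokens=0.001, expected_latency_ms=600, expected_quality=0.85),
    ModelOption("large", usd_per_1k_tokens=0.01, expected_latency_ms=1500, expected_quality=0.95),
]

choice = route(Request(est_tokens=2000, latency_budget_ms=1000, quality_floor=0.8), CATALOG)
```

The hard part in practice is not the selection step but keeping the catalog's price, latency, and quality estimates accurate per request type as the set of candidate models changes.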

Fine-tuning infrastructure is a second area. The shift from few-shot prompting to fine-tuned model deployment has accelerated as the cost of fine-tuning has fallen and as the quality gap between fine-tuned domain-specific models and prompted general-purpose models has become clear for structured tasks. But the operational overhead of fine-tuning — dataset management, training run orchestration, evaluation, safety testing, deployment — remains high for most engineering teams. Companies reducing that overhead while maintaining production-grade reliability are addressing a market that is growing with model fine-tuning adoption rather than shrinking with serving commoditization.

The portfolio position from here

Fund II has three remaining deployment slots. Based on the current market picture, we're actively evaluating companies in model observability, advanced routing, and the developer tooling around fine-tuned model lifecycle management. We are less interested, at current valuations, in companies whose primary value proposition is raw GPU compute access or generic managed inference — not because these aren't real businesses, but because the exit multiples available in those categories have compressed along with the fundraising environment.

The inference infrastructure thesis has aged well in the sense that the category we identified as important is now clearly important — there is no serious debate about whether inference infrastructure matters. The thesis has been stress-tested in the sense that the hyperscaler acceleration of managed services forced us to sharpen our thinking about which layers of the stack produce durable value versus which produce transitional value that gets absorbed. We entered this period with that distinction already in our analysis. The market has now validated the distinction empirically. We think it will continue to matter through the back half of Fund II's deployment period and well into Fund III, which we expect to begin discussing with LPs in late 2026.

A note on team evolution

We've added Niklas Hofer as a full partner over the past year, focused specifically on hardware economics and GPU cloud dynamics. That addition reflects our view that the infrastructure layer requires deep hardware economics knowledge to evaluate correctly in 2026. The days when inference infrastructure investment decisions were primarily software architecture decisions are behind us. The best opportunities now require understanding the interaction between software optimization, hardware characteristics, and cloud economics simultaneously. Having a partner who built GPU scheduling systems at a hyperscale provider before joining Firntal is not a nice-to-have at this stage of the market; it's the core of the technical diligence capability. His first full piece, on sovereign AI cloud dynamics, went up in December. It covers the geographic dimension of the consolidation picture, which this post has examined only at the technology layer.