Why We Backed Inference Before the World Cared

We closed Firntal Fund I in June 2022. At the time, the founding question we kept hearing from LPs was not "is your thesis right?" but something closer to "isn't this too narrow?" Eight investments focused almost entirely on inference infrastructure — the systems layer between a trained model and a response that actually reaches an end user — seemed like an odd place to concentrate $28M.

Eighteen months later, that question has reversed. Now the world wants to know how you were there before the stampede. This is an attempt to write down the actual reasoning, while it's still early enough that we can be honest about what we knew versus what we guessed.

The problem we saw coming out of the lab

My background before founding Firntal was building GPU scheduling infrastructure at a hyperscale cloud provider in Zurich. Not the training layer — the layer that runs after training finishes and a model needs to serve actual requests at scale. That work taught me something that research papers don't make obvious: training a model is a one-time cost. Serving it is a permanent, compounding, operationally complex cost that every production deployment inherits.

In 2020 and 2021, the AI field was living on the excitement of what models could do on benchmarks. GPT-3 was genuinely stunning. But almost every deployment story I heard from engineers at the time had the same structure: "we got it working in the lab, we tried to productionize it, and the economics broke immediately." Latency was three to ten times too high for any interactive application. Cost per request was off by an order of magnitude. Cold start times on GPU instances produced user experiences that no product team could ship.

None of this was a fundamental model quality problem. It was an infrastructure problem — specifically, an inference infrastructure problem that nobody had built first-class tooling to solve.

The gap between training and deployment tooling

The disparity between the maturity of training tools and the maturity of inference tools was striking. By 2021, you had robust frameworks for distributed training, automated hyperparameter tuning, experiment tracking, and dataset management. The training side of the ML lifecycle had accumulated years of engineering attention. The inference side had almost none.

Serving a model at production scale meant solving several problems at once: batching requests to maximize GPU utilization without blowing up latency, managing model versions and rollouts, handling hardware heterogeneity across cloud providers, scaling from zero to peak traffic without cold-start death, and managing cost when your model was too large to fit on a single GPU with default configurations. Each of these was a PhD-level systems problem being solved ad hoc by application engineers who did not have the background to solve it well.
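
To make the first of those problems concrete, here is a minimal sketch of the batching trade-off. It is an illustration under stated assumptions, not a description of any production serving system: the names Request and form_batches, and the two knobs max_batch_size and max_wait_ms, are hypothetical, and a real server would close batches on a timer rather than on the next arrival.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int    # hypothetical identifier, for illustration only
    arrival_ms: float  # arrival time in milliseconds


def form_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[Request]]:
    """Group requests into batches for one GPU forward pass each.

    A batch is closed when it is full (good for utilization) or when a new
    request arrives more than max_wait_ms after the batch's oldest request
    (a cap on how much latency batching adds). Simplification: the wait
    limit is only checked when the next request arrives.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Oldest queued request has waited too long: ship what we have.
        if current and req.arrival_ms - current[0].arrival_ms > max_wait_ms:
            batches.append(current)
            current = []
        current.append(req)
        # Batch is full: ship it immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0, 1, 2, 3, 50, 51]
    reqs = [Request(i, t) for i, t in enumerate(arrivals)]
    for batch in form_batches(reqs, max_batch_size=4, max_wait_ms=10):
        print([r.request_id for r in batch])
    # prints [0, 1, 2, 3] then [4, 5]
```

Raising max_batch_size improves GPU utilization per pass, and raising max_wait_ms fills more batches, but both push up the latency each request sees. Choosing those two numbers well for a given model and traffic pattern is the kind of decision that was being made ad hoc.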

The companies that understood this gap were building inside hyperscalers — Google, Meta, Microsoft — and those systems were not going to be productized and made available to the rest of the ecosystem. The gap was real and structural, not a temporary tooling lag that would close on its own.

Why inference, not training

The obvious objection to an inference-first thesis in 2022 was: "but foundation model companies are raising the real money — why not invest in the models themselves?" We thought about this seriously and came to a position we've held consistently: the model layer will commoditize faster than anyone expects, and the infrastructure layer beneath it will accrue durable value precisely because of that commoditization.

We are not saying the model layer is worthless. We are saying that model capability is becoming a function of compute spend and data access, and those are categories that commoditize faster than software does. The inference infrastructure layer, by contrast, gets more defensible as the model layer gets more competitive. When you have ten capable models to choose from, you need routing infrastructure. When you have models deployed across multiple clouds, you need serving abstraction. When cost pressure from competitive models forces you to run smaller fine-tuned versions, you need fine-tuning tooling. Each of those trends strengthens the inference infrastructure thesis rather than undermining it.

What we got right and what we got wrong

We got the timing approximately right. The window of maximum opportunity for inference infrastructure companies is roughly 2022 to 2026 — after LLMs are clearly working but before the major cloud providers have fully bundled and commoditized the adjacent serving tools. Fund I's deployment period sits exactly in that window.

We got the attack vectors approximately right. Fine-tuning infrastructure, serverless GPU serving, model compression, model routing — those are exactly the categories where we see real companies with real traction building today. We invested in teams working on each of them before the category names were standard vocabulary.

What we were less right about: the speed at which inference costs would fall due to hardware improvements alone. We anticipated that the efficiency gains would come primarily from software — better batching, quantization, kernel optimization — but the hardware trajectory (H100 → AMD MI300X → Blackwell) has also been a massive driver. This has made the market bigger faster than we modeled, which is a good problem to have. It also means that teams solving software-layer efficiency problems have had to stay technically differentiated as the hardware baseline moved rapidly underneath them — which has raised the bar for what "good" looks like in the portfolio.

What we want to build with Fund II

Fund II closes at $40M — roughly 40% larger than Fund I. The thesis has not changed. The vocabulary around the thesis has changed because the world has caught up to the problem set we were investing in. When we wrote our Fund I LP memo in early 2022, we had to spend the first two pages establishing that inference cost and latency were real production problems before we could describe companies working on them. We no longer need to do that.

What Fund II adds is a sharper focus on the next-layer problems: model routing at scale, edge and hybrid inference architectures, and the developer tooling that makes fine-tuned models behave like first-class deployable artifacts rather than research outputs that require heroic engineering to serve. The thesis is the same; the companies are working on harder versions of it.

We are also more deliberate about what we are not doing with Fund II. We are not investing in AI applications — in the companies using inference infrastructure to build end-user products. That is a large market and will produce large companies, but it is not the layer where our operating knowledge compounds. We have been inside the infrastructure layer, building it, for most of our careers. That is where our judgment is worth something, and where we will stay.