Why Fine-Tuning Infrastructure Needed a Dedicated Company

Our investment in Predibase in 2022 was an early bet that fine-tuning infrastructure would become a first-class engineering problem: not an academic exercise or an occasional ML task, but a production workflow that teams run repeatedly and systematically as a core part of AI product development. Eighteen months later, the bet looks better than we anticipated. Here is the reasoning behind it, and why we think the infrastructure opportunity is larger than what the first generation of companies is capturing.

The problem: fine-tuning is not the hard part

The common framing of "fine-tuning infrastructure" focuses on the fine-tuning step itself: the training run that adapts a base model to a new domain or task. This framing is wrong, in a useful way. Fine-tuning, as a compute operation, is well understood and getting easier to execute. Parameter-efficient methods such as LoRA, which trains small low-rank adapter matrices while the base weights stay frozen, and QLoRA, which does the same on top of a 4-bit quantized base model, have brought the compute requirements within reach of most teams. The hard part is everything else.
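To make the compute point concrete, here is a back-of-the-envelope sketch of why low-rank adapters are cheap to train. For a single weight matrix of shape d_out x d_in, LoRA trains two factors of shapes d_out x r and r x d_in, so the trainable parameter count is r * (d_out + d_in). The dimensions and rank below are illustrative assumptions, not figures from any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values only: one square 4096 x 4096 projection, adapter rank 8.
d = 4096
rank = 8

full = full_params(d, d)                  # 16,777,216
lora = lora_trainable_params(d, d, rank)  # 65,536

print(f"full: {full:,}  lora: {lora:,}  ratio: 1/{full // lora}")
# prints "full: 16,777,216  lora: 65,536  ratio: 1/256"
```

Under these assumptions the adapter for that matrix has 256 times fewer trainable parameters than the matrix itself, which is the main reason the training step has stopped being the bottleneck.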

Consider what a team needs to fine-tune models in production, not as a one-time experiment but as a systematic workflow. They need to manage and version training datasets so that training runs are reproducible. They need to track fine-tuning experiments (hyperparameter configurations, adapter ranks, learning rates, evaluation scores) across potentially hundreds of runs. They need to evaluate fine-tuned model quality reliably, which requires a held-out evaluation dataset, an automated evaluation pipeline, human evaluation, or some combination, each with its own infrastructure requirements. They need to package fine-tuned model artifacts so that they are deployable, versionable, and easy to roll back. And they need to serve those fine-tuned models at production latency, which for LoRA-adapted models means either merging adapters into the base weights or loading adapters dynamically at inference time.
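As a rough illustration of the reproducibility and tracking requirements, the sketch below fingerprints a training dataset by content and derives a stable run identifier from the configuration. Every name here (dataset_fingerprint, FineTuneRun, run_id, the "base-7b" label) is hypothetical and chosen for the example; it is not the API of any particular platform.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


def dataset_fingerprint(records: List[Dict[str, str]]) -> str:
    """SHA-256 over canonical JSON of each record, in order.

    Key order inside a record does not matter; record order does.
    """
    h = hashlib.sha256()
    for rec in records:
        line = json.dumps(rec, sort_keys=True, separators=(",", ":"))
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


@dataclass(frozen=True)
class FineTuneRun:
    """Hypothetical record of one fine-tuning run."""
    base_model: str
    dataset_sha256: str
    adapter_rank: int
    learning_rate: float
    eval_score: Optional[float] = None

    def run_id(self) -> str:
        """Identifier derived from inputs only, so re-running the same
        configuration on the same data yields the same id."""
        cfg = asdict(self)
        cfg.pop("eval_score")  # results are not part of the run's identity
        blob = json.dumps(cfg, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


data = [{"prompt": "hi", "completion": "there"}]
run = FineTuneRun("base-7b", dataset_fingerprint(data), adapter_rank=8, learning_rate=2e-4)
print(run.run_id())
```

The design choice worth noting is that the identifier depends on the dataset contents rather than on a file path, so a silently edited dataset shows up as a different run.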

None of these problems are solved by fine-tuning libraries or by raw compute access. They require a purpose-built infrastructure stack.

The serving problem is the real moat

The aspect of fine-tuning infrastructure that is most technically interesting — and most underappreciated — is multi-adapter serving. Once a team has fine-tuned models, they typically end up with many of them: different adapters for different tasks, different user segments, different product features, potentially different per-customer model variants if they are building an AI product with customization. Serving this portfolio of fine-tuned models efficiently is a hard systems problem.

The naive approach is to run a separate model server instance for each fine-tuned variant. This is prohibitively expensive at scale: a team with 50 fine-tuned adapters cannot afford to run 50 dedicated GPU instances. The efficient approach is to run a shared base model instance and apply adapters dynamically at request time — switching adapters as requests arrive, batching requests that use the same adapter, managing adapter memory between requests.
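The scheduling half of that idea can be sketched in a few lines. This is a simplified illustration, assuming a hypothetical Request type and a fixed maximum batch size; production servers do considerably more, including scheduling under latency targets.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Request:
    """Hypothetical inference request tagged with the adapter it needs."""
    request_id: str
    adapter_id: str
    prompt: str


def group_by_adapter(
    requests: List[Request], max_batch_size: int
) -> List[Tuple[str, List[Request]]]:
    """Split pending requests into batches that each use a single adapter."""
    by_adapter: Dict[str, List[Request]] = defaultdict(list)
    for req in requests:
        by_adapter[req.adapter_id].append(req)

    batches: List[Tuple[str, List[Request]]] = []
    for adapter_id, reqs in by_adapter.items():
        # Chunk each adapter's queue so no batch exceeds the size limit.
        for i in range(0, len(reqs), max_batch_size):
            batches.append((adapter_id, reqs[i : i + max_batch_size]))
    return batches


pending = [
    Request("r1", "support-bot", "..."),
    Request("r2", "sql-gen", "..."),
    Request("r3", "support-bot", "..."),
    Request("r4", "support-bot", "..."),
]
for adapter_id, batch in group_by_adapter(pending, max_batch_size=2):
    print(adapter_id, [r.request_id for r in batch])
```

Each batch then runs against the one shared copy of the base weights with a single adapter applied, which is what lets 50 adapters share far fewer than 50 GPUs.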

This dynamic adapter loading architecture, sometimes called LoRA serving or multi-LoRA inference, requires implementing adapter switching in the serving kernel, holding adapter weights in VRAM alongside the base model weights, and evicting and reloading adapters when there are more of them than GPU memory can hold at once. This is non-trivial GPU systems engineering. The implementations that handle it well at production serving latency are the product of months of work by people who understand GPU memory architecture.
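The cache-management piece can be illustrated with a least-recently-used policy. This is a host-side sketch only: AdapterCache and load_fn are hypothetical names, capacity is counted in adapters rather than bytes, and LRU is our assumption for the example rather than a claim about what any given serving system uses. The GPU kernel work is the hard part and is not shown.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

W = TypeVar("W")  # whatever object represents loaded adapter weights


class AdapterCache(Generic[W]):
    """Keeps at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity: int, load_fn: Callable[[str], W]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load_fn = load_fn  # assumed to fetch weights from slower storage
        self._resident: "OrderedDict[str, W]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, adapter_id: str) -> W:
        if adapter_id in self._resident:
            self._resident.move_to_end(adapter_id)  # mark as most recently used
            self.hits += 1
            return self._resident[adapter_id]

        self.misses += 1
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # drop the least recently used
        weights = self._load_fn(adapter_id)
        self._resident[adapter_id] = weights
        return weights

    def resident_ids(self) -> list:
        return list(self._resident)


cache = AdapterCache(capacity=2, load_fn=lambda aid: f"weights:{aid}")
for aid in ["a", "b", "a", "c"]:
    cache.get(aid)
print(cache.resident_ids(), cache.hits, cache.misses)  # prints "['a', 'c'] 1 3"
```

Every miss in a real system is a transfer into VRAM on the request path, so the eviction policy and the amount of memory reserved for adapters directly affect tail latency.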

Why this didn't consolidate into general-purpose ML platforms

The reasonable question is: why didn't the existing general-purpose ML platforms (MLflow, Kubeflow, SageMaker) extend to cover fine-tuning infrastructure? The answer is that they were designed for the training-focused ML workflow of 2018–2021, where the primary artifact being managed was a trained model checkpoint and the primary workflow was: prepare data → train → evaluate → deploy. Fine-tuning for LLMs is a different workflow with different characteristics.

LLM fine-tuning operates on base models that are orders of magnitude larger than classical ML models, requiring different approaches to versioning and storage. The quality evaluation problem is qualitatively different: for generative tasks, a simple accuracy metric on a labeled test set is often not enough, and you need generation quality evaluation, which may require another model as a judge or human evaluation at scale. And the serving artifact is different: you are not deploying a self-contained model but an adapter that depends on a specific base model version.
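The dependency between an adapter and its base model can be made explicit in the artifact itself. The sketch below assumes a hypothetical AdapterManifest that records the base model name and revision, and a check that a server might run before attaching an adapter; the field names and the example identifiers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterManifest:
    """Hypothetical metadata stored next to the adapter weights."""
    adapter_id: str
    base_model: str
    base_model_revision: str
    rank: int


class IncompatibleBaseModel(Exception):
    """Raised when an adapter is attached to a base model it was not trained on."""


def check_compatible(
    manifest: AdapterManifest, served_base_model: str, served_revision: str
) -> None:
    expected = (manifest.base_model, manifest.base_model_revision)
    actual = (served_base_model, served_revision)
    if expected != actual:
        raise IncompatibleBaseModel(
            f"adapter {manifest.adapter_id!r} expects {expected}, server has {actual}"
        )


manifest = AdapterManifest("support-bot-v3", "base-7b", "rev-2024-01", rank=8)
check_compatible(manifest, "base-7b", "rev-2024-01")  # passes silently

try:
    check_compatible(manifest, "base-7b", "rev-2024-06")
except IncompatibleBaseModel as exc:
    print(exc)
```

A registry built for self-contained checkpoints has no natural place for this relationship, which is one reason the adapter workflow fits awkwardly into older platforms.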

General-purpose ML platforms could in principle add these capabilities, but doing so requires substantial architectural changes that conflict with their existing design. The result is that teams building production LLM fine-tuning workflows today either cobble together multiple tools (training library + experiment tracker + custom serving code + custom evaluation pipeline) or use a purpose-built fine-tuning infrastructure platform. In our view, the latter is better for velocity and reliability, because a team maintains one integrated system rather than the glue between four.

What the market looks like at scale

Every organization that runs custom AI products, meaning it deploys customized models rather than only calling foundation model APIs, will eventually need fine-tuning infrastructure. As the cost of fine-tuning continues to fall and the quality bar for production AI rises, the fraction of AI products that use custom models will grow. The fine-tuning infrastructure market scales with the production AI deployment market, not just with the research market.

We expect the fine-tuning infrastructure category to consolidate around a small number of platforms that have solved the full stack: dataset management, training orchestration, experiment tracking, evaluation pipelines, model registry, and production serving — including multi-adapter inference at scale. The companies that build the full stack early will have strong switching costs once engineering teams have standardized their workflows on them. The companies that solve only parts of it are vulnerable to consolidation by the full-stack players.

The window to build this full stack is now. The demand is real, the technical problems are hard, and the market has not yet consolidated.