Fine-tuning has gone from an academic technique to an engineering requirement in the span of about eighteen months. The LLaMA release in February 2023 crystallized something that was already becoming clear: the teams building production AI applications could not depend on one-size-fits-all foundation models for specialized workloads. They needed domain-adapted models, and they needed a reliable, repeatable way to produce them. The tooling to do this at production quality, at reasonable cost, and with good reproducibility is still being built.

This is a survey of where the fine-tuning landscape stands as of mid-2023. It is written for engineers and infrastructure builders, not as a comprehensive academic review. The goal is to describe the practical state of the methods, identify the real unsolved problems, and point toward where dedicated fine-tuning infrastructure makes sense versus where you should use existing tools.

The main methods and their practical tradeoffs

Full fine-tuning means updating all model weights on a domain-specific dataset. This generally produces the best-quality adapted model, but GPU memory has to hold the weights, the gradients, and the optimizer state: typically 16 to 24 bytes per parameter for mixed-precision training with Adam. For a 7B parameter model, that is roughly 110 to 170GB of GPU memory before you account for activation memory. Full fine-tuning at this scale is effectively out of reach for teams without dedicated training infrastructure.
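
The accounting behind an estimate like this can be sketched in a few lines. This is a minimal sketch, assuming a commonly used per-parameter breakdown for mixed-precision Adam (16-bit weights and gradients, plus 32-bit master weights and two Adam moments). The function name is hypothetical, and activation memory is deliberately excluded.

```python
# Rough memory estimate for full fine-tuning with mixed-precision Adam.
# The bytes-per-parameter figures are an assumed, commonly used accounting,
# not a measurement. Activation memory is deliberately left out.

BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def full_finetune_memory_gb(num_params: float) -> float:
    """Return weight + gradient + optimizer memory in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = sum(BYTES_PER_PARAM.values())  # 16 bytes in this accounting
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"7B model: {full_finetune_memory_gb(7e9):.0f} GB before activations")
```

Under this accounting a 7B model needs about 112GB for weights, gradients, and optimizer state alone, which already exceeds a single 80GB accelerator.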

LoRA (Low-Rank Adaptation), introduced in the 2021 paper by Hu et al. from Microsoft Research, reduces the trainable parameter count by freezing the base weights and expressing each weight update as the product of two much smaller low-rank matrices. The key insight is that the weight updates needed for fine-tuning have an intrinsically low-rank structure: you do not need to update the full weight matrix to achieve good adaptation. In practice, LoRA reduces memory requirements by 3x to 10x compared to full fine-tuning while achieving similar adaptation quality for most downstream tasks. The tradeoff is additional inference latency if the LoRA adapters are kept separate from the base weights at serving time; merging them into the base weights removes that overhead.
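
The low-rank idea can be illustrated without any ML framework. The following is a minimal pure-Python sketch, not the reference implementation: W is a frozen d x k weight matrix, the update is (alpha / r) * B @ A with B of shape d x r and A of shape r x k, and all function names are hypothetical.

```python
# Minimal LoRA sketch in pure Python, for illustration only.
# W is frozen; only the small matrices B (d x r) and A (r x k) would be trained.


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_counts(d: int, k: int, r: int) -> tuple:
    """Return (full-update parameters, LoRA parameters) for one d x k matrix."""
    return d * k, r * (d + k)


def merge(W, B, A, alpha: float, r: int):
    """Fold the adapter into the base weights: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def forward_unmerged(W, B, A, alpha: float, r: int, x):
    """Apply base and adapter separately: W x + (alpha / r) * B (A x)."""
    scale = alpha / r
    base = matmul(W, x)
    low_rank = matmul(B, matmul(A, x))
    return [[b + scale * l for b, l in zip(b_row, l_row)]
            for b_row, l_row in zip(base, low_rank)]


if __name__ == "__main__":
    full, lora = lora_param_counts(d=4096, k=4096, r=8)
    print(f"full update: {full} params, rank-8 LoRA: {lora} params")
```

The merged and unmerged forms compute the same output. The unmerged form does the extra low-rank multiplications on every forward pass, while the merged form pays that cost once, up front.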

QLoRA, from the Dettmers et al. paper released in May 2023, pushes this further by quantizing the frozen base model to 4-bit precision while keeping the LoRA adapter weights in higher precision. The practical effect is dramatic: a 7B parameter model that normally requires 14GB of GPU memory just to hold its weights in 16-bit precision can be fine-tuned on a single 24GB consumer GPU. For the first time, fine-tuning multi-billion-parameter models is accessible to teams without dedicated GPU clusters. QLoRA's quality on most benchmarks is within a few percentage points of full fine-tuning, which is acceptable for many production workloads.
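
A rough sense of where the savings come from can be had from the arithmetic alone. This is a minimal sketch under stated assumptions: the adapter size (40 million parameters) is a hypothetical figure, the training-state accounting is a simplification, and quantization constants, activations, and framework overhead are ignored.

```python
# Rough weight-memory comparison for a QLoRA-style setup. Illustrative only:
# it ignores quantization constants, activations, and framework overhead.
# The adapter size passed in below is an assumed, hypothetical figure.


def weights_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for storing parameters at a given precision (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def qlora_estimate_gb(base_params: float, adapter_params: float) -> dict:
    """Break down memory for a 4-bit frozen base plus a trainable 16-bit adapter."""
    base = weights_gb(base_params, 4)
    adapter = weights_gb(adapter_params, 16)
    # Assumed training state, needed for adapter parameters only:
    # 16-bit gradients (2 bytes) plus two 32-bit Adam moments (8 bytes).
    training_state = adapter_params * (2 + 8) / 1e9
    return {
        "base_4bit": base,
        "adapter_16bit": adapter,
        "adapter_training_state": training_state,
        "total": base + adapter + training_state,
    }


if __name__ == "__main__":
    print(f"7B base in 16-bit: {weights_gb(7e9, 16):.1f} GB")
    for name, gb in qlora_estimate_gb(7e9, 40e6).items():
        print(f"{name}: {gb:.2f} GB")
```

In this accounting the frozen base dominates and the trainable state is small, so most of the remaining budget on a 24GB card goes to activations. Batch size and sequence length still matter in practice.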

Instruction tuning and RLHF (Reinforcement Learning from Human Feedback) are less distinct fine-tuning methods than techniques for aligning a model's behavior with human preferences; both are applied through fine-tuning. Instruction tuning creates a dataset of (prompt, desired response) pairs and fine-tunes the model to follow that format. RLHF is more complex: it involves training a reward model from human preference judgments and then optimizing the model against that reward signal. RLHF is behind the best-aligned models available today (it is what makes GPT-4 feel "helpful" rather than just capable) but requires significantly more infrastructure complexity and human annotation cost.
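
Two pieces of this are small enough to sketch directly: formatting a (prompt, response) pair for instruction tuning, and a commonly used pairwise loss for training a reward model on preference judgments. This is a minimal sketch; the prompt template is hypothetical, and the reward values would come from a learned model rather than being passed in as plain numbers.

```python
import math

# Hypothetical prompt template; real projects choose their own format.
TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:\n{response}"


def format_example(prompt: str, response: str) -> str:
    """Render one (prompt, desired response) pair as a training string."""
    return TEMPLATE.format(prompt=prompt.strip(), response=response.strip())


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way.

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it ranks them the
    wrong way round.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    print(format_example("Summarize the clause.", "The clause limits liability."))
    print(pairwise_preference_loss(2.0, 0.5))
```

The policy optimization step that follows reward model training is considerably more involved and is not shown here.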

Where the tooling gap is real

The academic methods exist and work. The gap is in the production tooling around them. Consider what a team needs to run fine-tuning as a repeatable production workflow, not a one-time experiment:

Dataset management — fine-tuning quality is highly sensitive to data quality, deduplication, formatting, and the ratio of domain-specific data to general data. There are no production-grade tools for fine-tuning dataset pipelines comparable to what exists for training data pipelines in classical ML.
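
Two of the concerns named here, deduplication and the domain-to-general mixing ratio, can be sketched with the standard library. This is a minimal sketch, assuming exact-match deduplication after whitespace and case normalization; production pipelines usually also need near-duplicate detection. All function names are hypothetical.

```python
import hashlib
import random


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(examples):
    """Keep the first occurrence of each example, compared after normalization."""
    seen = set()
    unique = []
    for text in examples:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def mix(domain, general, domain_fraction: float, total: int, seed: int = 0):
    """Sample a fixed-size training set with a target share of domain data."""
    n_domain = round(total * domain_fraction)
    n_general = total - n_domain
    if n_domain > len(domain) or n_general > len(general):
        raise ValueError("not enough examples to reach the requested mix")
    rng = random.Random(seed)  # seeded so the same mix can be regenerated
    mixed = rng.sample(domain, n_domain) + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed
```

Seeding the sampler is a deliberate choice: it lets a given dataset mix be reproduced exactly when a fine-tuning run has to be repeated or audited.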

Experiment tracking for fine-tuning runs — Weights & Biases and MLflow exist, but neither is designed with the fine-tuning workflow in mind. Tracking adapter configurations, quantization settings, merged weight variants, and their corresponding evaluation scores across dozens of fine-tuning runs is still largely a manual process.
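
One way to make this less manual is to treat each run's configuration as a hashable record and key evaluation scores by its fingerprint. This is a minimal sketch of that idea, not the API of any existing tracker; the class, field names, and model name are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FineTuneRun:
    """Hypothetical record of the settings that define one fine-tuning run."""
    base_model: str
    adapter_rank: int
    adapter_alpha: float
    target_modules: tuple
    quantization: str  # for example "none" or "4bit"
    merged: bool       # whether the adapter was folded into the base weights
    dataset_version: str
    seed: int

    def fingerprint(self) -> str:
        """Stable short hash of the configuration, usable as a registry key."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record(registry: dict, run: FineTuneRun, scores: dict) -> str:
    """Store a run's configuration and evaluation scores under its fingerprint."""
    key = run.fingerprint()
    registry[key] = {"config": asdict(run), "scores": dict(scores)}
    return key


if __name__ == "__main__":
    registry = {}
    run = FineTuneRun("example-7b", 8, 16.0, ("q_proj", "v_proj"),
                      "4bit", False, "contracts-v3", 0)
    print(record(registry, run, {"exact_match": 0.0}))  # placeholder score
```

Because the fingerprint depends only on the configuration, two runs with identical settings collide on the same key, which makes accidental duplicates and silent configuration drift easier to spot.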

Serving fine-tuned models — LoRA adapters can be served in two modes: merged into the base model ahead of serving (better latency, but a full copy of the weights per variant) or applied dynamically at inference time (allows multiple adapters on one base model instance, but adds latency). The tooling for multi-adapter serving, meaning switching adapters per request against a shared base model, is immature in every production serving system today. This is an unsolved infrastructure problem that will become important as production teams accumulate many fine-tuned variants.
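
The dynamic mode can be pictured as a single shared weight matrix plus a lookup table of small adapters, with the adapter chosen per request. This is a toy, single-layer sketch in pure Python under that assumption; the class and adapter names are hypothetical, and real serving systems also have to handle batching across adapters, memory paging, and scheduling.

```python
# Toy illustration of per-request adapter selection against a shared base layer.
# Vectors are plain lists; matrices are lists of rows.


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class MultiAdapterLayer:
    """One frozen base weight shared by many low-rank adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # loaded once, shared by every request
        self.adapters = {}              # adapter name -> (B, A, scale)

    def register(self, name, B, A, scale=1.0):
        """Add a low-rank adapter (B: d x r, A: r x k) under a name."""
        self.adapters[name] = (B, A, scale)

    def forward(self, x, adapter=None):
        """Compute W x, plus scale * B (A x) when an adapter is requested."""
        y = matvec(self.base_weight, x)
        if adapter is None:
            return y
        B, A, scale = self.adapters[adapter]  # raises KeyError if unknown
        delta = matvec(B, matvec(A, x))
        return [yi + scale * di for yi, di in zip(y, delta)]


if __name__ == "__main__":
    layer = MultiAdapterLayer([[1, 0], [0, 1]])
    layer.register("legal", B=[[1], [0]], A=[[0, 1]])
    layer.register("medical", B=[[0], [1]], A=[[1, 0]], scale=2.0)
    for name in (None, "legal", "medical"):
        print(name, layer.forward([2, 3], adapter=name))
```

Only the small B and A matrices differ between tenants, so adding a variant costs a few extra parameters instead of a full model copy. The extra multiplications on each request are where the added latency comes from.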

The business case for dedicated fine-tuning infrastructure

The question we hear from engineering teams is: why not just use the fine-tuning APIs from the major model providers? The answer is that vendor-provided fine-tuning APIs are reasonable for simple adaptation tasks — format training, few-shot style adaptation — but break down for the use cases where fine-tuning provides the largest quality improvement.

Domain adaptation for technical fields (legal, medical, scientific) requires large domain-specific datasets that teams are not willing to upload to external APIs. Cost control requires the ability to select the base model appropriate for the task, not the provider's managed model selection. Reproducibility requires version-controlled training runs with full configuration auditability — something SaaS fine-tuning APIs do not provide. And multi-adapter serving — the ability to maintain a portfolio of fine-tuned variants per tenant or use case — is simply not offered by the current generation of managed APIs.

We are not saying vendor fine-tuning APIs are bad. We are saying that the ceiling on what they can deliver for sophisticated use cases is well below what teams can achieve with dedicated fine-tuning infrastructure running on their own compute.

What to watch for in the next twelve months

The QLoRA paper will accelerate the democratization of fine-tuning substantially. We expect to see a significant increase in the number of production teams running their own fine-tuned models by Q4 2023. This will create demand for everything downstream: tooling for managing fine-tuned model registries, serving infrastructure for multi-adapter deployment, evaluation frameworks for fine-tuned model quality assessment, and data pipelines for domain-specific fine-tuning datasets.

The companies building the production layer for this workflow — not just the fine-tuning method itself but the full lifecycle management around it — are working on the right problem at the right time. The method innovation is essentially done; the infrastructure to make it usable at scale is the remaining engineering work.