Model Compression at Production Scale

The literature on model compression reads as a collection of promising techniques in search of production constraints. Quantization, weight pruning, knowledge distillation, low-rank factorization — each has a well-developed theoretical basis and a set of benchmark results that look compelling. What the papers don't fully surface: the conditions under which each technique holds those gains in a real serving environment, and the conditions under which it silently degrades in ways that don't show up in standard benchmarks.

We've spent time on this across several portfolio companies, and the practical picture is messier than the research picture. That messiness is actually the opportunity — teams that navigate it well build durable cost advantages that don't erode with the next hardware generation.

Quantization: where the gains are real and where they aren't

Post-training quantization (PTQ) from FP16 to INT8 typically costs 1-3% on standard benchmarks and delivers roughly 1.5-2x memory reduction (weights halve exactly; the end-to-end figure is lower when activations and the KV cache stay in FP16), plus meaningful throughput gains on hardware with fast INT8 GEMM kernels. These numbers are real, and for most production use cases (document processing, code completion, general Q&A) the quality hit is acceptable.
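
As a rough illustration of why the memory reduction lands between 1.5x and 2x rather than at exactly 2x, here is a back-of-the-envelope sketch. The 7B parameter count and the 4 GB KV cache are hypothetical numbers chosen for the example, not measurements from any deployment.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores quantization scales and zero-points, which add a small overhead.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical deployment: 7B parameters plus 4 GB of KV cache kept in FP16.
NUM_PARAMS = 7e9
KV_CACHE_GB = 4.0

fp16_weights = weight_memory_gb(NUM_PARAMS, 16)
int8_weights = weight_memory_gb(NUM_PARAMS, 8)

# Weights alone halve; the unquantized KV cache dilutes the overall gain.
weights_only_ratio = fp16_weights / int8_weights
end_to_end_ratio = (fp16_weights + KV_CACHE_GB) / (int8_weights + KV_CACHE_GB)

print(f"FP16 weights: {fp16_weights:.1f} GB, INT8 weights: {int8_weights:.1f} GB")
print(f"weights-only reduction: {weights_only_ratio:.2f}x")
print(f"end-to-end reduction:   {end_to_end_ratio:.2f}x")
```

Under these assumed numbers the weights shrink by 2x but the total footprint shrinks by only about 1.6x, because the KV cache is untouched. Longer contexts or larger batches push the end-to-end figure lower still.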

Where PTQ to INT8 starts to break down: tasks requiring precise numerical reasoning, long-context retrieval where small errors compound, and domain-specific fine-tuned models where the calibration dataset used during quantization doesn't match the production input distribution. The last one is the silent killer. A model quantized with a generic text calibration set can lose significant performance on specialized domains (legal, medical, financial) because the activation ranges that matter for those inputs weren't represented in calibration. The benchmark numbers looked fine; production quality did not.
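
The calibration mismatch can be reproduced in miniature. The sketch below uses a simple symmetric max-abs INT8 scheme on synthetic activations; the two Gaussian distributions standing in for "generic" and "domain" inputs are assumptions made for illustration, and real quantizers use more sophisticated range estimation.

```python
import numpy as np


def int8_scale(calibration: np.ndarray) -> float:
    """Symmetric per-tensor scale derived from the calibration set's max-abs value."""
    return float(np.max(np.abs(calibration))) / 127.0


def fake_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Quantize to the INT8 grid, clip to the representable range, then dequantize."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale


def relative_error(x: np.ndarray, scale: float) -> float:
    """Mean squared quantization error, normalized by signal power."""
    return float(np.mean((x - fake_quantize(x, scale)) ** 2) / np.mean(x ** 2))


rng = np.random.default_rng(0)
# Synthetic stand-ins: generic text activations vs. a wider-range specialized domain.
generic_calibration = rng.normal(0.0, 1.0, size=10_000)
domain_inputs = rng.normal(0.0, 4.0, size=10_000)

mismatched_error = relative_error(domain_inputs, int8_scale(generic_calibration))
matched_error = relative_error(domain_inputs, int8_scale(domain_inputs))

print(f"calibrated on generic data: {mismatched_error:.4f}")
print(f"calibrated on domain data:  {matched_error:.6f}")
```

The scale learned from the generic set clips a large share of the domain activations, so the error on domain inputs is far higher than with matched calibration. A benchmark drawn from the generic distribution would never exercise the clipped range, which is why the problem stays invisible until production.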

INT4 quantization, especially with GPTQ and more recently AWQ (Activation-aware Weight Quantization), pushes further, with roughly another 2x reduction in weight memory. The tradeoff sharpens: you're now looking at 5-15% benchmark degradation depending on task and model family, and the tail behavior becomes more variable. For a growing team deploying a customer-facing application for the first time, INT4 is probably not where you start unless cost is genuinely the gate.

Pruning: the production integration problem

Structured pruning — removing attention heads, FFN layers, or entire transformer blocks — can achieve significant parameter reductions, but it has an integration problem that quantization doesn't: the resulting model has a different architecture than the original. This means you can't drop in a pruned model and expect your serving stack to work unchanged. Inference frameworks that are compiled or kernel-optimized for specific architecture shapes (transformer block count, attention head configuration) may not support the pruned variant out of the box.

Unstructured pruning — zeroing individual weights — avoids this problem but doesn't deliver hardware speedup without sparse matrix support in the serving layer. Standard dense GEMM operations on sparse weight tensors don't get faster just because most of the values are zero. You need dedicated sparse inference hardware or kernels, which narrows the deployment target significantly.
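
To make the dense-versus-sparse point concrete, the sketch below applies simple magnitude pruning to a random weight matrix and counts multiply-accumulate operations (MACs). The helper names and the 90% sparsity level are illustrative assumptions; the MAC counts are an idealized model, not a benchmark of any kernel.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned


def dense_macs(weights: np.ndarray, batch: int) -> int:
    """A dense GEMM touches every weight, zero or not."""
    return batch * weights.shape[0] * weights.shape[1]


def sparse_macs(weights: np.ndarray, batch: int) -> int:
    """An ideal sparse kernel would only touch nonzero weights."""
    return batch * int(np.count_nonzero(weights))


rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
w_pruned = magnitude_prune(w, sparsity=0.9)

print("dense MACs before pruning:", dense_macs(w, batch=8))
print("dense MACs after pruning: ", dense_macs(w_pruned, batch=8))
print("ideal sparse MACs:        ", sparse_macs(w_pruned, batch=8))
```

The dense MAC count is identical before and after pruning. Only a runtime that actually skips zeros can approach the "ideal sparse" figure, and real sparse kernels add indexing overhead on top of it.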

The companies that have made pruning work at production scale — and there are several doing it well — generally pair structured pruning with fine-tuning on domain-specific data to recover quality, then rebuild or re-profile the serving stack for the pruned architecture. This is a significant engineering investment that only makes sense at certain scale thresholds. A team doing millions of inferences per day has a very different cost structure than one doing thousands, and the ROI calculation on pruning development effort looks different in each case.
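
A simple break-even calculation shows how sharply the answer depends on volume. Every number below (engineering cost, per-inference saving, daily volumes) is a made-up placeholder; substitute your own.

```python
def breakeven_days(
    engineering_cost: float,
    saving_per_inference: float,
    inferences_per_day: float,
) -> float:
    """Days of serving needed for per-inference savings to repay the engineering cost."""
    return engineering_cost / (saving_per_inference * inferences_per_day)


# Hypothetical inputs: $300k of pruning and serving-stack work, $0.0005 saved per inference.
ENGINEERING_COST = 300_000.0
SAVING_PER_INFERENCE = 0.0005

high_volume = breakeven_days(ENGINEERING_COST, SAVING_PER_INFERENCE, 5_000_000)
low_volume = breakeven_days(ENGINEERING_COST, SAVING_PER_INFERENCE, 5_000)

print(f"5M inferences/day: break-even in {high_volume:,.0f} days")
print(f"5k inferences/day: break-even in {low_volume:,.0f} days")
```

With these placeholder figures the high-volume team recovers the investment in about four months, while the low-volume team would need centuries. The model ignores ongoing maintenance and quality risk, both of which push the threshold higher.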

Distillation: the right use case is narrower than it looks

Knowledge distillation — training a smaller student model to match the outputs of a larger teacher model — is conceptually elegant and has produced some of the most capable small models available (the Phi family from Microsoft Research is a good public example of the approach). For deploying at scale, a well-distilled small model often dominates a heavily quantized large model on both cost and quality for a specific task distribution.

The constraint: distillation requires a training run. You need data, compute, and time. If your use case is narrow and stable — a specific document classification task, a specific code domain, a specific response format — distillation may be the highest-ROI compression strategy. If your use case is broad or changing, the training investment doesn't amortize as cleanly.

One of the teams we work with at TitanML has built infrastructure specifically around the distillation-to-deployment pipeline, automating the workflow from large model to task-specific small model to optimized serving artifact. The interesting insight from watching that build: the hard problem isn't the distillation itself; it's the evaluation harness that correctly measures whether the student model is within acceptable quality tolerance for the production task before deployment. Getting that right requires domain-specific test sets that generic benchmarks can't provide.
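
A minimal sketch of what such a pre-deployment gate might look like is below. It is not a description of any team's actual harness: the function names, the exact-match accuracy metric, and the 2-point tolerance are all assumptions, and the toy lookup "models" exist only to make the example runnable.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Exact-match accuracy of a model on a labeled domain test set."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def passes_quality_gate(
    teacher: Callable[[str], str],
    student: Callable[[str], str],
    examples: Sequence[Example],
    max_drop: float = 0.02,  # hypothetical tolerance: at most 2 points below the teacher
) -> Tuple[bool, float, float]:
    """Return (ok, teacher_accuracy, student_accuracy) for a domain-specific test set."""
    teacher_acc = accuracy(teacher, examples)
    student_acc = accuracy(student, examples)
    return (teacher_acc - student_acc) <= max_drop, teacher_acc, student_acc


# Toy stand-ins for real models and a real domain test set.
domain_test_set = [
    ("indemnification clause", "legal"),
    ("dosage schedule", "medical"),
    ("quarterly earnings", "financial"),
    ("force majeure", "legal"),
]
teacher_model = dict(domain_test_set).get
student_model = {**dict(domain_test_set), "force majeure": "financial"}.get

ok, teacher_acc, student_acc = passes_quality_gate(
    teacher_model, student_model, domain_test_set
)
print(f"teacher={teacher_acc:.2f} student={student_acc:.2f} deploy={ok}")
```

The logic is trivial; the value is entirely in the test set. A gate like this is only as trustworthy as the domain examples it runs on, which is why the harness rather than the training run tends to be the hard part.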

The compression-serving stack coupling problem

Something that gets insufficient attention: the choice of compression technique couples you to specific serving stack requirements. INT8 quantization requires INT8 GEMM support in your runtime. AWQ requires specific kernel support (currently best in vLLM and TGI with specific CUDA configurations). Speculative decoding with a small distilled draft model requires two models in memory simultaneously. Structured pruning requires a serving stack that supports non-standard architectures.

This means compression decisions aren't just model decisions — they're infrastructure decisions. Teams that make them in isolation (ML team chooses quantization format; infra team deploys to existing serving stack) frequently discover compatibility problems late, after significant work has been done on the model side. The teams that ship this well treat compression strategy and serving stack selection as a joint decision made early, not a handoff problem solved late.
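
One lightweight way to force the joint decision is to write the coupling down as data and check it before model work starts. The sketch below encodes the four dependencies mentioned above; the capability names and the example stack are hypothetical labels, not flags from any real runtime.

```python
from typing import Dict, Iterable, Set

# Technique -> serving-stack capabilities it depends on (labels are illustrative).
REQUIREMENTS: Dict[str, Set[str]] = {
    "int8_ptq": {"int8_gemm"},
    "awq_int4": {"awq_kernels"},
    "speculative_decoding": {"two_models_in_memory"},
    "structured_pruning": {"non_standard_architectures"},
}


def missing_capabilities(technique: str, stack_capabilities: Iterable[str]) -> Set[str]:
    """Capabilities the serving stack lacks for the given compression technique."""
    return REQUIREMENTS[technique] - set(stack_capabilities)


# Hypothetical existing stack: supports INT8 GEMM and nothing else on the list.
existing_stack = {"int8_gemm"}

for technique in REQUIREMENTS:
    gaps = missing_capabilities(technique, existing_stack)
    status = "ok" if not gaps else f"blocked, missing {sorted(gaps)}"
    print(f"{technique}: {status}")
```

A table like this is cheap to maintain and surfaces the incompatibility at planning time instead of at deployment time.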

We're not saying that organizations need to rebuild their serving infrastructure every time they evaluate a new compression technique. The point is that the compression decision space is not independent of the deployment environment, and treating it as such creates waste. The path to production-ready compressed models goes through the serving stack first, not after.

A note on hardware evolution and compression ROI

A genuine counterpoint to aggressive model compression: hardware efficiency is improving faster than most compression engineering pays back. The memory and compute efficiency gains from A100 to H100 are significant, and GB200 changes the equation further. If a compression project takes four months and delivers a 2x memory reduction, and the hardware refresh cycle is twelve months, the ROI calculation has to account for the baseline the new hardware would have delivered anyway.
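
The four-month, 2x, twelve-month scenario can be worked through directly. The sketch below assumes, purely for illustration, a $100k monthly serving bill, a hardware refresh that cuts baseline cost to 60% of its previous level, a 24-month horizon, and that a 2x memory reduction translates into halved serving cost.

```python
def compression_savings(
    monthly_cost: float,
    project_months: int,
    compression_factor: float,
    refresh_month: int,
    hardware_factor: float,
    horizon_months: int,
) -> float:
    """Total savings attributable to compression over the horizon.

    compression_factor: compressed cost as a fraction of uncompressed cost.
    hardware_factor: post-refresh baseline cost as a fraction of pre-refresh cost.
    """
    total = 0.0
    for month in range(horizon_months):
        # Baseline is what serving would cost this month without compression.
        baseline = monthly_cost * (hardware_factor if month >= refresh_month else 1.0)
        if month >= project_months:  # savings start once the project ships
            total += baseline * (1.0 - compression_factor)
    return total


# All inputs are hypothetical except the 4-month project, 2x reduction, 12-month refresh.
with_refresh = compression_savings(100_000, 4, 0.5, 12, 0.6, 24)
naive = compression_savings(100_000, 4, 0.5, 12, 1.0, 24)

print(f"savings ignoring hardware refresh:  ${naive:,.0f}")
print(f"savings accounting for the refresh: ${with_refresh:,.0f}")
```

Under these assumptions the refresh erodes roughly a quarter of the naive savings estimate. The compression still pays something after the refresh, but the figure to compare against engineering cost is the smaller one.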

This doesn't make compression investment wrong — there are cases where it clearly pencils out, especially at high request volumes where even small per-token cost improvements are meaningful at scale. But it does mean the build-vs-wait question is worth asking honestly before committing significant engineering time to a compression approach. The teams that navigate this well are the ones with a clear-eyed view of their request volume trajectory and a realistic model of hardware cost curves. That combination is rarer than it should be.