Why CI/CD for ML Pipelines Is the Right Abstraction

The mental model of "CI/CD for machine learning" gets dismissed as reductive by a lot of ML engineers. The argument goes: training pipelines aren't like software pipelines. Data isn't code. Evaluation is probabilistic, not binary. You can't just run a test suite and merge the model to main.

That argument is correct as far as it goes. But it misidentifies the problem CI/CD actually solves — and therefore misses why the abstraction is still right, even if the tooling has to be different.

When we backed Dagger in 2024, we were betting on a team that understood this distinction. The hypothesis was never "ML pipelines are software pipelines." It was that the operational discipline of CI/CD (deterministic artifact creation, reproducibility across environments, clear promotion gates, rollback capability) translates directly to the ML lifecycle, even though the implementation details are completely different.

What CI/CD actually solves (and why ML needs it)

Before containers and modern CI/CD, software deployment was a manual, environment-sensitive process. Builds worked on one machine and failed on another. Production diverged from staging in undocumented ways. Rollback meant restoring a backup, not reverting a commit. The "it works on my machine" failure mode was ubiquitous.

CI/CD solved this by encoding the production path as a reproducible, versioned artifact. The pipeline — not individual developer machines — became the authoritative builder. Environments became declarative. Promotion gates (tests, linting, security scans) became systematic rather than ad-hoc.

ML pipelines have every one of these problems, but worse. The "artifact" is a model checkpoint that might be 40 GB. The "environment" includes not just software dependencies but data snapshots, random seeds, hardware platform, and CUDA driver version. "Tests" are evaluation runs that take hours and produce metrics over a distribution, not pass/fail signals. "Staging" might mean a shadow deployment that runs against real user traffic at reduced weight.

The specific mechanisms of CI/CD don't map one-to-one. But the discipline applies directly: automate the path from source to production, make it reproducible, and put clear gates in front of every promotion. Teams without this discipline end up shepherding model checkpoints by hand, and the cost compounds: you can't audit why a model changed behavior last Tuesday because there's no deterministic record of which pipeline run produced it.
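To make "deterministic record" concrete, here is a minimal sketch of a run manifest. It is our illustration, not Dagger's API: the `RunManifest` class, its field names, and the sample values are all hypothetical. The idea is only that every input that determines a checkpoint gets written down and hashed, so two runs can be compared by fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Inputs that determine a checkpoint. All field names are illustrative."""

    code_commit: str    # version-control SHA of the pipeline code
    image_digest: str   # digest of the container image the step ran in
    data_snapshot: str  # identifier of the versioned training data
    seed: int           # random seed used for training
    hardware: str       # accelerator type the run executed on

    def fingerprint(self) -> str:
        # Sorted keys make the serialization stable, so identical inputs
        # always produce an identical hash.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Placeholder values, not real commits or digests.
a = RunManifest("3f2c1ab", "sha256:aaa", "snapshot-2024-05-01", 42, "a100")
b = RunManifest("3f2c1ab", "sha256:aaa", "snapshot-2024-05-01", 42, "a100")
c = RunManifest("3f2c1ab", "sha256:aaa", "snapshot-2024-05-02", 42, "a100")

print(a.fingerprint() == b.fingerprint())  # True: same inputs
print(a.fingerprint() == c.fingerprint())  # False: the data snapshot differs
```

Stored next to the checkpoint, a manifest like this turns "why did the model change last Tuesday" into a diff between two records.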

The container layer as common infrastructure

One technical choice that made Dagger's approach compelling: treating containers as the universal execution environment. Training steps, evaluation harnesses, data preprocessing, model registration — each runs inside a container with pinned dependencies. The container spec is checked into version control alongside the pipeline code.

This matters because it solves the environment divergence problem without requiring teams to adopt an entirely new workflow. Engineers who already know Docker don't need to learn a new paradigm. The DAG of pipeline steps maps cleanly onto a DAG of container executions. And because an image pinned by digest has the same contents wherever it is pulled, you get environment reproducibility at the container level cheaply. The caveat is that the build has to be pinned too: a build file that pulls a floating base tag or unpinned packages can produce a different image each time it runs. Host-level factors such as the GPU driver also sit outside the image. Even so, the container level is a significant portion of the total reproducibility problem.
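As a rough sketch of that mapping, the snippet below declares four steps, each with an image reference and its upstream dependencies, then derives an execution order. The step names, registry host, and digests are placeholders we made up, and the structure is an assumption for illustration rather than how Dagger represents pipelines. It uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: every step names a digest-pinned image and the steps
# it depends on. The image references are placeholders, not real images.
STEPS = {
    "preprocess": {"image": "registry.example/preprocess@sha256:1111", "needs": []},
    "train": {"image": "registry.example/train@sha256:2222", "needs": ["preprocess"]},
    "evaluate": {"image": "registry.example/evaluate@sha256:3333", "needs": ["train"]},
    "register": {"image": "registry.example/register@sha256:4444", "needs": ["evaluate"]},
}


def execution_order(steps: dict) -> list:
    """Return step names so that every step comes after the steps it needs."""
    graph = {name: set(spec["needs"]) for name, spec in steps.items()}
    return list(TopologicalSorter(graph).static_order())


def unpinned_steps(steps: dict) -> list:
    """List steps whose image is referenced by tag instead of by digest."""
    return [name for name, spec in steps.items() if "@sha256:" not in spec["image"]]


print(execution_order(STEPS))  # ['preprocess', 'train', 'evaluate', 'register']
print(unpinned_steps(STEPS))   # []
```

A check like `unpinned_steps` is cheap to run as a gate of its own, since a floating tag is the most common way container-level reproducibility quietly erodes.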

What doesn't get solved at the container layer: data versioning. The model checkpoint produced by a training run is a function of the code, the environment, and the training data. Version control and containers pin the first two; if your data snapshot isn't versioned, you still can't reproduce the artifact. This is where the tooling ecosystem around DVC, Delta Lake, and data contract frameworks plugs in. The pipeline tool and the data versioning tool are different concerns, and teams that conflate them tend to over-engineer one and ignore the other.
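Dedicated tools handle storage, lineage, and time travel, which this does not attempt. But the core requirement, that a training run can name exactly which bytes it consumed, can be sketched with a content hash over a data directory. The `dataset_fingerprint` helper below is hypothetical and uses only the standard library; it is not how DVC or Delta Lake identify data.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """Hash relative paths and file contents in sorted order (illustrative helper)."""
    digest = hashlib.sha256()
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    for rel_path, path in files:
        file_hash = hashlib.sha256()
        with path.open("rb") as fh:
            # Stream in 1 MiB chunks so large files are never loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                file_hash.update(chunk)
        digest.update(f"{rel_path}\0{file_hash.hexdigest()}\n".encode("utf-8"))
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.csv").write_text("x,y\n1,2\n")
    before = dataset_fingerprint(root)
    unchanged = dataset_fingerprint(root)
    (root / "train.csv").write_text("x,y\n1,3\n")  # a single label changes
    after = dataset_fingerprint(root)

print(before == unchanged)  # True: same bytes, same fingerprint
print(before != after)      # True: any change to the data changes it
```

Whatever produces the identifier, the point is that it gets recorded with the pipeline run, so the data is pinned as firmly as the code and the image.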

Evaluation gates: the hard problem

In software CI/CD, a test either passes or fails. The gate is binary. In ML pipelines, evaluation produces a distribution of metrics, and "passes" means "is better than the current production model on these metrics by more than this threshold."

The threshold question is deceptively hard. A naive approach sets a single primary metric (BLEU score, F1, accuracy) and promotes the model if it improves. This fails in practice because improvements on one metric can mask regressions on others — a model that's 2% better on average accuracy might be significantly worse on the tail distribution that matters for your use case. A recommendation system that improves CTR might degrade session quality metrics that weren't in the gate definition.

The better approach defines a promotion gate as a vector of metrics with both minimum thresholds (no regression on these signals) and target improvements (this model gets promoted if it meets all minimums AND improves on this primary metric). Building this evaluation framework well — with the right holdout sets, the right business-aligned metrics alongside model metrics, and the right escalation path when the signals are mixed — is a significant engineering investment. It's also where most teams cut corners under shipping pressure.
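A gate of that shape fits in a few lines. The sketch below assumes every metric is "higher is better" and that thresholds are absolute differences; the `Gate` class, the metric names, and the numbers are all hypothetical. It returns the reasons for a rejection along with the decision, because a bare "no" is not much of an audit trail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Promotion rule: guardrails that must not regress, plus a primary target."""

    primary: str              # metric that must improve
    min_primary_gain: float   # required absolute gain on the primary metric
    max_regression: dict      # guardrail metric -> largest tolerated drop


def evaluate_gate(gate: Gate, candidate: dict, production: dict):
    """Return (promote, reasons). Assumes higher is better for every metric."""
    reasons = []
    for metric, allowed_drop in gate.max_regression.items():
        drop = production[metric] - candidate[metric]
        if drop > allowed_drop:
            reasons.append(f"{metric} regressed by {drop:.3f} (allowed {allowed_drop:.3f})")
    gain = candidate[gate.primary] - production[gate.primary]
    if gain < gate.min_primary_gain:
        reasons.append(f"{gate.primary} gained {gain:.3f} (needs {gate.min_primary_gain:.3f})")
    return (not reasons, reasons)


gate = Gate(
    primary="accuracy",
    min_primary_gain=0.01,
    max_regression={"tail_accuracy": 0.01, "session_quality": 0.0},
)
production = {"accuracy": 0.90, "tail_accuracy": 0.80, "session_quality": 0.70}

# Better on average, worse on the tail: exactly the case a single metric hides.
masked = {"accuracy": 0.92, "tail_accuracy": 0.74, "session_quality": 0.70}
print(evaluate_gate(gate, masked, production))
# (False, ['tail_accuracy regressed by 0.060 (allowed 0.010)'])

clean = {"accuracy": 0.92, "tail_accuracy": 0.80, "session_quality": 0.71}
print(evaluate_gate(gate, clean, production))
# (True, [])
```

The code is the easy part. Choosing which metrics belong in `max_regression`, and on which holdout sets they are computed, is where the engineering investment goes.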

We're not saying that solving the evaluation gate problem makes ML pipelines equivalent to software pipelines. The probabilistic nature is real. But the alternative — promoting models on vibes and manual review — doesn't scale past a certain deployment cadence, and it makes incident response nearly impossible because you lose the audit trail.

Deployment as a first-class pipeline stage

The other place where ML CI/CD implementations fall short: treating deployment as outside the pipeline scope. Training and evaluation get automated; deployment is still a manual step where someone runs a script on their laptop and updates a config file in production.

Manual deployment breaks down because deployment for ML is genuinely more complex than for typical software. Shadow deployment, traffic splitting for A/B evaluation, canary rollouts, and rollback all require coordination between the model registry, the serving layer, and the monitoring system. That coordination is too intricate and too error-prone to perform by hand at any reasonable deployment frequency.

What changes when deployment is in the pipeline: the promotion gate in evaluation now connects directly to deployment automation. A model that passes its gates gets queued for production rollout. The rollout follows a defined protocol: say, 5% traffic for 4 hours, check for serving error rate and business metric regressions, then advance to 25%, then to full rollout. If any check fails, the pipeline automatically rolls back to the previous checkpoint.
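A protocol like that can be written down as data plus a small driver loop. In the sketch below, `Stage`, `run_rollout`, and the three injected callables are hypothetical stand-ins for whatever the serving layer and monitoring system actually expose. Only the 5% for 4 hours figure comes from the example above; the other soak times are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    traffic_percent: int
    soak_hours: float


# 5% for 4 hours is from the example protocol; the later soak times are assumed.
PROTOCOL = (Stage(5, 4), Stage(25, 4), Stage(100, 0))


def run_rollout(
    stages: Sequence[Stage],
    set_traffic: Callable[[int], None],    # shifts traffic to the new model
    wait: Callable[[float], None],         # blocks for the soak period
    checks_pass: Callable[[Stage], bool],  # error-rate and business-metric checks
) -> bool:
    """Advance stage by stage; on any failed check, send traffic back to 0%."""
    for stage in stages:
        set_traffic(stage.traffic_percent)
        wait(stage.soak_hours)
        if not checks_pass(stage):
            set_traffic(0)  # roll back: the previous checkpoint serves everything
            return False
    return True


# Dry run with fakes: record traffic changes and fail the checks at 25%.
history = []
promoted = run_rollout(
    PROTOCOL, history.append, lambda hours: None, lambda s: s.traffic_percent < 25
)
print(promoted, history)  # False [5, 25, 0]

healthy = []
print(run_rollout(PROTOCOL, healthy.append, lambda hours: None, lambda s: True), healthy)
# True [5, 25, 100]
```

Passing the side effects in as callables keeps the protocol testable without a cluster, and it keeps the rollout policy in version control next to the gates that feed it.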

This closes the loop that traditional CI/CD closed for software: the pipeline, not the individual engineer, is the authoritative promotion mechanism. Engineers define the gates and protocols; the system executes them. That's the discipline, not the specific tooling.