MLOps Best Practices: What Keeps Models Running in Production
A practitioner's guide to mlops best practices — from CI/CD pipeline automation and model versioning to drift detection and continuous retraining — based on what breaks at 3am.
The phrase “mlops best practices” covers a lot of ground, and most writeups present them as a checklist. This one doesn’t. Instead: here’s what each practice is actually protecting against, and what breaks when you skip it.
Google’s MLOps maturity framework ↗ describes three levels of pipeline maturity, from manual notebook-driven workflows to fully automated CI/CD pipelines for ML. Most organizations that have “deployed a model” are sitting at Level 0 — which means a data scientist handed off a pickle file, an engineer wrapped it in Flask, and nobody set up monitoring. This is not a criticism; it’s just the default outcome when teams optimize for shipping a model rather than operating one.
The good news: you don’t need to reach Level 2 to run a reliable production ML system. You need a handful of specific practices in place, in the right order.
Version Everything — Code, Data, and Models Together
The most common reproducibility failure is a model that can’t be retrained to its original performance because nobody logged which dataset version, preprocessing script, and hyperparameters produced it. Six months later, when you need to retrain because distribution has shifted, you’re starting over.
The fix is treating code, data, and model artifacts as a versioned unit. Git handles code. DVC or lakeFS handles dataset versions in object storage. MLflow or Weights & Biases handles experiment runs — parameters, metrics, and the artifact pointer that ties them together. The model registry is the handoff point: a registered model with a lineage trace back to its training run and dataset version.
Without this, your pipeline has no memory. With it, you can answer: “What data produced the model currently serving 40% of traffic, and what was its F1 score on the holdout set?” That question gets asked more often than you’d expect.
Automate the Pipeline, Not Just the Training Script
Level 1 in Google’s maturity model means your ML pipeline is orchestrated end-to-end — data extraction, validation, preprocessing, training, model validation, and deployment — triggered by a schedule, new data, or a detected performance drop. Not a cron job that kicks off a training script. The whole pipeline.
Two gates matter most:
Data validation runs before training. If your input schema has changed, if a feature’s statistical distribution has shifted beyond a threshold, or if expected values are missing, the pipeline fails fast with a clear error rather than silently training a degraded model.
Model validation runs after training. The new model must beat the current production model on a holdout set before it gets promoted. No exceptions. This is the difference between continuous training and continuous gambling.
Airflow, Prefect, and Dagster are the common orchestration choices. Kubeflow Pipelines if you’re Kubernetes-native. The tool matters less than enforcing the validation gates.
Monitor Drift Before It Kills Your Metrics
Production models degrade. The question is whether you find out from a monitoring alert or from a product manager noticing a KPI drop.
Two drift types matter operationally:
Data drift — the distribution of your input features shifts away from the training distribution. A model trained on last year’s user behavior gets production requests from a different user population. It still runs. It just stops being accurate.
Concept drift — the underlying relationship between features and target changes. A fraud detection model trained before a new payment method launched sees the new transaction patterns as anomalies, then adapts wrong.
The monitoring setup recommended by Evidently AI ↗ starts simple: compute a baseline distribution from your training data for key features (mean, variance, category proportions), then continuously compare production data against that baseline using statistical tests — Population Stability Index (PSI), Kolmogorov-Smirnov tests, or Jensen-Shannon divergence. When a feature’s drift score crosses a threshold, alert.
Start with ad-hoc drift reports to understand what “normal” drift looks like for your specific model and data. Then move to scheduled reports via Airflow or Prefect, aggregated into a dashboard. Add alerting last, once you understand what threshold actually warrants a page versus just a Slack message. For a deeper rundown of monitoring patterns, alerting thresholds, and tooling comparisons, mlmonitoring.report ↗ covers this continuously.
Define Retraining Triggers Before You Need Them
The worst time to figure out your retraining policy is when your model is already degraded. Define it upfront:
- Scheduled retraining: every N days, regardless of observed drift. Simple, predictable, doesn’t require a monitoring system. Works for slowly-changing domains.
- Performance-triggered retraining: retrain when a model performance metric — accuracy, precision, AUC — drops below a threshold on your evaluation set or on labeled production samples.
- Drift-triggered retraining: retrain when input distributions shift beyond a threshold, even before you see performance degrade. This is predictive rather than reactive, but it requires confidence in your drift detection.
Most production systems use a combination: scheduled retraining as a floor, drift/performance triggers as an override. The key is that retraining goes through the same validation pipeline as the original training run — not a shortcut that skips the model validation gate.
Treat Infrastructure as Code
Manually configured training clusters and serving infrastructure are a reliability liability. When something breaks, you’re debugging both the model and the environment simultaneously. Terraform, Kubernetes manifests, and containerized pipeline components give you reproducible infrastructure that can be version-controlled, reviewed, and rolled back.
Containerizing pipeline components also prevents training-serving skew — the silent killer where the preprocessing logic in your training pipeline diverges from the preprocessing logic in your serving layer. Same container image, same code, same behavior.
For teams evaluating which MLOps platforms handle infrastructure management well, mlopsplatforms.com ↗ maintains updated comparisons of managed offerings versus open-source stacks.
What to Instrument First
If you’re starting from scratch, the order is: versioning first (you need lineage before anything else), then pipeline orchestration with validation gates, then drift monitoring on your top-five most predictive features before you try to monitor everything. Add retraining automation after you’ve observed at least one real drift event and understand what triggered it.
The one metric worth putting on a dashboard before anything else: the distribution distance between your training data and current production data, updated daily. When that number starts climbing, everything else follows.
Sources
MLOps: Continuous Delivery and Automation Pipelines in Machine Learning ↗ — Google Cloud’s canonical MLOps maturity framework covering Levels 0, 1, and 2 — from manual deployments to fully automated CI/CD pipelines. Defines the key components of production ML pipelines including data validation, model validation, and continuous training triggers.
How to Start with ML Model Monitoring ↗ — Evidently AI’s step-by-step guide to building an ML monitoring pipeline, covering drift detection methods, metric selection, dashboard setup, and progressive automation from ad-hoc reports to alerting. Practical and tool-specific.
8 MLOps Best Practices You Should Implement in 2026 ↗ — Practitioner-focused breakdown of eight core MLOps practices: versioning, CI/CD automation, post-deployment monitoring, governance, reproducibility, infrastructure-as-code, retraining planning, and cross-team collaboration. Published February 2026.
Sources
SentryML — in your inbox
ML observability & MLOps — model monitoring, drift detection, debugging in production. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
LLM Benchmarks in 2026: Which Still Discriminate, and How to Run
Static benchmarks like MMLU and HumanEval have saturated for frontier models. Here's which LLM benchmarks still produce signal, why contamination is worse than reported, and how to run your own reproducible evaluation with lm-evaluation-harness.
LLM Fine Tuning: Methods, Training Data, and Evaluation
A practitioner's guide to llm fine tuning — how to pick between SFT, LoRA, and DPO, what your training data actually needs, and how to validate a fine-tuned model before it hits production.
ML Testing: A Checklist from Pre-Train Checks to Production Drift
ML testing spans pre-train sanity checks, behavioral validation, data integrity, and continuous drift monitoring. Here's what actually belongs in your CI pipeline and runbook.