MLOps Best Practices: What Keeps Models Running in Production

The phrase “mlops best practices” covers a lot of ground, and most writeups present them as a checklist. This one doesn’t. Instead: here’s what each practice is actually protecting against, and what breaks when you skip it.

Google’s MLOps maturity framework ↗ describes three levels of pipeline maturity, from manual notebook-driven workflows to fully automated CI/CD pipelines for ML. Most organizations that have “deployed a model” are sitting at Level 0 — which means a data scientist handed off a pickle file, an engineer wrapped it in Flask, and nobody set up monitoring. This is not a criticism; it’s just the default outcome when teams optimize for shipping a model rather than operating one.

The good news: you don’t need to reach Level 2 to run a reliable production ML system. You need a handful of specific practices in place, in the right order.

Version Everything — Code, Data, and Models Together

The most common reproducibility failure is a model that can’t be retrained to its original performance because nobody logged which dataset version, preprocessing script, and hyperparameters produced it. Six months later, when you need to retrain because distribution has shifted, you’re starting over.

The fix is treating code, data, and model artifacts as a versioned unit. Git handles code. DVC or lakeFS handles dataset versions in object storage. MLflow or Weights & Biases handles experiment runs — parameters, metrics, and the artifact pointer that ties them together. Our MLOps tools map covers the options in each of these categories. The model registry is the handoff point: a registered model with a lineage trace back to its training run and dataset version.

Without this, your pipeline has no memory. With it, you can answer: “What data produced the model currently serving 40% of traffic, and what was its F1 score on the holdout set?” That question gets asked more often than you’d expect.

Automate the Pipeline, Not Just the Training Script

Level 1 in Google’s maturity model means your ML pipeline is orchestrated end-to-end — data extraction, validation, preprocessing, training, model validation, and deployment — triggered by a schedule, new data, or a detected performance drop. Not a cron job that kicks off a training script. The whole pipeline, with each stage and its failure points laid out in our machine learning pipeline guide.

Two gates matter most:

Data validation runs before training. If your input schema has changed, if a feature’s statistical distribution has shifted beyond a threshold, or if expected values are missing, the pipeline fails fast with a clear error rather than silently training a degraded model.

Model validation runs after training. The new model must beat the current production model on a holdout set before it gets promoted. No exceptions. This is the difference between continuous training and continuous gambling.

Airflow, Prefect, and Dagster are the common orchestration choices. Kubeflow Pipelines if you’re Kubernetes-native. The tool matters less than enforcing the validation gates.

Monitor Drift Before It Kills Your Metrics

Production models degrade. The question is whether you find out from a monitoring alert or from a product manager noticing a KPI drop.

Two drift types matter operationally:

Data drift — the distribution of your input features shifts away from the training distribution. A model trained on last year’s user behavior gets production requests from a different user population. It still runs. It just stops being accurate.

Concept drift — the underlying relationship between features and target changes. A fraud detection model trained before a new payment method launched sees the new transaction patterns as anomalies, then adapts wrong.

The monitoring setup recommended by Evidently AI ↗ starts simple: compute a baseline distribution from your training data for key features (mean, variance, category proportions), then continuously compare production data against that baseline using statistical tests — Population Stability Index (PSI), Kolmogorov-Smirnov tests, or Jensen-Shannon divergence. When a feature’s drift score crosses a threshold, alert. The full set of signals worth tracking is laid out in our model monitoring guide and the drift, data-quality, and decay metrics taxonomy.

Start with ad-hoc drift reports to understand what “normal” drift looks like for your specific model and data. Then move to scheduled reports via Airflow or Prefect, aggregated into a dashboard. Add alerting last, once you understand what threshold actually warrants a page versus just a Slack message. For a deeper rundown of monitoring patterns, alerting thresholds, and tooling comparisons, mlmonitoring.report ↗ covers this continuously.

Define Retraining Triggers Before You Need Them

The worst time to figure out your retraining policy is when your model is already degraded. Define it upfront:

Scheduled retraining: every N days, regardless of observed drift. Simple, predictable, doesn’t require a monitoring system. Works for slowly-changing domains.
Performance-triggered retraining: retrain when a model performance metric — accuracy, precision, AUC — drops below a threshold on your evaluation set or on labeled production samples.
Drift-triggered retraining: retrain when input distributions shift beyond a threshold, even before you see performance degrade. This is predictive rather than reactive, but it requires confidence in your drift detection.

Most production systems use a combination: scheduled retraining as a floor, drift/performance triggers as an override. The key is that retraining goes through the same validation pipeline as the original training run — not a shortcut that skips the model validation gate.

Treat Infrastructure as Code

Manually configured training clusters and serving infrastructure are a reliability liability. When something breaks, you’re debugging both the model and the environment simultaneously. Terraform, Kubernetes manifests, and containerized pipeline components give you reproducible infrastructure that can be version-controlled, reviewed, and rolled back.

Containerizing pipeline components also prevents training-serving skew — the silent killer where the preprocessing logic in your training pipeline diverges from the preprocessing logic in your serving layer. Same container image, same code, same behavior.

For teams evaluating which MLOps platforms handle infrastructure management well, mlopsplatforms.com ↗ maintains updated comparisons of managed offerings versus open-source stacks.

What to Instrument First

If you’re starting from scratch, the order is: versioning first (you need lineage before anything else), then pipeline orchestration with validation gates, then drift monitoring on your top-five most predictive features before you try to monitor everything. Add retraining automation after you’ve observed at least one real drift event and understand what triggered it.

The one metric worth putting on a dashboard before anything else: the distribution distance between your training data and current production data, updated daily. When that number starts climbing, everything else follows.

Sources

MLOps: Continuous Delivery and Automation Pipelines in Machine Learning ↗ — Google Cloud’s canonical MLOps maturity framework covering Levels 0, 1, and 2 — from manual deployments to fully automated CI/CD pipelines. Defines the key components of production ML pipelines including data validation, model validation, and continuous training triggers.

How to Start with ML Model Monitoring ↗ — Evidently AI’s step-by-step guide to building an ML monitoring pipeline, covering drift detection methods, metric selection, dashboard setup, and progressive automation from ad-hoc reports to alerting. Practical and tool-specific.

8 MLOps Best Practices You Should Implement in 2026 ↗ — Practitioner-focused breakdown of eight core MLOps practices: versioning, CI/CD automation, post-deployment monitoring, governance, reproducibility, infrastructure-as-code, retraining planning, and cross-team collaboration. Published February 2026.

MLOps Best Practices: What Keeps Models Running in Production

Version Everything — Code, Data, and Models Together

Automate the Pipeline, Not Just the Training Script

Monitor Drift Before It Kills Your Metrics

Define Retraining Triggers Before You Need Them

Treat Infrastructure as Code

What to Instrument First

Sources

Sources

SentryML — in your inbox

Related

Machine Learning Pipeline: Stages, Failure Points, and Monitoring

OpenAI Tops Gartner's Coding-Agent Quadrant. Now You Own a Production ML System.

The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay

Comments