ML Testing: A Checklist from Pre-Train Checks to Production Drift
ML testing spans pre-train sanity checks, behavioral validation, data integrity, and continuous drift monitoring. Here's what actually belongs in your CI pipeline and runbook.
Most ML testing failures are silent. The pipeline runs green, the loss converges, and the model gets promoted to production. Then accuracy quietly erodes over the next six weeks because nobody checked whether the input distribution shifted, or whether the model still handles the edge cases it handled six months ago. Occasionally, the model was wrong from day one but the offline metrics looked fine.
ML testing is not a single gate before deployment. It’s a layered practice that spans pre-training sanity checks, post-training behavioral validation, data integrity enforcement, and continuous production monitoring. Here’s what each layer looks like and what to put in your runbook.
Pre-Train Checks: Catch Problems Before the GPU Bill
Pre-train tests run before you commit to a full training job. They’re cheap to execute and catch a surprisingly high share of bugs.
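Output shape validation. Confirm that your model’s output tensor shape is what the loss expects given your label shape. Some mismatches raise an error immediately; others broadcast silently. For example, an elementwise loss between predictions of shape (batch, 1) and labels of shape (batch,) can broadcast to (batch, batch) and produce a wrong loss that still decreases during training, so gradient descent confidently optimizes the wrong objective.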
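A minimal sketch of a shape assertion, assuming shapes are available as plain tuples (most frameworks expose something like `tuple(tensor.shape)`). The helper name `assert_shape` and the `None` wildcard for the batch dimension are illustrative choices, and which shapes are correct depends on the loss you use.

```python
def assert_shape(name, actual, expected):
    """Raise ValueError unless `actual` matches `expected`.

    A None entry in `expected` matches any size (useful for the batch dimension).
    """
    actual = tuple(actual)
    ok = len(actual) == len(expected) and all(
        e is None or a == e for a, e in zip(actual, expected)
    )
    if not ok:
        raise ValueError(f"{name}: expected shape {expected}, got {actual}")
```

Declaring the expected shape of both outputs and labels explicitly, right before the loss is computed, turns a silent mismatch into an immediate failure.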
Loss sanity check. For a randomly initialized model, the starting loss should be close to the theoretical value for uniform predictions: ln(num_classes) for cross-entropy classification, which works out to ln(2) ≈ 0.693 for binary cross-entropy. A value far outside this range almost always means a data preprocessing bug before training has even started.
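The check can be sketched in plain Python, assuming a multi-class model whose untrained logits are roughly equal across classes. The function names and the 20% relative tolerance are assumptions for illustration, not recommended values.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def expected_initial_loss(num_classes):
    """Loss of a model that assigns equal probability to every class."""
    return math.log(num_classes)

def check_initial_loss(observed, num_classes, rel_tol=0.2):
    """True if the observed starting loss is within rel_tol of ln(num_classes)."""
    target = expected_initial_loss(num_classes)
    return abs(observed - target) <= rel_tol * target
```

With all-zero logits over 10 classes, `cross_entropy` equals `math.log(10)`, which is the value the first training step should report, give or take the tolerance.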
Single-batch overfit check. Train for 20–50 iterations on a single batch with regularization disabled. The model should memorize that batch and drive loss to near-zero. If it can’t, the architecture or optimizer is broken before you’ve spent real compute.
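A minimal sketch of the single-batch overfit check, using a tiny logistic regression in plain Python as a stand-in for a real model. The learning rate, step count, and any pass threshold are illustrative assumptions; in practice you would run your own model and optimizer on one real batch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(w, b, xs, ys, eps=1e-12):
    """Mean binary cross-entropy of a logistic model on one batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

def overfit_single_batch(xs, ys, steps=50, lr=1.0):
    """Run plain gradient descent on one batch, no regularization.

    Returns the loss after each step so the caller can assert on the trend.
    """
    w, b, n = [0.0] * len(xs[0]), 0.0, len(xs)
    losses = []
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
        losses.append(bce(w, b, xs, ys))
    return losses
```

The assertion to encode is on the trend and the end point: the last loss should be far below the first and close to zero for a batch the model is able to memorize.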
Data leakage scan. Confirm row indices in your validation and test splits don’t overlap with training. Time-series data requires temporal splits; shuffled splits on time-series data are a common source of offline metrics that collapse entirely at deployment.
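Both leakage checks reduce to a few lines, assuming each row has a stable identifier and, for time-series data, a comparable timestamp. The function names here are hypothetical.

```python
def find_index_overlap(train_ids, val_ids, test_ids):
    """Return the row ids shared between each pair of splits."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    return {
        "train_val": train & val,
        "train_test": train & test,
        "val_test": val & test,
    }

def is_temporal_split(train_times, eval_times):
    """True if every evaluation timestamp is strictly after every training one."""
    return max(train_times) < min(eval_times)
```

A CI test would assert that every set returned by `find_index_overlap` is empty and that `is_temporal_split` holds for time-ordered data.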
Made With ML’s MLOps course recommends encoding these as pytest fixtures that run in CI before any training job is dispatched, keeping failure feedback under a minute rather than hours. The Arrange-Act-Assert pattern maps cleanly: arrange the model and inputs, act by running a forward pass or a short training loop, assert on shape, loss value, or gradient norm.
Behavioral Tests: Treat the Model as a Black Box
After training, switch from introspective checks to behavioral ones. Behavioral tests treat the model as a black box and assert what it should do, not how it’s built internally.
Jeremy Jordan’s framework identifies three categories that belong in every model test suite:
Invariance tests. Inputs that differ in ways that shouldn’t affect prediction should produce the same output. A sentiment classifier shouldn’t flip from positive to negative because you replaced a character name. A fraud model shouldn’t change its score because you reformatted a phone number. Encode these as parameterized pytest cases with a fixed tolerance on the output delta.
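A minimal sketch of an invariance check. The keyword-counting `toy_sentiment_score` is a hypothetical stand-in for a real model, and the template and names are invented for illustration.

```python
POSITIVE = {"great", "loved", "wonderful"}
NEGATIVE = {"boring", "hated", "awful"}

def toy_sentiment_score(text):
    """Hypothetical stand-in model: score in [-1, 1] from keyword counts."""
    words = text.lower().replace(".", " ").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def check_invariance(predict, template, names, tolerance=1e-6):
    """Compare every substitution against the first one.

    Returns (name, delta) pairs whose output moved by more than `tolerance`.
    """
    base = predict(template.format(name=names[0]))
    failures = []
    for name in names[1:]:
        delta = abs(predict(template.format(name=name)) - base)
        if delta > tolerance:
            failures.append((name, delta))
    return failures
```

In a pytest suite, each template could become one case under `pytest.mark.parametrize`, with the test asserting that `check_invariance` returns an empty list.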
Directional expectation tests. Some input changes should push the output in a predictable direction without asserting an exact value. A house price model should predict a higher price if you add a bathroom, holding all else equal. A churn model should predict higher churn if you zero out a customer’s last-login date. These tests are stable across model versions because they test direction, not magnitude.
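A directional check can be sketched as a perturb-and-compare helper. The linear `toy_price_model` and its coefficients are invented for illustration and stand in for a real predictor.

```python
def toy_price_model(features):
    """Hypothetical linear price model; coefficients are made up."""
    return (
        50_000
        + 150 * features["sqft"]
        + 12_000 * features["bathrooms"]
        - 800 * features["age_years"]
    )

def check_direction(predict, base, feature, delta, expected="increase"):
    """Perturb one feature, hold the rest fixed, and check the output direction."""
    perturbed = dict(base)
    perturbed[feature] = base[feature] + delta
    before, after = predict(base), predict(perturbed)
    return after > before if expected == "increase" else after < before
```

The test asserts only the sign of the change, so it keeps passing when a retrained model shifts the magnitude.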
Minimum functionality tests. Curate a labeled dataset of cases the model must never get wrong: high-confidence easy examples, high-stakes edge cases, and every failure mode you’ve already found in production. Run this suite on every model candidate before promotion. Think of it as a regression suite for model behavior. Jordan’s key distinction: model evaluation summarizes aggregate metrics; model testing makes explicit assertions about specific behaviors. Both are necessary.
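A minimal sketch of a minimum functionality suite runner, assuming each curated case is an (input, expected label, tag) triple. The case format, the runner name, and the rule-based `toy_fraud_model` are assumptions for illustration.

```python
def run_minimum_functionality_suite(predict, cases):
    """Run every must-pass case and collect the failures.

    `cases` is a list of (inputs, expected_label, tag) triples.
    An empty return value means the candidate may be promoted.
    """
    failures = []
    for inputs, expected, tag in cases:
        got = predict(inputs)
        if got != expected:
            failures.append(
                {"tag": tag, "input": inputs, "expected": expected, "got": got}
            )
    return failures

def toy_fraud_model(txn):
    """Hypothetical stand-in model: flags large transactions only."""
    return "fraud" if txn["amount"] > 10_000 else "ok"
```

Tagging each case with the skill or incident it came from makes a failure report readable without opening the dataset.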
Organize tests by “skill” rather than by code path. A suite called test_robustness_to_missing_fields is more maintainable six months later than test_model_forward_pass_v2. For models exposed to adversarial inputs, adversarialml.dev tracks attack patterns and defenses that can inform what belongs in your invariance suite.
Data Validation: The Layer Most Pipelines Skip
Model behavioral tests check model outputs. Data validation checks that inputs are sane before they reach the model.
Use Great Expectations or a similar schema-enforcement library to assert:
- No null values in required columns
- Numerical features within expected ranges (catches upstream pipeline bugs before they corrupt a training run or inference batch)
- Categorical features contain only known labels — unseen categories in production are a common silent failure in tree-based models
- No duplicate primary keys in training data
Add these checks at two points: as a pre-training gate in CI, and as a pre-inference gate in your serving pipeline. The pre-training gate catches problems before they corrupt the model. The pre-inference gate catches distribution problems before they corrupt live predictions. Skipping the second gate is where most pipelines drop the ball.
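The four assertions above can be sketched in plain Python over a list of row dictionaries. This is not the Great Expectations API; the function name `validate_rows` and its parameters are hypothetical, and a schema library would replace it in a real pipeline.

```python
def validate_rows(rows, required, ranges, allowed, key):
    """Return a list of human-readable violations; empty means the batch passes.

    required: column names that must not be None
    ranges:   {column: (low, high)} inclusive bounds for numeric columns
    allowed:  {column: set of known labels} for categorical columns
    key:      primary-key column that must be unique
    """
    problems, seen = [], set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: null in required column '{col}'")
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: '{col}'={value} outside [{low}, {high}]")
        for col, labels in allowed.items():
            value = row.get(col)
            if value is not None and value not in labels:
                problems.append(f"row {i}: unknown label {value!r} in '{col}'")
        if row.get(key) in seen:
            problems.append(f"row {i}: duplicate key {row.get(key)!r}")
        seen.add(row.get(key))
    return problems
```

The same function can back both gates: fail the CI job when the list is non-empty before training, and reject or quarantine the batch when it is non-empty before inference.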
Production Monitoring: Drift Detection as Continuous ML Testing
The hardest part of ML testing happens after deployment. Offline test results don’t transfer perfectly to production, and the production distribution drifts over time.
Data drift occurs when the statistical properties of production inputs diverge from training data. Evidently AI’s documentation covers the main detection approaches:
- Summary statistics monitoring: Track mean, median, and variance per feature on a rolling window. Alert when a feature’s rolling mean moves more than two reference-period standard deviations away from the reference mean. Simple, low-compute, and catches large shifts quickly.
- Statistical tests: Kolmogorov-Smirnov for numerical features, chi-square for categorical. Both produce p-values. On large datasets these tests become oversensitive: with enough samples, even a tiny shift (on the order of 0.001) will be flagged as statistically significant even if it has no practical impact on the model.
- Distance metrics: Wasserstein distance, Jensen-Shannon divergence, and Population Stability Index (PSI) give a continuous drift score rather than binary pass/fail. PSI > 0.2 is a common production alert threshold. These scale better on high-volume inference than statistical hypothesis tests.
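As an illustration of the distance-metric approach, PSI sums (current share - reference share) * ln(current share / reference share) over bins. The sketch below assumes equal-width bins taken from the reference range and a small floor on empty bins; both are implementation choices, not part of a standard.

```python
import math

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    Bin edges come from the reference range; values outside it are
    clamped into the first or last bin. Empty bins are floored at `eps`
    so the logarithm stays defined.
    """
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - low) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples score zero, and the score grows as mass moves between bins, which is what makes a fixed alert threshold such as 0.2 usable.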
For batch inference, run drift reports on each batch before serving and log the aggregate drift score to your observability stack. For real-time inference, compute rolling statistics on a sliding window (e.g., last 1,000 requests) and alert when the drift score crosses threshold.
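A minimal sketch of the sliding-window approach for real-time inference. It assumes the alert rule is "rolling mean more than k reference standard deviations from the reference mean"; the class name, the default window size, and k are illustrative.

```python
from collections import deque
import statistics

class RollingDriftMonitor:
    """Track one feature over the last `window` requests."""

    def __init__(self, reference, window=1000, k=2.0):
        self.ref_mean = statistics.fmean(reference)
        self.ref_std = statistics.pstdev(reference)
        self.values = deque(maxlen=window)  # oldest value drops out automatically
        self.k = k

    def observe(self, value):
        """Record one value; return True if the full window has drifted."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet to compare fairly
        # Recomputing the mean is O(window); a running sum would be cheaper.
        rolling_mean = statistics.fmean(self.values)
        return abs(rolling_mean - self.ref_mean) > self.k * self.ref_std
```

One monitor per feature is the simplest layout; the boolean from `observe` can feed whatever alerting hook the serving stack already has.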
mlmonitoring.report and mlobserve.com both track tooling in this space. Open-source options such as Evidently and Deepchecks cover most use cases without requiring a platform purchase. Commercial platforms like Arize AI, Fiddler, and WhyLabs add label feedback loops and alerting integrations that matter at scale, once you’re operating multiple models with delayed ground truth.
One critical point from Fiddler’s overview: input drift is often detectable well before you see a drop in business metrics. Monitoring feature distributions lets you catch problems in days rather than weeks, before silent degradation compounds into an incident.
What Goes in the Runbook
A minimal ML testing runbook has three gates:
Gate 1 — Pre-training (CI): Pre-train sanity checks + data validation suite. No training job dispatches if this fails. Runtime under 2 minutes.
Gate 2 — Pre-promotion (staging): Behavioral test suite (invariance + directional + minimum functionality) + offline evaluation against a held-out test set. Candidates must pass both. Failures block the promotion, not just raise a warning.
Gate 3 — Production monitoring (ongoing): Drift score per feature, aggregate drift alert (PSI threshold), performance metric tracking where labels are available. Page on-call when PSI > 0.2 on any feature in the top-10 by feature importance, or when prediction distribution shifts more than 15% from the reference window.
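The Gate 3 paging rule can be sketched as a pure function. It assumes per-feature PSI scores and feature importances are already computed, and that the prediction shift arrives as a single fraction (how that fraction is measured is left to the caller). The function name and argument layout are hypothetical.

```python
def should_page(psi_by_feature, importance_by_feature, prediction_shift,
                psi_threshold=0.2, shift_threshold=0.15, top_k=10):
    """Return the reasons to page on-call; an empty list means no page."""
    # Only the top_k most important features can trigger a page.
    top = sorted(importance_by_feature, key=importance_by_feature.get,
                 reverse=True)[:top_k]
    reasons = [
        f"PSI {psi_by_feature[f]:.2f} on '{f}'"
        for f in top
        if psi_by_feature.get(f, 0.0) > psi_threshold
    ]
    if prediction_shift > shift_threshold:
        reasons.append(f"prediction distribution shifted {prediction_shift:.0%}")
    return reasons
```

Returning reasons rather than a bare boolean lets the alert message say which feature or output tripped the gate.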
Running the first two gates in CI and wiring the third into your alerting stack gives you coverage across the full model lifecycle — no dedicated ML observability platform required on day one.
Sources
- Testing Machine Learning Systems: Code, Data and Models (Made With ML, Anyscale): Comprehensive course module on code, data, and model testing for MLOps practitioners, covering pytest, Great Expectations, and behavioral testing patterns.
- Effective testing for machine learning systems (Jeremy Jordan): Foundational post distinguishing model evaluation from model testing, introducing the invariance/directional/minimum functionality framework.
- What is data drift in ML, and how to detect and handle it (Evidently AI): Technical deep-dive on drift types, detection methods, and statistical test selection from the team behind the Evidently open-source library.
- How Are Machine Learning Models Tested? (Fiddler AI): Overview of robustness, interpretability, and reproducibility testing with analysis of where production monitoring fills gaps left by offline testing.