ML Testing: A Checklist from Pre-Train Checks to Production Drift

Most ML testing failures are silent. The pipeline runs green, the loss converges, and the model gets promoted to production. Then accuracy quietly erodes over the next six weeks because nobody checked whether the input distribution shifted, or whether the model still handles the edge cases it handled six months ago. Occasionally, the model was wrong from day one but the offline metrics looked fine.

ML testing is not a single gate before deployment. It’s a layered practice that spans pre-training sanity checks, post-training behavioral validation, data integrity enforcement, and continuous production monitoring. Here’s what each layer looks like and what to put in your runbook. Testing sits inside the broader machine learning pipeline; if your models are LLMs rather than tabular, the eval and gating story shifts to our LLM testing guide.

Pre-Train Checks: Catch Problems Before the GPU Bill

Pre-train tests run before you commit to a full training job. They’re cheap to execute and catch a surprisingly high share of bugs.

Output shape validation. Confirm that your model’s output tensor shape matches your label shape. A mismatch between (batch, num_classes) and (batch,) will silently produce a wrong loss that still decreases during training — so gradient descent confidently optimizes the wrong objective.

Loss sanity check. For a randomly initialized model, the starting loss should be close to its theoretical value: ln(num_classes) for cross-entropy classification, which works out to ln(2) ≈ 0.693 for binary cross-entropy. A starting loss far from this value almost always means a data preprocessing bug before training has even started.
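The multiclass case can be encoded as a small helper. This is a minimal sketch using only the standard library; the function names and the 25% relative tolerance are illustrative assumptions, not values from any framework.

```python
import math


def expected_initial_loss(num_classes: int) -> float:
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def cross_entropy(probs: list[float], label: int) -> float:
    """Negative log-likelihood of the true label for one example."""
    return -math.log(probs[label])


def check_initial_loss(observed: float, num_classes: int, rel_tol: float = 0.25) -> bool:
    """True if the observed starting loss is within rel_tol of ln(num_classes)."""
    expected = expected_initial_loss(num_classes)
    return abs(observed - expected) <= rel_tol * expected


# A freshly initialized 10-class model should predict roughly uniformly.
uniform = [0.1] * 10
loss = cross_entropy(uniform, label=3)
print(round(loss, 3), check_initial_loss(loss, num_classes=10))  # 2.303 True
```

In practice the observed value would come from averaging the loss over the first batch rather than from a hand-built uniform vector.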

Single-batch overfit check. Train for 20–50 iterations on a single batch with regularization disabled. The model should memorize that batch and drive loss to near-zero. If it can’t, the architecture or optimizer is broken before you’ve spent real compute.
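The shape of this check can be sketched without any deep learning framework. The example below uses a hypothetical one-parameter model (y = w * x) and plain gradient descent as a stand-in for a real training loop; the learning rate and step count are assumptions chosen for this toy problem.

```python
def overfit_single_batch(xs, ys, steps=50, lr=0.05):
    """Fit y = w * x to one batch with plain gradient descent (no regularization).

    Returns the mean-squared-error loss before each step, followed by the final loss.
    """
    w = 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        preds = [w * x for x in xs]
        losses.append(sum((p - y) ** 2 for p, y in zip(preds, ys)) / n)
        grad = 2.0 * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        w -= lr * grad
    losses.append(sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n)
    return losses


batch_x = [1.0, 2.0, 3.0]
batch_y = [2.0, 4.0, 6.0]  # exactly y = 2x, so a working setup can reach zero loss
history = overfit_single_batch(batch_x, batch_y)

assert history[-1] < 1e-6, "model failed to memorize a single batch"
assert history[-1] < history[0]
```

With a real model, the same two assertions apply to the loss recorded from the framework's own training step.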

Data leakage scan. Confirm row indices in your validation and test splits don’t overlap with training. Time-series data requires temporal splits; shuffled splits on time-series data are a common source of offline metrics that collapse entirely at deployment.
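Both halves of the scan reduce to a few lines. The sketch below assumes each row carries a unique ID and, for time-series data, a date; the helper names are hypothetical.

```python
from datetime import date


def overlapping_ids(train_ids, eval_ids):
    """Return the row IDs that appear in both splits (should be empty)."""
    return set(train_ids) & set(eval_ids)


def is_temporal_split(train_dates, eval_dates):
    """True if every training row strictly precedes every evaluation row."""
    return max(train_dates) < min(eval_dates)


train_ids, val_ids = [1, 2, 3, 4], [5, 6]
assert not overlapping_ids(train_ids, val_ids)

train_dates = [date(2024, 1, 5), date(2024, 2, 1)]
val_dates = [date(2024, 3, 1), date(2024, 3, 9)]
assert is_temporal_split(train_dates, val_dates)
```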

Made With ML’s MLOps course recommends encoding these as pytest fixtures that run in CI before any training job is dispatched, which keeps failure feedback under a minute rather than hours. The Arrange-Act-Assert pattern maps cleanly: arrange the model and inputs, act by running a forward pass or a short training loop, assert on shape, loss value, or gradient norm.
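As an illustration of Arrange-Act-Assert applied to the output-shape check, here is a sketch written as a plain test function that pytest would collect by name. TinyClassifier is a hypothetical pure-Python stand-in for a real model; in a real suite the arrange step would typically move into a pytest fixture.

```python
import random


class TinyClassifier:
    """Hypothetical stand-in for a real model: one linear layer, pure Python."""

    def __init__(self, num_features, num_classes, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(num_features)]
            for _ in range(num_classes)
        ]

    def forward(self, batch):
        """Return one list of logits (length num_classes) per input row."""
        return [
            [sum(w * x for w, x in zip(row_w, row)) for row_w in self.weights]
            for row in batch
        ]


def test_output_shape_matches_labels():
    # Arrange: a model, a small batch, and integer class labels.
    model = TinyClassifier(num_features=4, num_classes=3)
    batch = [[0.5, 1.0, -1.0, 2.0], [0.0, 0.1, 0.2, 0.3]]
    labels = [2, 0]

    # Act: a single forward pass.
    logits = model.forward(batch)

    # Assert: one row of logits per label, one logit per class.
    assert len(logits) == len(labels)
    assert all(len(row) == 3 for row in logits)
    assert all(0 <= y < 3 for y in labels)


test_output_shape_matches_labels()  # pytest would discover and run this automatically
```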

Behavioral Tests: Treat the Model as a Black Box

After training, switch from introspective checks to behavioral ones. Behavioral tests treat the model as a black box and assert what it should do, not how it’s built internally.

Jeremy Jordan’s framework identifies three categories that belong in every model test suite:

Invariance tests. Inputs that differ in ways that shouldn’t affect prediction should produce the same output. A sentiment classifier shouldn’t flip from positive to negative because you replaced a character name. A fraud model shouldn’t change its score because you reformatted a phone number. Encode these as parameterized pytest cases with a fixed tolerance on the output delta.
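A sketch of the phone-number case follows. The fraud_score function is a hypothetical toy scorer standing in for a call to the real model, and the tolerance value is an assumption; in a pytest suite the case list would usually be fed through pytest.mark.parametrize rather than a loop.

```python
import re


def normalize_phone(raw: str) -> str:
    """Strip everything except digits so formatting cannot affect features."""
    return re.sub(r"\D", "", raw)


def fraud_score(record: dict) -> float:
    """Hypothetical toy scorer: a stand-in for a call to the real model."""
    digits = normalize_phone(record["phone"])
    score = 0.1
    if record["amount"] > 1000:
        score += 0.5
    if len(digits) != 10:
        score += 0.3
    return score


# Each pair differs only in phone formatting, so the score should not move.
INVARIANCE_CASES = [
    ("(555) 123-4567", "555-123-4567"),
    ("555.123.4567", "5551234567"),
]
TOLERANCE = 1e-6


def test_phone_format_invariance():
    for original, reformatted in INVARIANCE_CASES:
        base = fraud_score({"phone": original, "amount": 250.0})
        variant = fraud_score({"phone": reformatted, "amount": 250.0})
        assert abs(base - variant) <= TOLERANCE


test_phone_format_invariance()
```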

Directional expectation tests. Some input changes should push the output in a predictable direction without asserting an exact value. A house price model should predict a higher price if you add a bathroom, holding all else equal. A churn model should predict higher churn if you move a customer’s last-login date further into the past. These tests are stable across model versions because they test direction, not magnitude.
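The house price case might look like the sketch below. The predict_price function and its coefficients are hypothetical placeholders for the real model; only the comparison matters.

```python
def predict_price(house: dict) -> float:
    """Hypothetical toy price model; replace with a call to the real one."""
    return 50_000 + 150 * house["sqft"] + 12_000 * house["bathrooms"]


def test_extra_bathroom_raises_price():
    base = {"sqft": 1400, "bathrooms": 2}
    variant = {**base, "bathrooms": base["bathrooms"] + 1}  # all else held equal
    # Assert direction only, never the size of the increase.
    assert predict_price(variant) > predict_price(base)


test_extra_bathroom_raises_price()
```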

Minimum functionality tests. Curate a labeled dataset of cases the model must never get wrong: high-confidence easy examples, high-stakes edge cases, and every failure mode you’ve already found in production. Run this suite on every model candidate before promotion. Think of it as a regression suite for model behavior. Jordan’s key distinction: model evaluation summarizes aggregate metrics; model testing makes explicit assertions about specific behaviors. Both are necessary.
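One possible shape for such a suite is a list of curated cases plus a helper that reports every miss. The cases and the classify function below are hypothetical stand-ins; a real suite would load its cases from a versioned file and call the candidate model.

```python
# Hypothetical curated cases: (input text, label the model must produce).
MUST_PASS_CASES = [
    ("I love this product", "positive"),
    ("This is terrible", "negative"),
    ("Absolutely love it, would buy again", "positive"),
]


def classify(text: str) -> str:
    """Hypothetical toy sentiment model standing in for the real candidate."""
    negative_words = {"terrible", "awful", "hate"}
    words = set(text.lower().replace(",", " ").split())
    return "negative" if words & negative_words else "positive"


def failed_cases(model, cases):
    """Return every curated case the model gets wrong."""
    return [(text, want, model(text)) for text, want in cases if model(text) != want]


failures = failed_cases(classify, MUST_PASS_CASES)
assert not failures, f"candidate blocked from promotion: {failures}"
```

Returning the full list of failures, rather than stopping at the first one, makes the promotion report more useful when several cases regress at once.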

Organize tests by “skill” rather than by code path. A suite called test_robustness_to_missing_fields is more maintainable six months later than test_model_forward_pass_v2. For models exposed to adversarial inputs, adversarialml.dev tracks attack patterns and defenses that can inform what belongs in your invariance suite.

Data Validation: The Layer Most Pipelines Skip

Model behavioral tests check model outputs. Data validation checks that inputs are sane before they reach the model.

Use Great Expectations or a similar schema-enforcement library to assert:

No null values in required columns
Numerical features within expected ranges (catches upstream pipeline bugs before they corrupt a training run or inference batch)
Categorical features contain only known labels — unseen categories in production are a common silent failure in tree-based models
No duplicate primary keys in training data
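The four assertions above can be sketched in plain Python to show what the gate checks. This is a standard-library stand-in, not the Great Expectations API; the function name, column names, and limits are all illustrative assumptions.

```python
def validate_rows(rows, required, ranges, categories, key):
    """Return a list of human-readable violations; an empty list means pass."""
    problems = []
    seen_keys = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: null in required column '{col}'")
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: {col}={value} outside [{low}, {high}]")
        for col, allowed in categories.items():
            value = row.get(col)
            if value is not None and value not in allowed:
                problems.append(f"row {i}: unseen category {col}={value!r}")
        if row.get(key) in seen_keys:
            problems.append(f"row {i}: duplicate key {row.get(key)!r}")
        seen_keys.add(row.get(key))
    return problems


rows = [
    {"id": 1, "age": 34, "plan": "pro"},
    {"id": 2, "age": 212, "plan": "pro"},        # out of range
    {"id": 2, "age": 41, "plan": "enterprise"},  # duplicate key, unseen category
    {"id": 3, "age": None, "plan": "free"},      # null in required column
]
violations = validate_rows(
    rows,
    required=["id", "age", "plan"],
    ranges={"age": (0, 120)},
    categories={"plan": {"free", "pro"}},
    key="id",
)
for v in violations:
    print(v)
```

The same function can serve both gates: fail the CI job when the list is non-empty before training, and reject or quarantine the batch before inference.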

Add these checks at two points: as a pre-training gate in CI, and as a pre-inference gate in your serving pipeline. The pre-training gate catches problems before they corrupt the model. The pre-inference gate catches distribution problems before they corrupt live predictions. Skipping the second gate is where most pipelines drop the ball.

Production Monitoring: Drift Detection as Continuous ML Testing

The hardest part of ML testing happens after deployment. Offline test results don’t transfer perfectly to production, and the production distribution drifts over time. This is the handoff point to model monitoring; the full set of drift, data-quality, and decay signals worth tracking is catalogued in our ML monitoring metrics taxonomy.

Data drift occurs when the statistical properties of production inputs diverge from training data. Evidently AI’s documentation covers the main detection approaches:

Summary statistics monitoring: Track mean, median, and variance per feature on a rolling window. Alert when a feature’s mean moves more than two standard deviations away from its mean in the reference period. Simple, low-compute, and catches large shifts quickly.
Statistical tests: Kolmogorov-Smirnov for numerical features, chi-square for categorical. Both produce p-values. On large datasets these tests become oversensitive: even a tiny distributional shift (on the order of 0.001) will be flagged as statistically significant even if it has no practical impact on the model.
Distance metrics: Wasserstein distance, Jensen-Shannon divergence, and Population Stability Index (PSI) give a continuous drift score rather than binary pass/fail. PSI > 0.2 is a common production alert threshold. These scale better on high-volume inference than statistical hypothesis tests.
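To make the PSI threshold concrete, here is a minimal standard-library sketch of the usual formula, the sum over bins of (current − reference) × ln(current / reference). The bin edges, the eps floor for empty bins, and the sample data are assumptions for illustration; production tools choose bins more carefully.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges.

    edges are the interior cut points; values fall into len(edges) + 1 bins.
    eps keeps empty bins from producing log(0) or division by zero.
    """
    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v > e for e in edges)  # number of edges strictly below v
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


edges = [25, 50, 75]
reference = list(range(100))            # roughly uniform over 0..99
shifted = [v + 20 for v in reference]   # same shape, moved up by 20

print(psi(reference, reference, edges))       # 0.0 for identical samples
print(psi(reference, shifted, edges) > 0.2)   # True: crosses the alert threshold
```

Because the score is continuous, a smaller shift (for example, 10 instead of 20 in this toy data) yields a positive value that stays below 0.2, which is what makes PSI usable as a graded alert rather than a binary test.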

For batch inference, run drift reports on each batch before serving and log the aggregate drift score to your observability stack. For real-time inference, compute rolling statistics on a sliding window (e.g., last 1,000 requests) and alert when the drift score crosses threshold.
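For the real-time case, a sliding window can be sketched with a bounded deque. The class below is a hypothetical single-feature monitor that applies a two-standard-deviation rule on the window mean; the class name and the tiny window in the usage lines are assumptions, and a production version would update the mean incrementally instead of recomputing it on every request.

```python
from collections import deque
import statistics


class RollingMeanMonitor:
    """Track one feature over the last `window` requests and flag mean shifts."""

    def __init__(self, reference, window=1000, num_std=2.0):
        self.ref_mean = statistics.fmean(reference)
        self.ref_std = statistics.pstdev(reference)
        self.values = deque(maxlen=window)  # oldest values drop off automatically
        self.num_std = num_std

    def observe(self, value):
        """Record one request's feature value; return True if an alert should fire."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # wait for a full window before alerting
        current_mean = statistics.fmean(self.values)
        return abs(current_mean - self.ref_mean) > self.num_std * self.ref_std


# Tiny window so the behavior is visible; production would use something like 1,000.
monitor = RollingMeanMonitor(reference=list(range(100)), window=5)
alerts = [monitor.observe(v) for v in [50, 50, 50, 50, 50, 200, 200, 200]]
print(alerts)
```

The alert fires only once enough shifted values have entered the window to pull its mean past the threshold, which is the trade-off a window size controls: larger windows are less noisy but slower to react.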

mlmonitoring.report and mlobserve.com both track tooling in this space, and our model monitoring tools comparison weighs them head-to-head. Open-source options such as Evidently and Deepchecks cover most use cases without requiring a platform purchase. Commercial platforms like Arize AI, Fiddler, and WhyLabs add label feedback loops and alerting integrations that matter at scale, once you’re operating multiple models with delayed ground truth.

One critical point from Fiddler’s overview: input drift is often detectable well before you see a drop in business metrics. Monitoring feature distributions lets you catch problems in days rather than weeks, before silent degradation compounds into an incident.

What Goes in the Runbook

A minimal ML testing runbook has three gates:

Gate 1 — Pre-training (CI): Pre-train sanity checks + data validation suite. No training job dispatches if this fails. Runtime under 2 minutes.

Gate 2 — Pre-promotion (staging): Behavioral test suite (invariance + directional + minimum functionality) + offline evaluation against a held-out test set. Candidates must pass both. Failures block the promotion, not just raise a warning.

Gate 3 — Production monitoring (ongoing): Drift score per feature, aggregate drift alert (PSI threshold), performance metric tracking where labels are available. Page on-call when PSI > 0.2 on any feature in the top-10 by feature importance, or when prediction distribution shifts more than 15% from the reference window.
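The Gate 3 paging rule can be written down as a small decision function. This sketch assumes per-feature PSI scores and feature importances are already computed elsewhere, and that the prediction shift arrives as a precomputed relative change (0.15 meaning 15%); the function and parameter names are hypothetical.

```python
PSI_THRESHOLD = 0.2            # per-feature PSI alert level from Gate 3
PREDICTION_SHIFT_LIMIT = 0.15  # 15% shift versus the reference window


def should_page(psi_by_feature, importance_by_feature, prediction_shift, top_k=10):
    """Return (page?, reasons) for the Gate 3 rule.

    prediction_shift is assumed to be a precomputed relative change in the
    prediction distribution versus the reference window.
    """
    top = sorted(importance_by_feature, key=importance_by_feature.get, reverse=True)[:top_k]
    reasons = [
        f"PSI {psi_by_feature[name]:.2f} on top-{top_k} feature '{name}'"
        for name in top
        if psi_by_feature.get(name, 0.0) > PSI_THRESHOLD
    ]
    if prediction_shift > PREDICTION_SHIFT_LIMIT:
        reasons.append(f"prediction distribution shifted {prediction_shift:.0%}")
    return bool(reasons), reasons


page, why = should_page(
    psi_by_feature={"income": 0.31, "age": 0.05},
    importance_by_feature={"income": 0.6, "age": 0.4},
    prediction_shift=0.04,
)
print(page, why)
```

Returning the reasons alongside the boolean gives the on-call engineer the offending feature in the page itself.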

Running the first two gates in CI and wiring the third into your alerting stack gives you coverage across the full model lifecycle — no dedicated ML observability platform required on day one.

Sources

Testing Machine Learning Systems: Code, Data and Models. Made With ML (Anyscale). Comprehensive course module on code, data, and model testing for MLOps practitioners, covering pytest, Great Expectations, and behavioral testing patterns.
Effective testing for machine learning systems. Jeremy Jordan. Foundational post distinguishing model evaluation from model testing, introducing the invariance/directional/minimum functionality framework.
What is data drift in ML, and how to detect and handle it. Evidently AI. Technical deep-dive on drift types, detection methods, and statistical test selection from the team behind the Evidently open-source library.
How Are Machine Learning Models Tested? Fiddler AI. Overview of robustness, interpretability, and reproducibility testing with analysis of where production monitoring fills gaps left by offline testing.
