ML Testing: A Checklist from Pre-Train Checks to Production Drift

ML testing spans pre-train sanity checks, behavioral validation, data integrity, and continuous drift monitoring. Here's what actually belongs in your CI pipeline and runbook.

By SentryML Editorial · 8 min read

Most ML testing failures are silent. The pipeline runs green, the loss converges, and the model gets promoted to production. Then accuracy quietly erodes over the next six weeks because nobody checked whether the input distribution shifted, or whether the model still handles the edge cases it handled six months ago. Occasionally, the model was wrong from day one but the offline metrics looked fine.

ML testing is not a single gate before deployment. It’s a layered practice that spans pre-training sanity checks, post-training behavioral validation, data integrity enforcement, and continuous production monitoring. Here’s what each layer looks like and what to put in your runbook.

Pre-Train Checks: Catch Problems Before the GPU Bill

Pre-train tests run before you commit to a full training job. They’re cheap to execute and catch a surprisingly high share of bugs.

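Output shape validation. Confirm that your model’s output tensor shape matches the shape your loss function expects for the labels. Some mismatches fail loudly, but others do not: a regression head that emits (batch, 1) against labels shaped (batch,) can broadcast to a (batch, batch) matrix inside an elementwise loss, producing a value that still decreases during training, so gradient descent confidently optimizes the wrong objective.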
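One way a shape mismatch slips through without an error is broadcasting inside elementwise losses such as mean squared error. The sketch below is dependency-free and rests on assumptions: broadcast_shape reimplements NumPy-style broadcasting rules only to show what shape the loss would take, and check_loss_shapes is a hypothetical guard, not a function from any library.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Shape produced by NumPy-style broadcasting of shapes a and b."""
    out = []
    # Compare dimensions from the trailing axis, padding the shorter shape with 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def check_loss_shapes(pred_shape, label_shape):
    """Hypothetical pre-train guard: require identical shapes before an elementwise loss."""
    if tuple(pred_shape) != tuple(label_shape):
        raise AssertionError(
            f"prediction shape {tuple(pred_shape)} != label shape {tuple(label_shape)}; "
            f"an elementwise loss would operate on {broadcast_shape(pred_shape, label_shape)}"
        )
```

With a batch of 32, predictions shaped (32, 1) and labels shaped (32,) broadcast to (32, 32), so the loss averages over 1,024 pairings instead of 32 matched pairs. The guard turns that into an immediate failure.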

Loss sanity check. For a randomly initialized classifier, the starting loss should be close to the theoretical value for uniform predictions: ln(num_classes) for cross-entropy classification, which is ln(2) ≈ 0.693 for binary cross-entropy. A value far from this almost always means a bug in data preprocessing, label encoding, or initialization before training has even started.
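The expected starting loss follows from a model that assigns equal probability to every class: cross-entropy is then -ln(1/num_classes) = ln(num_classes), and the two-class case gives ln(2) ≈ 0.693. A minimal sketch of the check, where initial_loss_is_sane and its 25% tolerance are assumptions for illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy loss for one example with an integer class label."""
    return -math.log(softmax(logits)[label])


def initial_loss_is_sane(observed_loss, num_classes, rel_tol=0.25):
    """True if the starting loss is within rel_tol of ln(num_classes).

    rel_tol is an assumed tolerance; tune it for your initialization scheme.
    """
    expected = math.log(num_classes)
    return abs(observed_loss - expected) <= rel_tol * expected


# All-zero logits give uniform probabilities, so the loss equals ln(10) here.
uniform_loss = cross_entropy([0.0] * 10, label=3)
```

In a real pipeline the observed value would be the mean loss of the untrained model on one batch rather than a hand-built logit vector.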

Single-batch overfit check. Train for 20–50 iterations on a single batch with regularization disabled. The model should memorize that batch and drive loss to near-zero. If it can’t, the architecture or optimizer is broken before you’ve spent real compute.
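The overfit check can be sketched without any framework. The example below trains a one-feature logistic regression with plain gradient descent on a single four-example batch; the data, learning rate, step count, and the 0.1 pass threshold are all assumptions chosen for illustration, and a real check would call your own model and optimizer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_loss(w, b, xs, ys):
    """Mean binary cross-entropy of the model sigmoid(w * x + b) on one batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def overfit_single_batch(xs, ys, steps=50, lr=1.0):
    """Run gradient descent on one batch with no regularization; return the loss history."""
    w, b = 0.0, 0.0
    losses = [batch_loss(w, b, xs, ys)]
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = sum(errors) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
        losses.append(batch_loss(w, b, xs, ys))
    return losses


# One tiny, linearly separable batch the model should be able to memorize.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
losses = overfit_single_batch(xs, ys)
passed = losses[-1] < 0.1 and losses[-1] < losses[0]
```

The loss starts at ln(2) because the weights start at zero, then falls steadily. If the same loop stalled near its starting value on your model, that would point to a wiring problem rather than a need for more data.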

Data leakage scan. Confirm row indices in your validation and test splits don’t overlap with training. Time-series data requires temporal splits; shuffled splits on time-series data are a common source of offline metrics that collapse entirely at deployment.
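Both halves of the leakage scan reduce to a few lines. This is a sketch under assumptions: rows are identified by hashable IDs, and each time-series row is a dict with a "timestamp" key (a hypothetical field name).

```python
def find_index_overlap(train_ids, val_ids, test_ids):
    """Return the IDs shared between each pair of splits; all sets should be empty."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    return {
        "train_val": train & val,
        "train_test": train & test,
        "val_test": val & test,
    }


def temporal_split(rows, cutoff, time_key="timestamp"):
    """Split rows so everything before cutoff trains and everything after is held out."""
    train = [row for row in rows if row[time_key] < cutoff]
    holdout = [row for row in rows if row[time_key] >= cutoff]
    return train, holdout


def assert_no_leakage(train_ids, val_ids, test_ids):
    """Raise if any row appears in more than one split."""
    overlap = find_index_overlap(train_ids, val_ids, test_ids)
    leaked = {name: ids for name, ids in overlap.items() if ids}
    assert not leaked, f"rows appear in multiple splits: {leaked}"
```

ISO 8601 date strings compare correctly as plain strings, so a cutoff such as "2024-01-01" works without parsing as long as every timestamp uses the same format.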

Made With ML’s MLOps course recommends encoding these as pytest fixtures that run in CI before any training job is dispatched — keeping failure feedback under a minute rather than hours. The Arrange-Act-Assert pattern maps cleanly: arrange the model and inputs, act by running a forward pass or a short training loop, assert on shape, loss value, or gradient norm.
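As an illustration of the Arrange-Act-Assert layout, here is a minimal sketch of one such test. TinyClassifier is a hypothetical stand-in for a real model, and the test is a plain function whose name starts with test_ so that pytest would collect it; nothing here is taken from the Made With ML course itself.

```python
import math


class TinyClassifier:
    """Hypothetical stand-in for a real model: one linear layer followed by softmax."""

    def __init__(self, weights):
        self.weights = weights  # weights[k] is the weight vector for class k

    def forward(self, batch):
        outputs = []
        for features in batch:
            logits = [sum(w * x for w, x in zip(row, features)) for row in self.weights]
            peak = max(logits)
            exps = [math.exp(z - peak) for z in logits]
            total = sum(exps)
            outputs.append([e / total for e in exps])
        return outputs


def test_forward_pass_shape_and_normalization():
    # Arrange: a three-class model and a batch of four two-feature examples.
    model = TinyClassifier(weights=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
    batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

    # Act: run a single forward pass.
    probs = model.forward(batch)

    # Assert: output is (batch, num_classes) and each row is a probability distribution.
    assert len(probs) == len(batch)
    assert all(len(row) == 3 for row in probs)
    assert all(abs(sum(row) - 1.0) < 1e-9 for row in probs)
```

Keeping each test to one arrange, one act, and a small group of related assertions makes a CI failure point at a single behavior.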

Behavioral Tests: Treat the Model as a Black Box

After training, switch from introspective checks to behavioral ones. Behavioral tests treat the model as a black box and assert what it should do, not how it’s built internally.

Jeremy Jordan’s framework identifies three categories that belong in every model test suite:

Invariance tests. Inputs that differ in ways that shouldn’t affect prediction should produce the same output. A sentiment classifier shouldn’t flip from positive to negative because you replaced a character name. A fraud model shouldn’t change its score because you reformatted a phone number. Encode these as parameterized pytest cases with a fixed tolerance on the output delta.
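The phone-number case can be sketched as follows. toy_fraud_score is a hypothetical stand-in for a real model and find_invariance_violations is an illustrative helper, not part of any library; the tolerance value is an assumption.

```python
import re


def normalize_phone(raw):
    """Strip everything except digits."""
    return re.sub(r"\D", "", raw)


def toy_fraud_score(record):
    """Hypothetical stand-in model that scores on amount and phone validity."""
    score = 0.1
    if record["amount"] > 1000:
        score += 0.5
    if len(normalize_phone(record["phone"])) != 10:
        score += 0.3
    return score


def find_invariance_violations(model, base, variants, tol=1e-6):
    """Return the variants whose score differs from the base score by more than tol."""
    base_score = model(base)
    return [v for v in variants if abs(model(v) - base_score) > tol]


base = {"amount": 1500, "phone": "5550104477"}
variants = [
    {"amount": 1500, "phone": "(555) 010-4477"},
    {"amount": 1500, "phone": "555.010.4477"},
]
violations = find_invariance_violations(toy_fraud_score, base, variants)
```

In a pytest suite, each variant would typically become one parameterized case so a failure names the exact perturbation that broke the invariance.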

Directional expectation tests. Some input changes should push the output in a predictable direction, and the test asserts that direction without asserting an exact value. A house price model should predict a higher price if you add a bathroom, holding all else equal. A churn model should predict higher churn if you move a customer’s last-login date further into the past. These tests are stable across model versions because they test direction, not magnitude.
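A directional test only needs the sign of the change. In this sketch, toy_price_model is a hypothetical linear model with made-up coefficients, and check_direction is an illustrative helper.

```python
def toy_price_model(house):
    """Hypothetical pricing model used only for illustration."""
    return 50_000 + 150 * house["sqft"] + 10_000 * house["bathrooms"]


def check_direction(model, base, feature, delta, expect="increase"):
    """Perturb one feature, hold the rest fixed, and check the sign of the output change."""
    changed = dict(base)
    changed[feature] = base[feature] + delta
    diff = model(changed) - model(base)
    return diff > 0 if expect == "increase" else diff < 0


house = {"sqft": 1500, "bathrooms": 2}
adding_bathroom_raises_price = check_direction(toy_price_model, house, "bathrooms", 1)
```

Because the assertion is on the sign of the difference, the same test keeps passing when a retrained model changes how much a bathroom is worth.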

Minimum functionality tests. Curate a labeled dataset of cases the model must never get wrong: high-confidence easy examples, high-stakes edge cases, and every failure mode you’ve already found in production. Run this suite on every model candidate before promotion. Think of it as a regression suite for model behavior. Jordan’s key distinction: model evaluation summarizes aggregate metrics; model testing makes explicit assertions about specific behaviors. Both are necessary.
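A must-pass suite is a list of labeled cases plus a hard gate on zero failures. The sketch below uses a hypothetical keyword-based sentiment model and three invented cases; a real suite would load curated examples from version control and call the candidate model.

```python
def toy_sentiment_model(text):
    """Hypothetical stand-in classifier based on keywords."""
    lowered = text.lower()
    if "not good" in lowered or "terrible" in lowered:
        return "negative"
    if "good" in lowered or "great" in lowered:
        return "positive"
    return "neutral"


# Invented examples: one easy positive, one easy negative, one negation edge case.
MUST_PASS_CASES = [
    ("This product is great", "positive"),
    ("Terrible support experience", "negative"),
    ("This is not good", "negative"),
]


def run_must_pass_suite(model, cases):
    """Return (text, expected, predicted) for every case the model gets wrong."""
    failures = []
    for text, expected in cases:
        predicted = model(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures


def can_promote(model, cases):
    """A candidate is promotable only if it fails none of the curated cases."""
    return not run_must_pass_suite(model, cases)
```

Each production incident should add at least one case to the list, which is what makes the suite behave like a regression suite over time.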

Organize tests by “skill” rather than by code path. A suite called test_robustness_to_missing_fields is more maintainable six months later than test_model_forward_pass_v2. For models exposed to adversarial inputs, adversarialml.dev tracks attack patterns and defenses that can inform what belongs in your invariance suite.

Data Validation: The Layer Most Pipelines Skip

Model behavioral tests check model outputs. Data validation checks that inputs are sane before they reach the model.

Use Great Expectations or a similar schema-enforcement library to assert properties of the input data, for example column types, allowed value ranges, and null rates.

Add these checks at two points: as a pre-training gate in CI, and as a pre-inference gate in your serving pipeline. The pre-training gate catches problems before they corrupt the model. The pre-inference gate catches distribution problems before they corrupt live predictions. Skipping the second gate is where most pipelines drop the ball.
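To make the two gates concrete, here is a dependency-free sketch of a validator that both could share. It deliberately does not use the Great Expectations API; validate_rows, the schema layout, and the 5% null-rate limit are assumptions for illustration.

```python
def validate_rows(rows, schema, max_null_rate=0.05):
    """Return a list of violations; an empty list means the batch passes."""
    errors = []
    for column, spec in schema.items():
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values) if values else 0.0
        if null_rate > max_null_rate:
            errors.append(f"{column}: null rate {null_rate:.2f} exceeds {max_null_rate:.2f}")
        if any(not isinstance(v, spec["type"]) for v in present):
            errors.append(f"{column}: expected type {spec['type'].__name__}")
            continue  # range checks are meaningless on wrongly typed values
        if "min" in spec and any(v < spec["min"] for v in present):
            errors.append(f"{column}: value below {spec['min']}")
        if "max" in spec and any(v > spec["max"] for v in present):
            errors.append(f"{column}: value above {spec['max']}")
        if "allowed" in spec and any(v not in spec["allowed"] for v in present):
            errors.append(f"{column}: value outside allowed set")
    return errors


SCHEMA = {
    "age": {"type": int, "min": 0, "max": 120},
    "country": {"type": str, "allowed": {"US", "DE"}},
}
```

The pre-training gate would call this on the training set and fail the CI job on any violation. The pre-inference gate would call the same function on each incoming batch and reject or quarantine it.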

Production Monitoring: Drift Detection as Continuous ML Testing

The hardest part of ML testing happens after deployment. Offline test results don’t transfer perfectly to production, and the production distribution drifts over time.

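Data drift occurs when the statistical properties of production inputs diverge from training data. Evidently AI’s documentation covers the main detection approaches, which include statistical tests on feature distributions and distance measures such as the Population Stability Index (PSI).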

For batch inference, run drift reports on each batch before serving and log the aggregate drift score to your observability stack. For real-time inference, compute rolling statistics on a sliding window (e.g., the last 1,000 requests) and alert when the drift score crosses a threshold.
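The sliding-window approach can be sketched for a single numeric feature using the Population Stability Index as the drift score. This is an assumption-laden illustration, not Evidently's implementation: the bin edges are supplied by hand, empty bins are floored at a small epsilon, and RollingDriftMonitor is a hypothetical class.

```python
import math
from collections import deque


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index between two samples, binned at sorted edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of the bin containing v
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


class RollingDriftMonitor:
    """Keeps the most recent `window` values and scores them against a reference sample."""

    def __init__(self, reference, edges, window=1000, threshold=0.2):
        self.reference = list(reference)
        self.edges = list(edges)
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value):
        """Record one value; return the PSI once the window is full, else None."""
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return None
        return psi(self.reference, self.recent, self.edges)

    def is_drifting(self, score):
        return score is not None and score > self.threshold
```

Recomputing PSI on every request is wasteful at high traffic. Scoring every N requests or on a timer keeps the same alerting behavior at lower cost.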

mlmonitoring.report and mlobserve.com both track tooling in this space. Open-source options — Evidently, Deepchecks — cover most use cases without requiring a platform purchase. Commercial platforms like Arize AI, Fiddler, and WhyLabs add label feedback loops and alerting integrations that matter at scale, once you’re operating multiple models with delayed ground truth.

One critical point from Fiddler’s overview: input drift is often detectable well before you see a drop in business metrics. Monitoring feature distributions lets you catch problems in days rather than weeks, before silent degradation compounds into an incident.

What Goes in the Runbook

A minimal ML testing runbook has three gates:

Gate 1 — Pre-training (CI): Pre-train sanity checks + data validation suite. No training job dispatches if this fails. Runtime under 2 minutes.

Gate 2 — Pre-promotion (staging): Behavioral test suite (invariance + directional + minimum functionality) + offline evaluation against a held-out test set. Candidates must pass both. Failures block the promotion, not just raise a warning.

Gate 3 — Production monitoring (ongoing): Drift score per feature, aggregate drift alert (PSI threshold), performance metric tracking where labels are available. Page on-call when PSI > 0.2 on any feature in the top-10 by feature importance, or when prediction distribution shifts more than 15% from the reference window.
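The Gate 3 paging rule can be written down as a small function. This is a sketch under assumptions: should_page is a hypothetical helper, PSI scores and feature importances arrive as dicts keyed by feature name, and prediction_shift is a fraction computed however your team defines the shift against the reference window.

```python
def should_page(psi_by_feature, importance_by_feature, prediction_shift,
                psi_threshold=0.2, top_k=10, shift_threshold=0.15):
    """Return the list of reasons to page on-call; an empty list means no page."""
    # Rank features by importance and keep only the top_k for the PSI rule.
    top_features = sorted(
        importance_by_feature, key=importance_by_feature.get, reverse=True
    )[:top_k]

    reasons = []
    for feature in top_features:
        score = psi_by_feature.get(feature, 0.0)
        if score > psi_threshold:
            reasons.append(f"PSI {score:.2f} on top-{top_k} feature {feature}")

    if prediction_shift > shift_threshold:
        reasons.append(f"prediction distribution shifted {prediction_shift:.0%}")
    return reasons
```

Returning reasons rather than a bare boolean lets the alert message tell the on-call engineer which feature or output triggered the page.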

Running the first two gates in CI and wiring the third into your alerting stack gives you coverage across the full model lifecycle — no dedicated ML observability platform required on day one.

Sources

  1. Testing Machine Learning Systems: Code, Data and Models — Made With ML (Anyscale)
  2. Effective testing for machine learning systems — Jeremy Jordan
  3. What is data drift in ML, and how to detect and handle it — Evidently AI
  4. How Are Machine Learning Models Tested? — Fiddler AI