The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay
A reference taxonomy of the signals that actually tell you a production ML system is failing — input drift, prediction drift, concept drift, data quality, and performance decay — and which open-source tool computes each one.
The hardest part of ML monitoring is not building dashboards. It is knowing which number, when it moves, actually means your model is in trouble — and which numbers are noise that will page you at 3am for nothing. Most teams I review have inverted this: they alert on the metrics that are easy to compute (request latency, error rate, a single accuracy score on a stale holdout) and ignore the ones that actually predict production failure.
This post is a reference taxonomy. It organizes the signals worth monitoring into four families, defines each one precisely enough that you can implement it, and names the open-source tool that computes it. It is meant to be the page you come back to when you are deciding what to instrument, not a one-time read.
The four families are: input drift (the data coming in changed), prediction drift (the model’s outputs changed), concept drift (the relationship between inputs and the target changed), and data quality (the inputs are broken in ways that have nothing to do with distribution). Performance decay — the metric everyone wants — is downstream of all four, and the central operational problem is that you usually cannot measure it directly in production because the labels arrive late or never.
Family 1: Input drift
Input drift is a change in the distribution of the features your model receives, relative to a reference window (typically the training set or a recent stable period). It is the most-monitored and most-misunderstood family, because drift is not the same as degradation. A feature can drift substantially without the model losing any accuracy, and a model can decay badly while every input feature looks stable. Input drift is a leading indicator worth watching, not a verdict.
The signals worth computing:
- Per-feature distributional distance. For each feature, compare the production distribution to the reference. For numeric features, common choices are the Kolmogorov-Smirnov statistic, the Wasserstein (earth-mover’s) distance, and Population Stability Index (PSI). For categorical features, use Jensen-Shannon divergence or a chi-squared test. Evidently picks a default test per feature type and sample size, which is a reasonable starting heuristic before you tune.
- Multivariate drift. Per-feature tests miss correlation shifts: each feature’s marginal looks fine while their joint distribution moved. The practical approaches are a domain classifier (train a model to distinguish reference from production samples; if it succeeds, the distributions differ) and PCA-reconstruction-error drift, which NannyML implements as its data-reconstruction method.
- Share of drifting features. A single aggregate: what fraction of monitored features crossed their drift threshold this window. This is the number to put on a dashboard and alert on, not 200 individual feature charts.
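A minimal sketch of the per-feature score and the share-of-drifting-features aggregate, in plain Python. The feature names, the equal-width binning, and the 0.25 PSI threshold are illustrative assumptions; this is not how any particular library computes it.

```python
import math


def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the reference range; eps avoids log(0).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p, prod_p = proportions(reference), proportions(production)
    return sum((q - p) * math.log(q / p) for p, q in zip(ref_p, prod_p))


def share_of_drifting_features(reference, production, threshold=0.25):
    """Return (fraction of features above the PSI threshold, per-feature scores)."""
    scores = {name: psi(reference[name], production[name]) for name in reference}
    drifting = [name for name, score in scores.items() if score > threshold]
    return len(drifting) / len(scores), scores


# Hypothetical two-feature example: "age" is unchanged, "amount" is shifted up.
reference = {
    "age": [20 + (i % 50) for i in range(1000)],
    "amount": [float(i % 100) for i in range(1000)],
}
production = {
    "age": [20 + (i % 50) for i in range(1000)],
    "amount": [50.0 + (i % 100) for i in range(1000)],
}

share, scores = share_of_drifting_features(reference, production)
print(f"share of drifting features: {share:.0%}")
```

The aggregate is the number to alert on; the per-feature `scores` dictionary is what you open during an investigation.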
A note on thresholds. PSI has folklore thresholds (>0.1 “moderate,” >0.25 “significant”) that are fine as defaults but should be recalibrated against your own baseline false-positive rate, exactly as you would tune any detection rule. Drift tests on large samples will flag statistically significant differences that are operationally meaningless; prefer effect-size measures (Wasserstein, PSI) over p-value tests at high volume.
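To see why p-value tests over-fire at high volume, here is a small sketch using the standard large-sample critical value for the two-sample Kolmogorov-Smirnov statistic. The observed gap of 0.01 and the window sizes are illustrative assumptions.

```python
import math


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic D."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


observed_d = 0.01  # a small, fixed gap between the two empirical CDFs

for n in (1_000, 100_000, 10_000_000):
    critical = ks_critical_value(n, n)
    print(f"n={n:>10,}  critical D={critical:.5f}  significant={observed_d > critical}")
```

The same 0.01 gap is not significant at 1,000 rows per window but is at 100,000, even though nothing about its operational meaning has changed. An effect-size threshold stays put as volume grows.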
Family 2: Prediction drift
Prediction drift is a change in the distribution of the model’s outputs — the predicted class probabilities, the regression outputs, or the rate of each predicted label. It is cheaper and faster than input drift to compute (one dimension, the output, instead of N features) and it is closer to the thing you care about.
The signals:
- Output distribution distance. Same distance measures as input drift, applied to the prediction. A classifier whose positive-rate jumps from 4% to 19% with no change in upstream business volume is telling you something — either the input changed in a way your per-feature tests missed, or the model is responding to a genuine population shift.
- Confidence/score distribution. Track the distribution of the model’s confidence (max softmax probability, or the score margin). A collapse toward the decision boundary — more predictions clustered near 0.5 — often precedes accuracy loss even when the predicted-label rate looks stable.
- Prediction drift as an early-warning proxy. Because predictions are available in real time and labels are not, prediction drift is frequently the earliest measurable signal of trouble. Treat a sustained prediction-distribution shift as a trigger to investigate, then confirm with whatever delayed-label signal you can get.
Prediction drift has the same trap as input drift: it can move for legitimate reasons (real seasonality, a real change in the population). The discipline is to maintain a reference that reflects healthy variation, not a single frozen snapshot from training day.
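The first two signals above can be sketched in a few lines for a binary classifier. The 0.5 decision threshold, the 0.1 band around it, and the alert deltas are assumptions you would tune against your own reference.

```python
def positive_rate(scores, threshold=0.5):
    """Share of predictions at or above the decision threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def near_boundary_share(scores, threshold=0.5, band=0.1):
    """Share of predictions within `band` of the decision threshold."""
    return sum(abs(s - threshold) <= band for s in scores) / len(scores)


def prediction_drift_flags(reference, production,
                           max_rate_delta=0.05, max_boundary_delta=0.10):
    """Compare a production window to a reference window of model scores."""
    rate_delta = abs(positive_rate(production) - positive_rate(reference))
    boundary_delta = near_boundary_share(production) - near_boundary_share(reference)
    return {
        "positive_rate_shift": rate_delta > max_rate_delta,
        "confidence_collapse": boundary_delta > max_boundary_delta,
    }


# Hypothetical scores: the positive rate is unchanged at 10%,
# but many production scores have moved toward the boundary.
reference_scores = [0.05] * 90 + [0.95] * 10
production_scores = [0.05] * 60 + [0.45] * 30 + [0.55] * 2 + [0.95] * 8

flags = prediction_drift_flags(reference_scores, production_scores)
print(flags)
```

In this example the predicted-label rate looks stable while the confidence distribution has collapsed, which is the case a positive-rate check alone would miss.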
Family 3: Concept drift
Concept drift is the one that actually hurts. It is a change in the relationship between inputs and the target — the function the model approximated has moved, so the same input now maps to a different correct output. Fraud patterns evolve, user behavior shifts, an upstream system changes what a field means. Critically, concept drift can occur with zero input drift: the inputs look identical, but the right answer changed.
The signals:
- Performance metrics, when labels exist. If you get ground truth (even delayed), the gold-standard signal is the trend in your task metric — accuracy, F1, AUC, RMSE — computed on rolling production windows against a confidence interval, not a single point estimate. A 2-point AUC drop inside the noise band is not an incident; a sustained slide outside it is.
- Estimated performance, when labels do not exist. This is the common case, and it is where NannyML is purpose-built. Its Confidence-Based Performance Estimation (CBPE) uses the model’s own calibrated probabilities to estimate metrics like ROC-AUC before labels arrive; its Direct Loss Estimation (DLE) targets regression. These are estimates with assumptions (chiefly that the model stays well-calibrated and that no concept drift invalidates the calibration), so treat them as a strong proxy that you reconcile against true labels when they land.
- Error-region analysis. When labels do arrive, do not just track the aggregate metric — slice it. Concept drift often concentrates in a segment (one region, one device class, one cohort) while the global metric barely moves. The segment that breaks first is the most actionable signal you have.
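For the labeled case, a rough sketch of "interval, not point estimate" combined with slicing. It uses a Wilson score interval on accuracy; the segment names, counts, and the 0.91 baseline are hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (here, accuracy in one window)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def is_degraded(correct, total, baseline):
    """Flag a window only if its whole interval sits below the baseline."""
    _, upper = wilson_interval(correct, total)
    return upper < baseline


# Hypothetical (correct, total) counts per segment for one labeled window.
segments = {"web": (4650, 5000), "ios": (2790, 3000), "android": (1640, 2000)}
baseline = 0.91

overall_correct = sum(c for c, _ in segments.values())
overall_total = sum(t for _, t in segments.values())
overall_degraded = is_degraded(overall_correct, overall_total, baseline)
degraded_segments = [
    name for name, (c, t) in segments.items() if is_degraded(c, t, baseline)
]
print(overall_degraded, degraded_segments)
```

Here the global metric stays inside its noise band while one segment is clearly below baseline, so the slice is what tells you where to look.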
The honest framing: input and prediction drift are observable immediately but only correlate with harm; concept drift is what causes harm but is observable only with labels or estimation. A monitoring program that watches only the cheap upstream signals is watching shadows. One that waits for labeled performance is always late. You need both layers, wired together.
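For the estimation layer, the core idea behind confidence-based estimation can be shown on accuracy. This is a deliberately simplified illustration, not NannyML's implementation or API, and it assumes a binary classifier whose probabilities are well calibrated.

```python
def estimated_accuracy(probabilities, threshold=0.5):
    """Expected accuracy from calibrated positive-class probabilities.

    If p is calibrated, a positive prediction is correct with probability p
    and a negative prediction is correct with probability 1 - p.
    """
    expected_correct = [p if p >= threshold else 1 - p for p in probabilities]
    return sum(expected_correct) / len(expected_correct)


# Hypothetical score batches: one confident, one hovering near the boundary.
confident_batch = [0.95, 0.05, 0.90, 0.10]
uncertain_batch = [0.60, 0.40, 0.55, 0.45]

print(f"{estimated_accuracy(confident_batch):.3f}")
print(f"{estimated_accuracy(uncertain_batch):.3f}")
```

The estimate needs no labels, which is its appeal, and it is only as good as the calibration assumption. If concept drift breaks calibration, the estimate can stay high while true accuracy falls, so reconcile it against real labels whenever they arrive.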
Family 4: Data quality
Data quality failures are not distribution shifts — they are the pipeline being broken. They are the single most common cause of real production ML incidents, and they are the cheapest to catch, which makes ignoring them inexcusable.
The signals:
- Schema and type conformance. Did a column disappear, change type, or get renamed upstream? A `float` field arriving as a string, or a feature that is suddenly 100% null, breaks inference regardless of any model property.
- Missing-value and null rates per feature, against an expected baseline. A null rate jumping from 2% to 40% is a broken upstream join, not drift.
- Range and constraint violations. Out-of-range numerics (a negative age, a price of zero), unseen categorical values, cardinality explosions. These are validation rules, and they should fail loud and early.
- Volume and freshness. Row counts per batch and the recency of the newest record. A pipeline silently delivering yesterday’s data is a class of failure no distribution test will catch.
- Training-serving skew. The same transformation applied differently in training versus serving — a normalization constant, a tokenizer version, a default-fill value. This is a quiet, high-impact failure that lives at the boundary between two codebases.
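Most of these checks are plain validation rules. Here is a sketch of a per-batch validator in standard-library Python; the schema, ranges, category sets, and null-rate limits are hypothetical placeholders for your own contract.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical data contract for one feature table.
EXPECTED_SCHEMA = {"age": int, "price": float, "country": str}
RANGES = {"age": (0, 120), "price": (0.01, 1_000_000.0)}
KNOWN_CATEGORIES = {"country": {"US", "DE", "FR"}}
MAX_NULL_RATE = {"age": 0.05, "price": 0.05, "country": 0.05}


def check_batch(rows, newest_record_time, now,
                min_rows=1, max_staleness=timedelta(hours=24)):
    """Return a list of human-readable data-quality violations for one batch."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"volume: {len(rows)} rows, expected at least {min_rows}")
    if now - newest_record_time > max_staleness:
        problems.append("freshness: newest record is older than allowed")

    for column, expected_type in EXPECTED_SCHEMA.items():
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values) if values else 1.0
        if null_rate > MAX_NULL_RATE[column]:
            problems.append(f"nulls: {column} null rate {null_rate:.0%}")
        if any(not isinstance(v, expected_type) for v in present):
            problems.append(f"schema: {column} has non-{expected_type.__name__} values")
            continue  # range checks are meaningless on mistyped values
        if column in RANGES:
            lo, hi = RANGES[column]
            if any(not lo <= v <= hi for v in present):
                problems.append(f"range: {column} outside [{lo}, {hi}]")
        if column in KNOWN_CATEGORIES:
            unseen = set(present) - KNOWN_CATEGORIES[column]
            if unseen:
                problems.append(f"categories: {column} has unseen values {sorted(unseen)}")
    return problems


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"age": 34, "price": 19.99, "country": "US"},
    {"age": -1, "price": "12.50", "country": "BR"},
]
problems = check_batch(rows, newest_record_time=now - timedelta(hours=30), now=now)
for problem in problems:
    print(problem)
```

A non-empty result should fail the batch before it reaches the model. Training-serving skew is not covered here, because it needs the training-time transformation to compare against.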
whylogs is built around this family: it computes lightweight statistical profiles (counts, null rates, distribution sketches, type counts) of your data without storing the raw records, which makes it well-suited to high-volume or privacy-sensitive pipelines where you cannot retain inputs. Evidently ships data-quality test suites alongside its drift reports.
What computes what
A practitioner’s mapping, current as of 2026. All four tools below are open-source Python libraries; the right choice depends on whether your bottleneck is no-label performance estimation, online streaming detection, lightweight profiling, or breadth of reports.
| Tool | Strongest at | Notes |
|---|---|---|
| Evidently | Broad drift + data-quality reports, LLM eval | Picks per-feature drift tests automatically; good first install |
| NannyML | Performance estimation without labels (CBPE, DLE) | The answer when ground truth is delayed or absent |
| Alibi Detect | Online/streaming drift, outlier, adversarial detection | Algorithm depth for streaming; from the Seldon ecosystem |
| whylogs | Lightweight profiling at scale, privacy-sensitive logging | Profiles instead of raw data; drift from profiles |
An empirical study of these tools (arXiv:2404.18673) is worth reading before you standardize on one: the authors find that different tools flag drift at different times on the same data, because they implement different statistics with different default thresholds. There is no neutral “drift detector”; there is a specific test with a specific sensitivity, and you own the choice.
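You can reproduce the disagreement without installing anything. The sketch below runs two common detectors on the same synthetic data: a Kolmogorov-Smirnov test with an asymptotic p-value, and a Wasserstein distance normalized by the reference standard deviation. The data, the 0.05 significance level, and the 0.1 distance threshold are illustrative assumptions, not any tool's documented defaults.

```python
import math
import statistics


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value for the two-sample KS statistic."""
    lam = d * math.sqrt(n * m / (n + m))
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * series))


def normalized_wasserstein(reference, production):
    """1-D Wasserstein distance for equal-size samples, over reference std."""
    pairs = zip(sorted(reference), sorted(production))
    distance = statistics.fmean(abs(x - y) for x, y in pairs)
    return distance / statistics.pstdev(reference)


# Synthetic data: a uniform grid on [0, 1) and the same grid shifted by 0.02.
n = 20_000
reference = [i / n for i in range(n)]
production = [x + 0.02 for x in reference]

d = ks_statistic(reference, production)
ks_flags_drift = ks_p_value(d, n, n) < 0.05
wasserstein_flags_drift = normalized_wasserstein(reference, production) > 0.1
print(ks_flags_drift, wasserstein_flags_drift)
```

On this data the p-value test reports drift and the effect-size test does not. Neither is wrong; they answer different questions, which is why the choice of statistic and threshold is yours to own.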
How to assemble these into a monitoring layer
The taxonomy is not a shopping list to implement in full on day one. The priority order that holds across the deployments I have seen:
- Data quality first. It causes the most incidents and is the cheapest to catch. Schema, null rates, ranges, freshness, training-serving skew. If you instrument nothing else this quarter, instrument this.
- Prediction drift second. One dimension, real-time, the earliest measurable signal. Cheap to add, high signal-to-noise when you maintain a sane reference.
- Performance / concept drift third. Real performance metrics where labels exist; estimated performance (NannyML) where they don’t. This is the family that maps to actual harm, so it is worth the extra engineering even though it is the hardest.
- Input drift last, as diagnosis not alarm. Per-feature and multivariate drift are most useful for explaining a confirmed performance problem (“which features moved when AUC dropped”), and least useful as standalone pagers. Compute the share-of-drifting-features aggregate, alert on that, and keep the 200 per-feature charts for investigation.
Wire each signal to a severity and a runbook the same way you would any detection rule: a data-quality schema break is a hard fail that blocks the batch; a prediction-drift shift is an investigate-and-confirm; an estimated-performance drop below the confidence band is a page. Metrics without thresholds and runbooks are just charts, and charts do not catch anything.
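One way to keep that wiring honest is to make the signal-to-severity mapping explicit data and refuse to route anything that lacks a rule. The signal names and runbook paths below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    signal: str
    severity: str  # "block", "investigate", or "page"
    runbook: str   # hypothetical path to the runbook for this signal


RULES = {
    rule.signal: rule
    for rule in (
        AlertRule("schema_break", "block", "runbooks/data-quality.md"),
        AlertRule("prediction_drift", "investigate", "runbooks/prediction-drift.md"),
        AlertRule("estimated_performance_drop", "page", "runbooks/performance.md"),
    )
}


def route(signal):
    """Look up the rule for a fired signal; unknown signals are an error."""
    rule = RULES.get(signal)
    if rule is None:
        raise KeyError(f"no alert rule for signal {signal!r}")
    return rule


print(route("schema_break"))
```

Raising on an unmapped signal is a design choice: it forces every new metric to arrive with a severity and a runbook instead of becoming one more chart.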
Sources
- Evidently AI, What is Data Drift in ML: practical reference for the per-feature drift tests and data-quality checks described above.
- NannyML: performance estimation without ground truth (CBPE for classification, DLE for regression), the tool for the no-label case.
- Alibi Detect: online/streaming drift, outlier, and adversarial detection algorithms.
- whylogs: statistical-profile-based data logging for lightweight, privacy-preserving monitoring.
- Open-Source Drift Detection Tools in Action (arXiv:2404.18673): empirical comparison showing the same data produces different drift verdicts across tools.