SentryML

The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay

A reference taxonomy of the signals that actually tell you a production ML system is failing — input drift, prediction drift, concept drift, data quality, and performance decay — and which open-source tool computes each one.

By Priya Anand · 8 min read

The hardest part of ML monitoring is not building dashboards. It is knowing which number, when it moves, actually means your model is in trouble — and which numbers are noise that will page you at 3am for nothing. Most teams I review have inverted this: they alert on the metrics that are easy to compute (request latency, error rate, a single accuracy score on a stale holdout) and ignore the ones that actually predict production failure.

This post is a reference taxonomy. It organizes the signals worth monitoring into four families, defines each one precisely enough that you can implement it, and names the open-source tool that computes it. It is meant to be the page you come back to when you are deciding what to instrument, not a one-time read.

The four families are: input drift (the data coming in changed), prediction drift (the model’s outputs changed), concept drift (the relationship between inputs and the target changed), and data quality (the inputs are broken in ways that have nothing to do with distribution). Performance decay — the metric everyone wants — is downstream of all four, and the central operational problem is that you usually cannot measure it directly in production because the labels arrive late or never.

Family 1: Input drift

Input drift is a change in the distribution of the features your model receives, relative to a reference window (typically the training set or a recent stable period). It is the most-monitored and most-misunderstood family, because drift is not the same as degradation. A feature can drift substantially without the model losing any accuracy, and a model can decay badly while every input feature looks stable. Input drift is a leading indicator worth watching, not a verdict.

The signals worth computing:

  • Per-feature distributional distance. For each feature, compare the production distribution to the reference. For numeric features, common choices are the Kolmogorov-Smirnov statistic, the Wasserstein (earth-mover’s) distance, and Population Stability Index (PSI). For categorical features, use Jensen-Shannon divergence or a chi-squared test. Evidently picks a default test per feature type and sample size, which is a reasonable starting heuristic before you tune.
  • Multivariate drift. Per-feature tests miss correlation shifts: each feature’s marginal looks fine while their joint distribution moved. The practical approaches are a domain classifier (train a model to distinguish reference from production samples; if it succeeds, the distributions differ) and PCA-reconstruction-error drift, which NannyML implements as its data-reconstruction method.
  • Share of drifting features. A single aggregate: what fraction of monitored features crossed their drift threshold this window. This is the number to put on a dashboard and alert on, not 200 individual feature charts.
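A minimal sketch of the per-feature and aggregate signals above, using PSI only and the Python standard library. The equal-width binning, the 0.25 threshold, and the feature names are assumptions made for illustration; production libraries make their own choices about binning and default tests.

```python
import math
from typing import Dict, List, Sequence


def psi(reference: Sequence[float], production: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, prod = bin_shares(reference), bin_shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


def share_of_drifting_features(
    reference: Dict[str, Sequence[float]],
    production: Dict[str, Sequence[float]],
    threshold: float = 0.25,
) -> float:
    """Fraction of monitored features whose PSI exceeds the threshold."""
    drifted = [name for name in reference if psi(reference[name], production[name]) > threshold]
    return len(drifted) / len(reference)


# Hypothetical data: "age" is unchanged, "amount" shifts upward by 50.
reference = {
    "age": [20 + (i % 50) for i in range(1000)],
    "amount": [float(i % 100) for i in range(1000)],
}
production = {
    "age": [20 + (i % 50) for i in range(1000)],
    "amount": [50.0 + (i % 100) for i in range(1000)],
}
drift_share = share_of_drifting_features(reference, production)
print(f"share of drifting features: {drift_share:.2f}")
```

The single `drift_share` number is what goes on the dashboard; the per-feature PSI values stay available for investigation.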

A note on thresholds. PSI has folklore thresholds (>0.1 “moderate,” >0.25 “significant”) that are fine as defaults but should be recalibrated against your own baseline false-positive rate, exactly as you would tune any detection rule. Drift tests on large samples will flag statistically significant differences that are operationally meaningless; prefer effect-size measures (Wasserstein, PSI) over p-value tests at high volume.
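The large-sample problem can be seen with a small experiment. The sketch below applies the same 0.01 shift to a synthetic uniform feature at two sample sizes and compares the two-sample Kolmogorov-Smirnov statistic against its large-sample critical value at alpha = 0.05 (coefficient 1.358). The data and the helper names are assumptions for illustration, written in plain Python rather than with any monitoring library.

```python
import math
from typing import Sequence, Tuple


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every sample equal to x in both arrays (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical_value(n: int, m: int, c_alpha: float = 1.358) -> float:
    """Large-sample critical value; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))


def wasserstein_equal_n(a: Sequence[float], b: Sequence[float]) -> float:
    """Wasserstein-1 distance for two equal-size samples (mean gap between sorted values)."""
    a, b = sorted(a), sorted(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def compare(n: int, shift: float = 0.01) -> Tuple[float, float, float]:
    """Return (KS statistic, KS critical value, Wasserstein-1) for a shifted uniform grid."""
    reference = [(i + 0.5) / n for i in range(n)]
    production = [x + shift for x in reference]
    return (
        ks_statistic(reference, production),
        ks_critical_value(n, n),
        wasserstein_equal_n(reference, production),
    )


small = compare(1_000)
large = compare(100_000)
for label, (d, critical, w1) in (("n=1,000", small), ("n=100,000", large)):
    print(f"{label}: KS={d:.4f} critical={critical:.4f} flagged={d > critical} W1={w1:.4f}")
```

The effect size (Wasserstein distance of about 0.01) is the same in both runs, but only the larger sample crosses the significance threshold. Whether a 0.01 shift matters is an operational question that the p-value cannot answer.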

Family 2: Prediction drift

Prediction drift is a change in the distribution of the model’s outputs — the predicted class probabilities, the regression outputs, or the rate of each predicted label. It is cheaper and faster than input drift to compute (one dimension, the output, instead of N features) and it is closer to the thing you care about.

The signals:

  • Output distribution distance. Same distance measures as input drift, applied to the prediction. A classifier whose positive-rate jumps from 4% to 19% with no change in upstream business volume is telling you something — either the input changed in a way your per-feature tests missed, or the model is responding to a genuine population shift.
  • Confidence/score distribution. Track the distribution of the model’s confidence (max softmax probability, or the score margin). A collapse toward the decision boundary (for a binary classifier, more predictions clustered near 0.5) often precedes accuracy loss even when the predicted-label rate looks stable.
  • Prediction drift as an early-warning proxy. Because predictions are available in real time and labels are not, prediction drift is frequently the earliest measurable signal of trouble. Treat a sustained prediction-distribution shift as a trigger to investigate, then confirm with whatever delayed-label signal you can get.
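Two of these signals fit in a few lines. The sketch below assumes a binary classifier that emits a score in [0, 1] with a 0.5 decision threshold; the 0.1 band width and the example scores are hypothetical.

```python
from typing import Sequence


def positive_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Share of predictions at or above the decision threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def near_boundary_share(scores: Sequence[float], threshold: float = 0.5, band: float = 0.1) -> float:
    """Share of predictions whose score lies within `band` of the threshold."""
    return sum(abs(s - threshold) <= band for s in scores) / len(scores)


# Hypothetical score windows with the same predicted-label rate.
reference_scores = [0.05] * 96 + [0.95] * 4
production_scores = [0.05] * 60 + [0.45] * 36 + [0.55] * 4

print(positive_rate(reference_scores), positive_rate(production_scores))
print(near_boundary_share(reference_scores), near_boundary_share(production_scores))
```

In this example the positive rate is 4% in both windows, so a label-rate chart shows nothing, while the near-boundary share moves from 0% to 40%. That is the confidence collapse described above.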

Prediction drift has the same trap as input drift: it can move for legitimate reasons (real seasonality, a real change in the population). The discipline is to maintain a reference that reflects healthy variation, not a single frozen snapshot from training day.
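One simple way to encode healthy variation is to derive the alert band from a history of known-good windows instead of a single snapshot. The sketch below is an assumption-laden illustration: the weekday/weekend rates, the min/max band, and the 20% tolerance are all hypothetical choices, not a recommended default.

```python
from typing import Sequence, Tuple


def healthy_band(history: Sequence[float], tolerance: float = 0.2) -> Tuple[float, float]:
    """Band spanning the healthy history, widened by a fraction of its range."""
    lo, hi = min(history), max(history)
    margin = tolerance * (hi - lo)
    return lo - margin, hi + margin


def outside_band(value: float, band: Tuple[float, float]) -> bool:
    """True when the value falls outside the healthy band."""
    return value < band[0] or value > band[1]


# Hypothetical daily positive rates over four healthy weeks:
# weekdays run near 4%, weekends near 7%.
healthy_history = ([0.04] * 5 + [0.07] * 2) * 4
band = healthy_band(healthy_history)

weekend_alert = outside_band(0.07, band)  # ordinary weekend value
shift_alert = outside_band(0.19, band)    # the 4% to 19% jump
print(band, weekend_alert, shift_alert)
```

A reference frozen on a weekday would treat every weekend as drift. A band built from several healthy weeks lets the weekend through and still flags the jump to 19%.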

Family 3: Concept drift

Concept drift is the one that actually hurts. It is a change in the relationship between inputs and the target — the function the model approximated has moved, so the same input now maps to a different correct output. Fraud patterns evolve, user behavior shifts, an upstream system changes what a field means. Critically, concept drift can occur with zero input drift: the inputs look identical, but the right answer changed.

The signals:

  • Performance metrics, when labels exist. If you get ground truth (even delayed), the gold-standard signal is the trend in your task metric — accuracy, F1, AUC, RMSE — computed on rolling production windows against a confidence interval, not a single point estimate. A 2-point AUC drop inside the noise band is not an incident; a sustained slide outside it is.
  • Estimated performance, when labels do not exist. This is the common case, and it is where NannyML is purpose-built. Its Confidence-Based Performance Estimation (CBPE) uses the model’s own calibrated probabilities to estimate metrics like ROC-AUC before labels arrive; its Direct Loss Estimation (DLE) targets regression. These are estimates with assumptions (chiefly that the model stays well-calibrated and that no concept drift invalidates the calibration), so treat them as a strong proxy that you reconcile against true labels when they land.
  • Error-region analysis. When labels do arrive, do not just track the aggregate metric — slice it. Concept drift often concentrates in a segment (one region, one device class, one cohort) while the global metric barely moves. The segment that breaks first is the most actionable signal you have.
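The three signals above can be sketched with the standard library. The `estimated_accuracy` function is a simplified illustration of the idea behind confidence-based estimation for accuracy only. It is not NannyML's CBPE implementation, and it relies on the same assumption: the scores are calibrated probabilities. The Wilson interval and the segment names are likewise illustrative choices.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def estimated_accuracy(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Expected accuracy if `probs` are calibrated P(y=1) scores.

    A class-1 prediction with probability p is correct with probability p;
    a class-0 prediction is correct with probability 1 - p.
    """
    return sum(p if p >= threshold else 1.0 - p for p in probs) / len(probs)


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def accuracy_by_segment(records: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Accuracy per segment from (segment, y_true, y_pred) records."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for segment, y_true, y_pred in records:
        totals[segment] += 1
        hits[segment] += int(y_true == y_pred)
    return {segment: hits[segment] / totals[segment] for segment in totals}


# Before labels arrive: estimate from hypothetical calibrated scores.
estimate = estimated_accuracy([0.9] * 50 + [0.1] * 50)

# After labels arrive: 88 of 100 predictions turned out correct.
low, high = wilson_interval(correct=88, total=100)
reconciled = low <= estimate <= high

# Slice the labeled results by a hypothetical segment column.
records = (
    [("web", 1, 1)] * 95 + [("web", 1, 0)] * 5
    + [("mobile", 1, 1)] * 12 + [("mobile", 1, 0)] * 8
)
by_segment = accuracy_by_segment(records)
print(f"estimate={estimate:.2f} interval=({low:.2f}, {high:.2f}) reconciled={reconciled}")
print(by_segment)
```

The estimate is only useful if you keep checking it: when labels land, test whether it falls inside the interval around the realized metric. In the segment example the global accuracy is about 89%, which hides a mobile segment at 60%.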

The honest framing: input and prediction drift are observable immediately but only correlate with harm; concept drift is what causes harm but is observable only with labels or estimation. A monitoring program that watches only the cheap upstream signals is watching shadows. One that waits for labeled performance is always late. You need both layers, wired together.

Family 4: Data quality

Data quality failures are not distribution shifts — they are the pipeline being broken. They are the single most common cause of real production ML incidents, and they are the cheapest to catch, which makes ignoring them inexcusable.

The signals:

  • Schema and type conformance. Did a column disappear, change type, or get renamed upstream? A float field arriving as a string, or a feature that is suddenly 100% null, breaks inference regardless of any model property.
  • Missing-value and null rates per feature, against an expected baseline. A null rate jumping from 2% to 40% is a broken upstream join, not drift.
  • Range and constraint violations. Out-of-range numerics (a negative age, a price of zero), unseen categorical values, cardinality explosions. These are validation rules, and they should fail loud and early.
  • Volume and freshness. Row counts per batch and the recency of the newest record. A pipeline silently delivering yesterday’s data is a class of failure no distribution test will catch.
  • Training-serving skew. The same transformation applied differently in training versus serving — a normalization constant, a tokenizer version, a default-fill value. This is a quiet, high-impact failure that lives at the boundary between two codebases.
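Most of these checks are plain validation rules. The sketch below covers schema, null rate, range, unseen categories, volume, and freshness for one batch. The column names, limits, and allowed values are hypothetical, and training-serving skew is left out because it needs access to both codebases.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

# Hypothetical expectations for a batch; names and limits are illustrative.
EXPECTED_SCHEMA = {"age": int, "price": float, "country": str}
MAX_NULL_RATE = 0.05
KNOWN_COUNTRIES = {"DE", "FR", "US"}
MAX_STALENESS = timedelta(hours=24)


def validate_batch(rows: List[Dict[str, Any]], newest_record_at: datetime, now: datetime) -> List[str]:
    """Return human-readable violations; an empty list means the batch passed."""
    if not rows:
        return ["volume: batch is empty"]
    problems: List[str] = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if not any(column in row for row in rows):
            problems.append(f"schema: column '{column}' is missing")
            continue
        present = [row[column] for row in rows if row.get(column) is not None]
        null_rate = 1 - len(present) / len(rows)
        if null_rate > MAX_NULL_RATE:
            problems.append(f"nulls: '{column}' null rate is {null_rate:.0%}")
        if any(not isinstance(v, expected_type) for v in present):
            problems.append(f"schema: '{column}' has non-{expected_type.__name__} values")
    for row in rows:
        age, price, country = row.get("age"), row.get("price"), row.get("country")
        if isinstance(age, int) and age < 0:
            problems.append(f"range: negative age {age}")
        if isinstance(price, float) and price <= 0:
            problems.append(f"range: non-positive price {price}")
        if isinstance(country, str) and country not in KNOWN_COUNTRIES:
            problems.append(f"category: unseen country '{country}'")
    if now - newest_record_at > MAX_STALENESS:
        problems.append("freshness: newest record is older than 24 hours")
    return problems


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
good_batch = [
    {"age": 34, "price": 19.99, "country": "DE"},
    {"age": 51, "price": 5.0, "country": "US"},
]
bad_batch = [
    {"age": -3, "price": "19.99", "country": "DE"},  # negative age, price as string
    {"age": None, "price": 0.0, "country": "ZZ"},    # null age, zero price, unseen country
]
good_problems = validate_batch(good_batch, now - timedelta(hours=1), now)
bad_problems = validate_batch(bad_batch, now - timedelta(days=2), now)
for problem in bad_problems:
    print(problem)
```

Returning a list of violations rather than raising on the first one keeps the whole picture in a single alert. The caller decides whether any violation blocks the batch.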

whylogs is built around this family: it computes lightweight statistical profiles (counts, null rates, distribution sketches, type counts) of your data without storing the raw records, which makes it well-suited to high-volume or privacy-sensitive pipelines where you cannot retain inputs. Evidently ships data-quality test suites alongside its drift reports.
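The profile-instead-of-records idea can be illustrated without any library. The `ColumnProfile` class below is a hypothetical, much-reduced stand-in and not the whylogs API: it tracks only counts, nulls, sum, min, and max, where a real profiler would also keep distribution sketches. The point it shows is that profiles from separate shards merge into one summary with no raw values retained.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ColumnProfile:
    """Mergeable summary of one numeric column; stores no raw values."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value: Optional[float]) -> None:
        """Fold one observed value into the summary."""
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ColumnProfile") -> "ColumnProfile":
        """Combine two profiles as if both had seen all the data."""
        return ColumnProfile(
            count=self.count + other.count,
            nulls=self.nulls + other.nulls,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def null_rate(self) -> float:
        return self.nulls / self.count if self.count else 0.0

    @property
    def mean(self) -> float:
        non_null = self.count - self.nulls
        return self.total / non_null if non_null else math.nan


def profile(values: Iterable[Optional[float]]) -> ColumnProfile:
    """Build a profile from an iterable of values."""
    result = ColumnProfile()
    for value in values:
        result.track(value)
    return result


# Two hypothetical shards profiled independently, then merged.
shard_a = profile([1.0, 2.0, None])
shard_b = profile([3.0, None, None, 6.0])
merged = shard_a.merge(shard_b)
print(merged, merged.null_rate, merged.mean)
```

Because `merge` needs only the summaries, each shard can discard its raw rows immediately after profiling. That property is what makes this style of logging workable for high-volume or privacy-sensitive pipelines.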

What computes what

A practitioner’s mapping, current as of 2026. All four are open-source Python libraries; the right choice depends on whether your bottleneck is no-label performance estimation, online streaming detection, lightweight profiling, or breadth of reports.

| Tool | Strongest at | Notes |
| --- | --- | --- |
| Evidently | Broad drift + data-quality reports, LLM eval | Picks per-feature drift tests automatically; good first install |
| NannyML | Performance estimation without labels (CBPE, DLE) | The answer when ground truth is delayed or absent |
| Alibi Detect | Online/streaming drift, outlier, adversarial detection | Algorithm depth for streaming; from the Seldon ecosystem |
| whylogs | Lightweight profiling at scale, privacy-sensitive logging | Profiles instead of raw data; drift from profiles |

An empirical study of these tools (arXiv:2404.18673) is worth reading before you standardize on one: the authors find that different tools flag drift at different times on the same data, because they implement different statistics with different default thresholds. There is no neutral “drift detector” — there is a specific test with a specific sensitivity, and you own the choice.

How to assemble these into a monitoring layer

The taxonomy is not a shopping list to implement in full on day one. The priority order that holds across the deployments I have seen:

  1. Data quality first. It causes the most incidents and is the cheapest to catch. Schema, null rates, ranges, freshness, training-serving skew. If you instrument nothing else this quarter, instrument this.
  2. Prediction drift second. One dimension, real-time, the earliest measurable signal. Cheap to add, high signal-to-noise when you maintain a sane reference.
  3. Performance / concept drift third. Real performance metrics where labels exist; estimated performance (NannyML) where they don’t. This is the family that maps to actual harm, so it is worth the extra engineering even though it is the hardest.
  4. Input drift last, as diagnosis not alarm. Per-feature and multivariate drift are most useful for explaining a confirmed performance problem (“which features moved when AUC dropped”), and least useful as standalone pagers. Compute the share-of-drifting-features aggregate, alert on that, and keep the 200 per-feature charts for investigation.

Wire each signal to a severity and a runbook the same way you would any detection rule: a data-quality schema break is a hard fail that blocks the batch; a prediction-drift shift is an investigate-and-confirm; an estimated-performance drop below the confidence band is a page. Metrics without thresholds and runbooks are just charts, and charts do not catch anything.
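The wiring described above can be as small as a lookup table. In this sketch the signal names and runbook paths are hypothetical placeholders; the severities mirror the three examples in the paragraph.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Severity(Enum):
    BLOCK_BATCH = "block_batch"  # hard fail: stop the batch
    INVESTIGATE = "investigate"  # open a ticket, confirm with delayed labels
    PAGE = "page"                # wake someone up


@dataclass(frozen=True)
class Rule:
    severity: Severity
    runbook: str  # hypothetical runbook location


RULES: Dict[str, Rule] = {
    "schema_break": Rule(Severity.BLOCK_BATCH, "runbooks/data-quality-schema"),
    "prediction_drift": Rule(Severity.INVESTIGATE, "runbooks/prediction-drift"),
    "estimated_performance_drop": Rule(Severity.PAGE, "runbooks/performance-drop"),
}


def route(signal: str) -> Rule:
    """Look up the rule for a fired signal; unknown signals are a configuration error."""
    rule = RULES.get(signal)
    if rule is None:
        raise KeyError(f"signal '{signal}' has no severity or runbook configured")
    return rule


for name in RULES:
    rule = route(name)
    print(f"{name}: {rule.severity.value} -> {rule.runbook}")
```

Raising on an unrouted signal is a deliberate choice here: a metric that fires with no severity and no runbook attached is exactly the "just a chart" case.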

Sources

  1. Evidently AI — What is Data Drift in ML
  2. NannyML — Performance Estimation Without Ground Truth
  3. Alibi Detect — Drift, Outlier, and Adversarial Detection
  4. whylogs — Data Logging and Profiling Library
  5. Open-Source Drift Detection Tools in Action (arXiv:2404.18673)