All posts

Model Monitoring Tools in 2026: What's Changed, What to Use Now

The model monitoring tools landscape shifted in 2026 — WhyLabs shut down, LLM observability went mainstream, and open source caught up to managed SaaS. Here's the current map.
June 21, 2026
Predicting Model Behavior Before Release: What OpenAI's Deployment Simulation Means for MLOps

OpenAI's Deployment Simulation replays 1.3M real conversations through candidate models before release, hitting 1.5x median error on safety predictions and surfacing behaviors like 'calculator hacking' that conventional evals never find.
June 21, 2026
ML Model Deployment: Serving Frameworks, KV Cache, and the Latency Metrics That Matter

Once a model clears staging, the serving stack decision determines whether you hit your latency SLAs or spend a sprint chasing p99 spikes. Here's what to evaluate and what to instrument.
June 20, 2026
Replaying Production to Catch Drift: Inside OpenAI's Deployment Simulation Framework

OpenAI's Deployment Simulation replays 1.3M de-identified production conversations through a candidate model pre-release, catching behavior shifts static benchmarks miss. Here's how it works and what it means for teams running their own models.
June 20, 2026
Federated Learning in Production: What Substra Actually Does for Privacy-Preserving ML

Owkin's Substra framework keeps training data local while sharing only model weights — but federated architectures break standard MLOps assumptions around
June 12, 2026
OpenAI Tops Gartner's Coding-Agent Quadrant. Now You Own a Production ML System.

Gartner named OpenAI a Leader in its first Magic Quadrant for Enterprise AI Coding Agents. The operational story is the part the press release skips: a
June 2, 2026
The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay

A reference taxonomy of the signals that actually tell you a production ML system is failing — input drift, prediction drift, concept drift, data quality
May 22, 2026
OpenTelemetry GenAI Semantic Conventions: Instrument LLM Apps

How the OpenTelemetry GenAI semantic conventions standardize spans, metrics, and events for LLM apps, what they skip, and how to instrument without rework.
May 22, 2026
Model Monitoring in Production: A Four-Layer Framework

Model monitoring covers more than drift detection. Here's the four-layer framework — software health, data quality, model quality, business KPIs — wired
May 15, 2026
Model Monitoring for LLM Inference: Metrics Your APM Can't See

Model monitoring for LLM APIs requires a different metric set than traditional ML. Here's the signal hierarchy — TTFT, KV cache hit rate, output length
May 15, 2026
SmithDB and Five Other Things LangChain Shipped at Interrupt 2026

LangChain's Interrupt 2026 surfaced a purpose-built trace database, a context version-control system, and an automated failure-triage engine.
May 13, 2026
LLM Benchmarks in 2026: Which Still Discriminate, and How to Run

Static benchmarks like MMLU and HumanEval have saturated for frontier models. Here's which LLM benchmarks still produce signal, why contamination is worse
May 13, 2026
Watermarking Should Be Treated as a Monitoring Primitive

A new paper reframes LLM watermarking from an adversarial evasion problem into a monitoring infrastructure question.
May 13, 2026
LLM Fine Tuning: Methods, Training Data, and Evaluation

A practitioner's guide to LLM fine tuning: how to pick between SFT, LoRA, and DPO, what your training data actually needs, and how to validate a
May 11, 2026
LLM Testing: A Guide to Evals, Metrics, and Production Monitoring

LLM testing spans offline evals, CI gate checks, and live production monitoring — three distinct jobs that need different tools.
May 11, 2026
ML Testing: A Checklist from Pre-Train Checks to Production Drift

ML testing spans pre-train sanity checks, behavioral validation, data integrity, and continuous drift monitoring.
May 11, 2026
Choosing MLOps Tools: A Decision Framework for Production Teams

Picking the wrong MLOps tools costs months of migration work. Here's how to evaluate experiment tracking, orchestration, monitoring, and serving options
May 11, 2026
When Embedding-Based Defenses Fail in Multi-Agent LLMs

A new arXiv paper shows that embedding-distance detectors miss three classes of adversarial agent. The fix lives in your observability stack, not your
May 10, 2026
LLM Benchmarks Explained: What the Numbers Mean and Miss

A practical guide to the major LLM benchmarks — MMLU, HumanEval, GPQA Diamond, SWE-bench — what they actually test, why saturation makes most scores
May 10, 2026
LLM Fine Tuning in Production: A Practical MLOps Guide

When to use LLM fine tuning over RAG, how LoRA and QLoRA cut GPU costs, and what to monitor after you ship a fine-tuned model — for ML engineers who own
May 10, 2026
Machine Learning Pipeline: Stages, Failure Points, and Monitoring

A practitioner's guide to the machine learning pipeline — from data ingestion to production monitoring — covering common failure points, drift types, and
May 10, 2026
ML Model Deployment: A Guide to Shipping Models That Stay Healthy

ML model deployment fails far more often than it should — typically before the model ever serves traffic. Here's what breaks, which deployment patterns
May 10, 2026
MLOps Best Practices: What Keeps Models Running in Production

A practitioner's guide to MLOps best practices, from CI/CD pipeline automation and model versioning to drift detection and continuous retraining, based
May 10, 2026
MLOps Tools: A Practitioner's Map of the Production Stack

A category-by-category breakdown of MLOps tools — experiment tracking, orchestration, feature stores, serving, and monitoring — with honest tradeoffs for
May 10, 2026
Model Monitoring Tools: A Technical Comparison for ML Teams

Evidently, Arize, WhyLabs, Fiddler, NannyML, Alibi Detect — how each tool actually detects drift, what it costs to run, and which one fits your stack.
May 10, 2026
Model Monitoring in Production: What to Track and When to Act

A practical guide to model monitoring for ML engineers: drift types, the metrics that actually matter, handling the no-ground-truth problem, and which
May 10, 2026
OpenAI's DeployCo Pushes the Observability Problem Onto You

OpenAI's new $10B deployment subsidiary will build production AI systems inside enterprises. What that means for ML platform teams who inherit the runbook
May 10, 2026
Detection Engineering for LLM Apps: A MITRE ATLAS Runbook

Mapping LLM application telemetry to MITRE ATLAS techniques. Concrete log shapes, alerting heuristics, and a runbook structure that scales beyond ad-hoc
May 6, 2026
A Lean 4 Stability Proof for Tool-Mediated LLM Agents

A new arXiv paper certifies controllability and ISS robustness for an LLM-driven SOC agent using Lean 4. The MLOps takeaway is simpler than the math
May 5, 2026
The Agent Authority Gap Is an Observability Problem

Orchid Security's framing of agent governance as a delegation problem lands in the lap of ML observability teams.
May 4, 2026
Local Coding Assistants Crossed the Quality Bar: Now Observe Them

A practitioner's Reddit report on running Qwen3.6-27B locally signals a real inflection point. But moving off managed cloud APIs shifts monitoring
May 2, 2026