All posts
-
Model Monitoring Tools in 2026: What's Changed, What to Use Now
The model monitoring tools landscape shifted in 2026 — WhyLabs shut down, LLM observability went mainstream, and open source caught up to managed SaaS. Here's the current map.
-
Predicting Model Behavior Before Release: What OpenAI's Deployment Simulation Means for MLOps
OpenAI's Deployment Simulation replays 1.3M real conversations through candidate models before release, hitting 1.5x median error on safety predictions and surfacing behaviors like 'calculator hacking' that conventional evals never find.
-
ML Model Deployment: Serving Frameworks, KV Cache, and the Latency Metrics That Matter
Once a model clears staging, the serving stack decision determines whether you hit your latency SLAs or spend a sprint chasing p99 spikes. Here's what to evaluate and what to instrument.
-
Replaying Production to Catch Drift: Inside OpenAI's Deployment Simulation Framework
OpenAI's deployment simulation replays 1.3M de-identified production conversations through a candidate model pre-release, catching behavior shifts static benchmarks miss. Here's how it works and what it means for teams running their own models.
-
Federated Learning in Production: What Substra Actually Does for Privacy-Preserving ML
Owkin's Substra framework keeps training data local while sharing only model weights — but federated architectures break standard MLOps assumptions around
-
OpenAI Tops Gartner's Coding-Agent Quadrant. Now You Own a Production ML System.
Gartner named OpenAI a Leader in its first Magic Quadrant for Enterprise AI Coding Agents. The operational story is the part the press release skips: a
-
The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay
A reference taxonomy of the signals that actually tell you a production ML system is failing — input drift, prediction drift, concept drift, data quality
-
OpenTelemetry GenAI Semantic Conventions: Instrument LLM Apps
How the OpenTelemetry GenAI semantic conventions standardize spans, metrics, and events for LLM apps, what they skip, and how to instrument without rework.
-
Model Monitoring in Production: A Four-Layer Framework
Model monitoring covers more than drift detection. Here's the four-layer framework — software health, data quality, model quality, business KPIs — wired
-
Model Monitoring for LLM Inference: Metrics Your APM Can't See
Model monitoring for LLM APIs requires a different metric set than traditional ML. Here's the signal hierarchy — TTFT, KV cache hit rate, output length
-
SmithDB and Five Other Things LangChain Shipped at Interrupt 2026
LangChain's Interrupt 2026 surfaced a purpose-built trace database, a context version-control system, and an automated failure-triage engine.
-
LLM Benchmarks in 2026: Which Still Discriminate, and How to Run
Static benchmarks like MMLU and HumanEval have saturated for frontier models. Here's which LLM benchmarks still produce signal, why contamination is worse
-
Watermarking Should Be Treated as a Monitoring Primitive
A new paper reframes LLM watermarking from an adversarial evasion problem into a monitoring infrastructure question.
-
LLM Fine Tuning: Methods, Training Data, and Evaluation
A practitioner's guide to LLM fine tuning: how to pick between SFT, LoRA, and DPO, what your training data actually needs, and how to validate a
-
LLM Testing: A Guide to Evals, Metrics, and Production Monitoring
LLM testing spans offline evals, CI gate checks, and live production monitoring — three distinct jobs that need different tools.
-
ML Testing: A Checklist from Pre-Train Checks to Production Drift
ML testing spans pre-train sanity checks, behavioral validation, data integrity, and continuous drift monitoring.
-
Choosing MLOps Tools: A Decision Framework for Production Teams
Picking the wrong MLOps tools costs months of migration work. Here's how to evaluate experiment tracking, orchestration, monitoring, and serving options
-
When Embedding-Based Defenses Fail in Multi-Agent LLMs
A new arXiv paper shows that embedding-distance detectors miss three classes of adversarial agent. The fix lives in your observability stack, not your
-
LLM Benchmarks Explained: What the Numbers Mean and Miss
A practical guide to the major LLM benchmarks — MMLU, HumanEval, GPQA Diamond, SWE-bench — what they actually test, why saturation makes most scores
-
LLM Fine Tuning in Production: A Practical MLOps Guide
When to use LLM fine tuning over RAG, how LoRA and QLoRA cut GPU costs, and what to monitor after you ship a fine-tuned model — for ML engineers who own
-
Machine Learning Pipeline: Stages, Failure Points, and Monitoring
A practitioner's guide to the machine learning pipeline — from data ingestion to production monitoring — covering common failure points, drift types, and
-
ML Model Deployment: A Guide to Shipping Models That Stay Healthy
ML model deployment fails far more often than it should — typically before the model ever serves traffic. Here's what breaks, which deployment patterns
-
MLOps Best Practices: What Keeps Models Running in Production
A practitioner's guide to MLOps best practices, from CI/CD pipeline automation and model versioning to drift detection and continuous retraining, based
-
MLOps Tools: A Practitioner's Map of the Production Stack
A category-by-category breakdown of MLOps tools — experiment tracking, orchestration, feature stores, serving, and monitoring — with honest tradeoffs for
-
Model Monitoring Tools: A Technical Comparison for ML Teams
Evidently, Arize, WhyLabs, Fiddler, NannyML, Alibi Detect — how each tool actually detects drift, what it costs to run, and which one fits your stack.
-
Model Monitoring in Production: What to Track and When to Act
A practical guide to model monitoring for ML engineers: drift types, the metrics that actually matter, handling the no-ground-truth problem, and which
-
OpenAI's DeployCo Pushes the Observability Problem Onto You
OpenAI's new $10B deployment subsidiary will build production AI systems inside enterprises. What that means for ML platform teams who inherit the runbook
-
Detection Engineering for LLM Apps: A MITRE ATLAS Runbook
Mapping LLM application telemetry to MITRE ATLAS techniques. Concrete log shapes, alerting heuristics, and a runbook structure that scales beyond ad-hoc
-
A Lean 4 Stability Proof for Tool-Mediated LLM Agents
A new arXiv paper certifies controllability and ISS robustness for an LLM-driven SOC agent using Lean 4. The MLOps takeaway is simpler than the math
-
The Agent Authority Gap Is an Observability Problem
Orchid Security's framing of agent governance as a delegation problem lands in the lap of ML observability teams.
-
Local Coding Assistants Crossed the Quality Bar: Now Observe Them
A practitioner's Reddit report on running Qwen3.6-27B locally signals a real inflection point. But moving off managed cloud APIs shifts monitoring