Topics
Browse posts by category and tag — every topic we cover, with the latest pieces under each.
Tags
- #mlops 30
- #monitoring 13
- #observability 9
- #model-monitoring 7
- #drift 6
- #drift-detection 6
- #tooling 6
- #evaluation 5
- #llm 5
- #llm-observability 3
- #agent-observability 2
- #agents 2
- #benchmarks 2
- #ci-cd 2
- #data-quality 2
- #fine-tuning 2
- #inference 2
- #llm-monitoring 2
- #llm-security 2
- #lora 2
- #model-deployment 2
- #model-selection 2
- #opentelemetry 2
- #orchestration 2
- #serving 2
- #agent-telemetry 1
- #attribution 1
- #blue-team 1
- #concept-drift 1
- #data-drift 1
- #data-governance 1
- #data-validation 1
- #deployment 1
- #deployment-simulation 1
- #detection-engineering 1
- #dpo 1
- #evals 1
- #evidently 1
- #experiment-tracking 1
- #feature-store 1
- #federated-learning 1
- #formal-methods 1
- #governance 1
- #identity 1
- #incident-response 1
- #infra 1
- #instrumentation 1
- #langsmith 1
- #latency 1
- #llm-safety 1
- #local-llm 1
- #metrics 1
- #mitre-atlas 1
- #ml-testing 1
- #mlops-tools 1
- #model-behavior 1
- #model-decay 1
- #model-drift 1
- #model-validation 1
- #multi-agent 1
- #openai 1
- #pipelines 1
- #platform-engineering 1
- #pre-deployment 1
- #pre-deployment-evaluation 1
- #privacy 1
- #production-ml 1
- #provenance 1
- #psi 1
- #retraining 1
- #runbook 1
- #safety 1
- #shadow-testing 1
- #siem 1
- #testing 1
- #tracing 1
- #ttft 1
- #versioning 1
- #vllm 1
- #watermarking 1
Categories
mlops 11 posts
- ML Model Deployment: Serving Frameworks, KV Cache, and the Latency Metrics That Matter
  Once a model clears staging, the serving stack decision determines whether you hit your latency SLAs or spend a sprint chasing p99 spikes. Here's what to evaluate and what to instrument.
- LLM Benchmarks in 2026: Which Still Discriminate, and How to Run
  Static benchmarks like MMLU and HumanEval have saturated for frontier models. Here's which LLM benchmarks still produce signal, why contamination is worse
- LLM Fine Tuning: Methods, Training Data, and Evaluation
  A practitioner's guide to LLM fine-tuning: how to pick between SFT, LoRA, and DPO, what your training data actually needs, and how to validate a
- ML Testing: A Checklist from Pre-Train Checks to Production Drift
  ML testing spans pre-train sanity checks, behavioral validation, data integrity, and continuous drift monitoring.
- Choosing MLOps Tools: A Decision Framework for Production Teams
  Picking the wrong MLOps tools costs months of migration work. Here's how to evaluate experiment tracking, orchestration, monitoring, and serving options
- LLM Benchmarks Explained: What the Numbers Mean and Miss
  A practical guide to the major LLM benchmarks — MMLU, HumanEval, GPQA Diamond, SWE-bench — what they actually test, why saturation makes most scores
monitoring 9 posts
- OpenAI Tops Gartner's Coding-Agent Quadrant. Now You Own a Production ML System.
  Gartner named OpenAI a Leader in its first Magic Quadrant for Enterprise AI Coding Agents. The operational story is the part the press release skips: a
- The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay
  A reference taxonomy of the signals that actually tell you a production ML system is failing — input drift, prediction drift, concept drift, data quality
- OpenTelemetry GenAI Semantic Conventions: Instrument LLM Apps
  How the OpenTelemetry GenAI semantic conventions standardize spans, metrics, and events for LLM apps, what they skip, and how to instrument without rework.
- Model Monitoring in Production: A Four-Layer Framework
  Model monitoring covers more than drift detection. Here's the four-layer framework — software health, data quality, model quality, business KPIs — wired
- Model Monitoring for LLM Inference: Metrics Your APM Can't See
  Model monitoring for LLM APIs requires a different metric set than traditional ML. Here's the signal hierarchy — TTFT, KV cache hit rate, output length
- Watermarking Should Be Treated as a Monitoring Primitive
  A new paper reframes LLM watermarking from an adversarial evasion problem into a monitoring infrastructure question.
deep-dive 5 posts
- Predicting Model Behavior Before Release: What OpenAI's Deployment Simulation Means for MLOps
  OpenAI's Deployment Simulation replays 1.3M real conversations through candidate models before release, hitting 1.5x median error on safety predictions and surfacing behaviors like 'calculator hacking' that conventional evals never find.
- Replaying Production to Catch Drift: Inside OpenAI's Deployment Simulation Framework
  OpenAI's deployment simulation replays 1.3M de-identified production conversations through a candidate model pre-release, catching behavior shifts static benchmarks miss. Here's how it works and what it means for teams running their own models.
- When Embedding-Based Defenses Fail in Multi-Agent LLMs
  A new arXiv paper shows that embedding-distance detectors miss three classes of adversarial agent. The fix lives in your observability stack, not your
- OpenAI's DeployCo Pushes the Observability Problem Onto You
  OpenAI's new $10B deployment subsidiary will build production AI systems inside enterprises. What that means for ML platform teams who inherit the runbook
- The Agent Authority Gap Is an Observability Problem
  Orchid Security's framing of agent governance as a delegation problem lands in the lap of ML observability teams.
tooling 4 posts
- Model Monitoring Tools in 2026: What's Changed, What to Use Now
  The model monitoring tools landscape shifted in 2026 — WhyLabs shut down, LLM observability went mainstream, and open source caught up to managed SaaS. Here's the current map.
- Federated Learning in Production: What Substra Actually Does for Privacy-Preserving ML
  Owkin's Substra framework keeps training data local while sharing only model weights — but federated architectures break standard MLOps assumptions around
- SmithDB and Five Other Things LangChain Shipped at Interrupt 2026
  LangChain's Interrupt 2026 surfaced a purpose-built trace database, a context version-control system, and an automated failure-triage engine.
- Model Monitoring Tools: A Technical Comparison for ML Teams
  Evidently, Arize, WhyLabs, Fiddler, NannyML, Alibi Detect — how each tool actually detects drift, what it costs to run, and which one fits your stack.