Federated Learning in Production: What Substra Actually Does for Privacy-Preserving ML
Owkin's Substra framework keeps training data local while sharing only model weights — but federated architectures break standard MLOps assumptions around
The scenario is familiar: your organization wants to train a model on patient records from five hospital systems, or on molecular bioactivity data owned by ten competing pharmaceutical companies. The data can’t leave its source — HIPAA, competitive sensitivity, or just basic liability. Centralization is off the table.
Substra, the open-source federated learning framework from Owkin, was built specifically for these environments. The Hugging Face post walking through it is mostly a happy-path introduction, but it surfaces enough architecture detail to be worth unpacking for anyone thinking about what federated learning actually costs operationally.
What Substra Does
At its core, Substra coordinates distributed ML training across participant nodes without ever moving raw data. Only model parameters travel between servers: each data owner runs a compute node, and those nodes exchange weight updates or aggregated weights, never the underlying records.
The framework ships two Python interfaces:
- Substra SDK: low-level, for registering datasets, defining ML functions, and submitting tasks to individual nodes
- SubstraFL: high-level, for running full federated experiments at scale — implements algorithms like FedAvg out of the box
The backend is framework-agnostic. You can run PyTorch, TensorFlow, or sklearn on each node. The orchestrator handles coordination; a web frontend gives you experiment-level visibility. Every ML operation is logged to an auditable, append-only database.
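To make the FedAvg aggregation step concrete, here is a minimal sketch in plain Python. It is not SubstraFL's implementation; the function name `fed_avg`, the node names, and the sample counts are all hypothetical, and weights are flattened to simple lists for readability.

```python
from typing import Dict, List


def fed_avg(updates: Dict[str, List[float]], n_samples: Dict[str, int]) -> List[float]:
    """Average per-node weight vectors, weighting each node by its sample count."""
    total = sum(n_samples[node] for node in updates)
    if total <= 0:
        raise ValueError("nodes reported no training samples")
    dim = len(next(iter(updates.values())))
    global_weights = [0.0] * dim
    for node, weights in updates.items():
        if len(weights) != dim:
            raise ValueError(f"node {node} sent a weight vector of the wrong size")
        share = n_samples[node] / total  # this node's fraction of all samples
        for i, w in enumerate(weights):
            global_weights[i] += share * w
    return global_weights


# Hypothetical round: two nodes, two parameters each.
node_weights = {"hospital_a": [1.0, 2.0], "hospital_b": [3.0, 4.0]}
node_sizes = {"hospital_a": 100, "hospital_b": 300}
aggregated = fed_avg(node_weights, node_sizes)
print(aggregated)  # [2.5, 3.5]
```

Only the weight vectors and sample counts cross the network in this sketch; the records that produced them stay on each node.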
The headline use case from Owkin is the MELLODDY project: ten competing pharmaceutical companies pooling molecular data to build drug discovery models. None of them share raw data with each other. The federated model still beats what any single company could train on its own, precisely because the aggregate dataset is more diverse.
Where Standard MLOps Assumptions Break
This is where it gets operationally interesting, and where most Substra introductions gloss over the complexity.
In a centralized training pipeline, your data is in one place. You can profile it, run distribution checks, slice it by feature, and compare the training distribution to your serving distribution. The whole model monitoring discipline — Evidently, WhyLabs, Arize, and the rest of the model monitoring tooling landscape — assumes you can see the data or at least compute statistics over it.
In a federated setup, you cannot. The data lives on remote nodes you don’t control. This creates a specific class of monitoring blind spots.
Drift detection is harder by default. If one hospital’s imaging data starts drifting (new scanner firmware, a seasonal shift in case mix, a population change), you may not see it in your global model metrics until the degradation is significant. The drift, data-quality, and decay metrics that a centralized job computes for free have to be reconstructed from the node side: each node operator runs local checks and reports summary statistics (not raw data) back to the orchestrator. That has to be built explicitly; it does not happen automatically in Substra.
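One way to build that reporting path, sketched under assumptions: each node bins a feature against bin edges agreed in advance and ships only the counts, and the orchestrator side compares them with a baseline using the population stability index (PSI). The function names, bin edges, and counts below are hypothetical, and none of this is a Substra API. The 0.25 alert level is a common rule of thumb, not a guarantee.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def histogram_counts(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Runs on the node: bin a feature into len(edges) + 1 buckets."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts


def psi(reference: Sequence[int], current: Sequence[int], eps: float = 1e-6) -> float:
    """Runs centrally: population stability index between two count vectors."""
    ref_total, cur_total = sum(reference), sum(current)
    score = 0.0
    for r, c in zip(reference, current):
        p = max(r / ref_total, eps)  # floor avoids log(0) on empty buckets
        q = max(c / cur_total, eps)
        score += (q - p) * math.log(q / p)
    return score


# Hypothetical counts reported by one node at baseline and in the current round.
reference_counts = [10, 40, 40, 10]
current_counts = [5, 20, 40, 35]
drift_score = psi(reference_counts, current_counts)
needs_review = drift_score > 0.25  # assumed alert threshold
```

Small bucket counts can themselves reveal information about individuals, so the bin edges and any minimum-count rules are something to agree with each node operator.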
Non-IID data distribution is structural, not a bug. Federated datasets are inherently non-IID: the samples at each node are not independent draws from one shared distribution. Hospital A sees mostly outpatient cases; Hospital B is a trauma center. That heterogeneity is the point, since it makes for more generalizable models, but it also means global training loss curves will look noisy. You need per-node loss tracking to distinguish “the model is learning” from “node 3 has an anomalous data slice.”
Debugging remote failures is constrained. Substra provides access to remote logs without exposing the underlying data, which is the right design. But when a training task fails on a remote node, your debugging surface is limited to what those logs capture. In a centralized pipeline, you can reproduce the failure locally. In a federated environment, the data that caused the failure is off-limits.
Reproducibility requires coordination. Federated training results depend on what data was present at each node at training time. If a hospital updates its EHR schema between runs, your model’s behavior may shift in ways that aren’t visible from the orchestrator side. Versioning federated datasets requires each node operator to implement their own data versioning — you can’t impose this from the center.
What to Monitor Differently
If you’re running federated training in Substra, here’s what belongs in your runbook that wouldn’t be there for a centralized job. These additions sit on top of, not instead of, the usual pipeline failure-point and model deployment checks.
Per-node participation rate. Track which nodes successfully completed each round of training. A node that silently drops out of aggregation because of a compute failure will skew your global model toward the remaining participants. Substra’s orchestrator logs task completion; surface this as an alert.
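A minimal sketch of the participation check, assuming you have already extracted, for each round, the set of nodes whose tasks completed. How that set is pulled from the orchestrator is not shown, and the function names, node names, and 0.9 threshold are hypothetical.

```python
from typing import Dict, List, Set


def participation_rates(expected: Set[str], completed_by_round: List[Set[str]]) -> Dict[str, float]:
    """Fraction of rounds in which each expected node completed its task."""
    rounds = len(completed_by_round)
    if rounds == 0:
        raise ValueError("no rounds recorded yet")
    return {
        node: sum(node in done for done in completed_by_round) / rounds
        for node in sorted(expected)
    }


def nodes_to_alert(rates: Dict[str, float], min_rate: float = 0.9) -> List[str]:
    """Nodes whose participation fell below the agreed minimum."""
    return sorted(node for node, rate in rates.items() if rate < min_rate)


# Hypothetical four-round history in which node "c" dropped out twice.
expected_nodes = {"a", "b", "c"}
history = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"a", "b", "c"}]
rates = participation_rates(expected_nodes, history)
alerts = nodes_to_alert(rates)
print(alerts)  # ['c']
```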
Per-node loss divergence. If one node’s training loss is consistently three standard deviations higher than the median, something is wrong — either the local data distribution shifted, or the node is running a different version of the preprocessing code. You can compute this without seeing raw data if each node reports its per-round loss metric.
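A sketch of that check, with one deliberate substitution: it scores each node against the median using the median absolute deviation (MAD) rather than the standard deviation. With only a handful of nodes, a single outlier inflates the standard deviation enough that a three-sigma rule may never fire. The function name, node names, and loss values are hypothetical.

```python
from statistics import median
from typing import Dict, List


def divergent_nodes(losses: Dict[str, float], threshold: float = 3.0) -> List[str]:
    """Flag nodes whose loss sits far above the cross-node median for one round."""
    center = median(losses.values())
    mad = median(abs(v - center) for v in losses.values())
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scale == 0:
        return []  # most nodes report identical losses; this rule cannot score outliers
    return sorted(
        node for node, loss in losses.items() if (loss - center) / scale > threshold
    )


# Hypothetical per-node training losses for a single round.
round_losses = {"n1": 0.42, "n2": 0.45, "n3": 1.90, "n4": 0.40, "n5": 0.44}
flagged = divergent_nodes(round_losses)
print(flagged)  # ['n3']
```

Only a scalar loss per node per round is needed here, so the check is compatible with keeping raw data local.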
Weight update norms. Before aggregation, check the L2 norm of each participant’s weight update. Extreme outliers can indicate corrupted local training (or, in adversarial settings, model poisoning). FedAvg with clipped update norms is a standard mitigation; SubstraFL lets you implement custom aggregation strategies if you need this.
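The norm check and clipping step can be sketched as follows. This shows the arithmetic only; wiring it into a custom SubstraFL aggregation strategy is not shown, and `clip_update`, `max_norm`, and the example vectors are hypothetical.

```python
import math
from typing import List, Sequence


def l2_norm(update: Sequence[float]) -> float:
    """Euclidean norm of a flattened weight update."""
    return math.sqrt(sum(x * x for x in update))


def clip_update(update: Sequence[float], max_norm: float) -> List[float]:
    """Rescale an update so its L2 norm does not exceed max_norm."""
    norm = l2_norm(update)
    if norm == 0.0 or norm <= max_norm:
        return list(update)  # already within bounds
    factor = max_norm / norm
    return [x * factor for x in update]


# Hypothetical update with norm 5.0, clipped to norm 1.0.
raw_update = [3.0, 4.0]
clipped = clip_update(raw_update, max_norm=1.0)
```

Clipping bounds how far any single participant can move the global model in one round; logging the pre-clip norm alongside it keeps the outlier visible for investigation.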
Aggregation latency per round. Federated training rounds block on the slowest participant. Track round-over-round latency; a consistently slow node is either under-provisioned or silently failing and retrying. This shows up in the orchestrator logs but rarely in standard ML experiment tracking dashboards.
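A small sketch of straggler tracking, assuming you can extract each node's task duration per round from the orchestrator logs (the extraction is not shown). The function names and the durations, in seconds, are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def round_latencies(durations_by_round: List[Dict[str, float]]) -> List[float]:
    """A round takes as long as its slowest participant."""
    return [max(durations.values()) for durations in durations_by_round if durations]


def straggler_counts(durations_by_round: List[Dict[str, float]]) -> Counter:
    """Count how often each node was the slowest participant in a round."""
    counts: Counter = Counter()
    for durations in durations_by_round:
        if durations:
            counts[max(durations, key=durations.get)] += 1
    return counts


# Hypothetical durations for three rounds across three nodes.
durations = [
    {"a": 30.0, "b": 95.0, "c": 33.0},
    {"a": 31.0, "b": 120.0, "c": 29.0},
    {"a": 40.0, "b": 38.0, "c": 35.0},
]
latencies = round_latencies(durations)
stragglers = straggler_counts(durations)
```

A node that tops the straggler count round after round is the one to check for under-provisioning or silent retries.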
Model performance stratified by node. After each global model update, evaluate on a held-out validation set at each node and report summary statistics back. If the global model improves overall but degrades for one participant, you’ve made a tradeoff that the aggregate metric hides. This is especially important when participants represent meaningfully different subpopulations.
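As an illustration of how the aggregate can hide a per-node regression, the sketch below compares one validation metric per node before and after a global update. The metric values, sample counts, tolerance, and function names are hypothetical.

```python
from typing import Dict, List


def weighted_mean(metric: Dict[str, float], n_samples: Dict[str, int]) -> float:
    """Sample-weighted average of a per-node metric."""
    total = sum(n_samples[node] for node in metric)
    return sum(metric[node] * n_samples[node] for node in metric) / total


def regressed_nodes(before: Dict[str, float], after: Dict[str, float],
                    tolerance: float = 0.01) -> List[str]:
    """Nodes whose metric dropped by more than the tolerance after the update."""
    return sorted(
        node for node in before
        if node in after and before[node] - after[node] > tolerance
    )


# Hypothetical per-node validation accuracy before and after a global update.
acc_before = {"a": 0.80, "b": 0.78, "c": 0.82}
acc_after = {"a": 0.86, "b": 0.84, "c": 0.77}
val_sizes = {"a": 500, "b": 400, "c": 100}

overall_improved = weighted_mean(acc_after, val_sizes) > weighted_mean(acc_before, val_sizes)
losers = regressed_nodes(acc_before, acc_after)
print(overall_improved, losers)  # True ['c']
```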
Combining with Other Privacy Mechanisms
The Hugging Face post mentions that Substra can be combined with secure enclaves and multi-party computation. This is correct, but it is worth being specific about what each layer adds.
Federated learning alone is not cryptographically private. A sufficiently sophisticated adversary with access to multiple rounds of gradient updates can run gradient inversion attacks to reconstruct training samples. For high-sensitivity data (genomics, detailed medical imaging), federated learning by itself may not meet your threat model.
Differential privacy — adding calibrated noise to gradient updates before sharing — is the standard mitigation. It degrades model accuracy in exchange for formal privacy guarantees. SubstraFL doesn’t implement DP natively at the time of writing; you’d need to wrap your local training step with a library like Opacus (PyTorch) or TensorFlow Privacy. Substra’s auditable log gives you the chain of custody; DP gives you the mathematical bound.
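The clip-then-add-noise idea can be sketched in a few lines. This is an update-level Gaussian mechanism for illustration only: it is not the Opacus or TensorFlow Privacy API (Opacus implements DP-SGD with per-sample gradient clipping), it does no privacy accounting, and `privatize_update`, `clip_norm`, and `noise_multiplier` are hypothetical names with illustrative values.

```python
import math
import random
from typing import List, Sequence


def privatize_update(update: Sequence[float], clip_norm: float,
                     noise_multiplier: float, rng: random.Random) -> List[float]:
    """Clip an update to clip_norm, then add Gaussian noise scaled to that bound."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm  # noise std dev tied to the clipping bound
    return [x * scale + rng.gauss(0.0, sigma) for x in update]


rng = random.Random(0)  # fixed seed only so the example is repeatable
noisy = privatize_update([3.0, 4.0], clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Turning `clip_norm` and `noise_multiplier` into an actual (epsilon, delta) guarantee requires a privacy accountant, which is the part a DP library provides and this sketch omits.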
Secure enclaves (Intel SGX, AMD SEV) add a hardware layer: computations run in attested trusted execution environments that even the node operator can’t inspect. This matters when you distrust the data node operators themselves, not just external attackers. The tradeoff is compute overhead plus platform-specific constraints, such as limited protected memory and restricted instructions inside SGX enclaves.
Production Readiness Signals
Substra claims to be “battle-tested in complex security environments,” and the MELLODDY project is credible evidence: ten pharma companies coordinating production training is not a toy. That said, a few operational signals are worth checking before adopting:
- Does your team have an SLA path for debugging failures on a node you don’t control? You need a defined escalation process with each participant before a production incident surfaces it.
- Can your existing MLflow or W&B experiment tracking ingest per-node metrics, or do you need to build a custom logging layer?
- What happens when a participant leaves mid-training? Substra handles node dropout in FedAvg, but your monitoring needs to flag it before it silently skews the global model.
The Substra SDK and SubstraFL documentation covers the happy path thoroughly. The operational gaps — heterogeneous drift, reproducibility constraints, limited debugging surfaces — are the part worth pressure-testing in a staging environment before committing to a federated architecture for anything high-stakes.
Sources
Creating Privacy Preserving AI with Substra (Hugging Face Blog): Owkin’s walkthrough of Substra covering the federated learning model, the MELLODDY use case, and an interactive demo space comparing federated vs. centralized training on non-uniform data. Good starting point for understanding the system design intent.
Substra Documentation, Architecture Overview: Official docs covering the three-layer architecture (client SDK, server backend + orchestrator, monitoring frontend), the Substra SDK vs. SubstraFL split, and the auditable operation log. The reference for understanding what the framework does and does not handle natively.