Learning Objectives
- Distinguish technical pipeline divergence from statistical performance decay and choose the corresponding diagnostic
- Build a live-monitoring framework that combines data-integrity gates, rolling performance metrics, backtest-to-live realization ratios, and execution-quality tracking
- Apply drift diagnostics to production artifacts, including PSI, K-S tests, SHAP-based feature monitoring, and online concept drift detectors (ADWIN, DDM)
- Design a safe model-update workflow using shadow mode, champion-challenger evaluation, explicit promotion criteria, and tested rollback procedures
- Implement multi-level circuit breakers across trade, strategy, portfolio, and system layers, with clear recovery procedures and logged manual overrides
- Evaluate and right-size the supporting MLOps stack, including feature stores, data versioning and lineage, and model registries
Two Sources of Live Trading Failure
This section establishes the fundamental distinction between technical failures (pipeline divergence, where the same inputs produce different outputs, addressed by the verification procedures in Chapter 25) and statistical failures (a correct implementation whose predictions have decayed due to overfitting, look-ahead bias, regime change, or alpha crowding). The distinction matters operationally: treating statistical decay as a bug wastes debugging time, while treating bugs as decay leads to unnecessary model changes. Four mechanisms of statistical decay are identified, and the monitoring framework in subsequent sections is designed around the assumption that decay will happen and must be detected early enough to respond.
2 notebooks
Performance Monitoring
The section builds a multi-layered monitoring framework starting with data integrity gates that catch silent data defects before they propagate, then rolling metrics (Sharpe ratio, IC, hit rate, drawdown) computed across multiple trailing windows to reveal trends that aggregate statistics hide. Tiered alert thresholds (watch, warning, critical) with defined response actions prevent both under-reaction and alert fatigue, while the backtest-to-live realization ratio tracks whether live performance matches expectations over time. Execution-quality monitoring for slippage, spread costs, fill ratios, and latency is treated as essential alongside model metrics, since worsening execution can masquerade as model decay.
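A rolling-window metric with tiered alerts can be sketched as follows. This is a minimal illustration, not the chapter's exact implementation: the window length, the realization-ratio cutoffs, and the tier names are placeholder choices.

```python
import numpy as np

def rolling_sharpe(returns, window, periods_per_year=252):
    """Annualized Sharpe ratio over the trailing `window` daily returns."""
    r = np.asarray(returns[-window:], dtype=float)
    if len(r) < window or r.std(ddof=1) == 0:
        return None  # not enough history, or degenerate (constant) returns
    return (r.mean() / r.std(ddof=1)) * np.sqrt(periods_per_year)

def alert_tier(live_sharpe, backtest_sharpe):
    """Tier an alert on the backtest-to-live realization ratio.

    The 0.7 / 0.5 / 0.3 cutoffs are illustrative placeholders.
    """
    ratio = live_sharpe / backtest_sharpe
    if ratio >= 0.7:
        return "ok"
    if ratio >= 0.5:
        return "watch"
    if ratio >= 0.3:
        return "warning"
    return "critical"
```

In practice the same rolling computation would run over several trailing windows (e.g. 21, 63, 252 days) so that a short-window deterioration surfaces before the long-window aggregate moves.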
Drift Detection
This section provides diagnostic tools that identify what changed when performance monitoring detects problems, covering three drift types: data drift measured by PSI and K-S tests on input feature distributions, feature drift tracked through SHAP value monitoring that reveals importance shifts even when distributions remain stable, and concept drift detected by ADWIN and DDM algorithms applied to prediction error streams. A four-quadrant diagnostic table crossing drift detection with performance decay guides response: drift with no decay means the model is robust, decay without detected drift means monitoring coverage is incomplete, and both together means retraining on recent data is warranted.
1 notebook
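As one concrete example, a minimal PSI computation for a single feature might look like the sketch below. The bin count and the commonly cited 0.1 / 0.25 interpretation thresholds are conventional defaults, not values fixed by the chapter.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index of one feature: live sample ("actual")
    versus the training reference sample ("expected"). Bin edges are the
    reference quantiles, so each bin holds ~1/bins of the reference mass."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior quantile edges; values below/above fall into the end bins.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]

    def bin_fracs(x):
        idx = np.searchsorted(edges, x, side="right")  # bin index 0..bins-1
        return np.bincount(idx, minlength=bins) / len(x)

    eps = 1e-6  # guard against log(0) on empty bins
    e, a = bin_fracs(expected) + eps, bin_fracs(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))
```

By the usual rule of thumb, PSI below 0.1 indicates a stable distribution, 0.1 to 0.25 moderate drift worth watching, and above 0.25 a material shift that should trigger investigation.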
Safe Model Updates
The section presents a disciplined model-update workflow combining scheduled and triggered retraining, shadow-mode evaluation where challenger models process live data without trading, capital-capped A/B testing with gradual allocation increases, and explicit rollback procedures tested before deployment. Statistical rigor in promotion decisions requires the deflated Sharpe ratio or a bootstrap comparison accounting for estimation error and multiple testing, with minimum effect-size thresholds (0.2-0.3 Sharpe improvement) below which promotion is rejected regardless of statistical significance. When multiple challengers compete, White's Reality Check alongside multiple-testing corrections prevents selection bias from producing false promotion decisions.
1 notebook
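A paired-bootstrap promotion check along these lines can be sketched as follows. The function name, the 0.25 minimum Sharpe improvement, and the 95% one-sided bound are illustrative assumptions, not the chapter's prescribed values.

```python
import numpy as np

def promote(challenger_ret, champion_ret, n_boot=2000, min_effect=0.25,
            conf=0.95, periods_per_year=252, seed=0):
    """Bootstrap the challenger-minus-champion Sharpe difference on paired
    shadow-period daily returns. Promote only if the point estimate clears
    a minimum effect size AND the one-sided lower bound excludes zero."""
    rng = np.random.default_rng(seed)
    champ = np.asarray(champion_ret, dtype=float)
    chall = np.asarray(challenger_ret, dtype=float)

    def sharpe(r):
        return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)

    n = len(champ)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)          # resample the same days jointly
        diffs[i] = sharpe(chall[idx]) - sharpe(champ[idx])

    point = sharpe(chall) - sharpe(champ)
    lower = np.quantile(diffs, 1 - conf)     # one-sided lower confidence bound
    return bool(point >= min_effect and lower > 0.0)
```

Resampling the same days for both strategies (a paired bootstrap) keeps common market shocks aligned, so the comparison isolates the models' relative skill rather than shared market noise.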
Circuit Breakers and Safety
This section addresses sudden failure through four hierarchical circuit breaker levels: per-trade order validation, per-strategy exposure limits, portfolio-wide aggregate risk controls, and system-level infrastructure health monitoring, each operating independently so a breach at any level halts the relevant scope. Loss-based breakers with daily, weekly, and maximum drawdown limits are complemented by position concentration limits, anomaly-based triggers for extreme market conditions, and software circuit breakers using the CLOSED/OPEN/HALF_OPEN state machine pattern to prevent cascading infrastructure failures. Recovery procedures require explicit resume criteria, gradual restart at reduced capacity, and logged manual overrides with time limits and mandatory after-the-fact review.
1 notebook
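The CLOSED/OPEN/HALF_OPEN pattern mentioned above can be sketched as a small state machine. The failure threshold and cooldown are placeholder values, and the injectable clock is only there to make the breaker testable.

```python
import time

class CircuitBreaker:
    """Software circuit breaker guarding a downstream dependency such as a
    market-data feed or order-routing call: trips OPEN after repeated
    failures, then probes recovery through a HALF_OPEN trial call."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Should the next call be attempted?"""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"   # permit a single probe call
                return True
            return False
        return True                        # CLOSED or HALF_OPEN

    def record_success(self):
        self.state, self.failures = "CLOSED", 0

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise trip on the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", self.clock()
```

In live use, `allow()` wraps each call to the protected dependency, and the HALF_OPEN probe gives the system the "gradual restart at reduced capacity" behavior described above rather than resuming at full load.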
MLOps Infrastructure Overview
The section surveys infrastructure tools supporting production governance: feature stores (Feast) for preventing training-serving skew through consistent feature definitions across offline training and online inference, data versioning (DVC) with run manifests for exact reproducibility, and model registries (MLflow) for experiment tracking and staged deployment with auditability for regulatory review under SR 11-7. The right-sizing guidance progresses from minimal stacks suitable for solo practitioners through intermediate setups with Prometheus-Grafana monitoring to mature Kubernetes-based deployments, arguing that the monitoring and safety controls from earlier sections matter more than tooling choices and should be established first.
2 notebooks
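To make the run-manifest idea concrete, here is a minimal hand-rolled sketch of what DVC-style tooling automates: hashing each data input and recording the hashes alongside the configuration and a code reference. The function names and manifest fields are hypothetical, not an API from any of the tools named above.

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path):
    """Content hash of a data file, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_paths, config, code_version, out="run_manifest.json"):
    """Record what a byte-for-byte rerun of training needs: data content
    hashes, the full hyperparameter config, and a code reference
    (e.g. a git commit SHA)."""
    manifest = {
        "code_version": code_version,
        "config": config,
        "data": {str(p): file_sha256(p) for p in data_paths},
    }
    Path(out).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Even this bare-bones version supports the auditability goal: if a regulator or a post-mortem asks which data and parameters produced a deployed model, the answer is a single file checked in next to the training run.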