Chapter 19: Risk Management

Drift Detection and Trigger Design

A risk system that cannot detect when its own inputs have shifted is a system waiting to be surprised -- and the hardest part is not detecting drift but deciding what to do about it.

The Intuition

A deployed trading strategy makes predictions based on learned relationships between inputs and outputs. Those relationships can change. Features that were predictive may lose their signal. Volatility regimes may shift. Correlations may break. When the data-generating process moves away from what the model was trained on, the model's predictions degrade -- sometimes gradually, sometimes abruptly.

Drift detection is the statistical machinery that monitors for these shifts in real time. But detection alone is not enough. A trigger that fires too often generates false alarms that erode returns through unnecessary position changes. A trigger that fires too late misses genuine regime breaks and lets losses accumulate. The design problem is calibrating the trigger to balance sensitivity against specificity, and connecting it to a decision protocol that specifies what happens when the alarm sounds.

The governance layer of a risk system typically introduces drift monitoring and kill switches but compresses the detection mechanics and threshold calibration. This primer fills the gap between "you should monitor for drift" and "here is how to build a monitoring system with known operating characteristics."

What Counts as Drift

Not all distributional changes are the same, and different types require different detection methods and imply different responses:

  • Feature drift: the distribution of model inputs shifts (e.g., a momentum signal's cross-sectional distribution changes). The model may still be correct conditional on the inputs, but the inputs now occupy a region where the model was poorly trained.
  • Concept drift: the relationship between inputs and outputs changes (e.g., momentum becomes negatively predictive after a regime break). The model is wrong even though the inputs look familiar.
  • Target drift: the distribution of the target variable itself changes (e.g., return volatility doubles). Even a well-calibrated model produces outputs that are mis-scaled for the new environment.

The foundational Drift Detection Method (DDM) monitors the error rate of an online learner and triggers an alarm when the error rate increases beyond a statistical threshold [ref:KMEBBQPM]. DDM establishes the two-level framework -- warning plus alarm -- that most subsequent methods adopt. Hinder, Vaquet, and Hammer (2023) provide a comprehensive taxonomy of unsupervised drift detection methods, covering both change-point and anomaly-detection approaches and noting that no single method dominates across all drift types [ref:IG9QK6UY].
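DDM's warning-plus-alarm mechanics are compact enough to sketch directly. The following is a simplified Python sketch of the two-level scheme, not the published implementation: it tracks the running error rate $p$ and its binomial standard deviation $s$, warns when $p + s$ exceeds the best-seen $p_{\min} + 2 s_{\min}$, and alarms at $p_{\min} + 3 s_{\min}$ (the 30-observation warm-up and the min-tracking rule are common conventions, here treated as assumptions):

```python
import math

class DDM:
    """Simplified sketch of the Drift Detection Method's two-level scheme."""

    WARMUP = 30  # observations before the chart is trusted

    def __init__(self):
        self.n = 0
        self.p = 0.0                    # running error rate
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model's prediction was wrong, else 0.
        Returns 'alarm', 'warning', or None."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        # Track the best operating point seen so far (after warm-up).
        if self.n >= self.WARMUP and self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.n < self.WARMUP:
            return None
        if self.p + s > self.p_min + 3.0 * self.s_min:
            return "alarm"
        if self.p + s > self.p_min + 2.0 * self.s_min:
            return "warning"
        return None

# Stable phase: a steady 10% error rate; degraded phase: the model
# is always wrong, so the error rate climbs and the chart alarms.
detector = DDM()
stable = [detector.update(e) for e in ([0] * 9 + [1]) * 30]
degraded = [detector.update(1) for _ in range(100)]
```

The key property is that both thresholds are relative to the model's own best historical error rate, so the chart adapts to whatever baseline accuracy the model actually achieves.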

Core Detection Methods

Population Stability Index (PSI)

PSI measures the divergence between a reference distribution and a recent window using a binned, symmetrized variant of KL divergence (the Jeffreys divergence):

$$\text{PSI} = \sum_{i=1}^{B} (p_i - q_i) \ln \frac{p_i}{q_i}$$

where $p_i$ is the proportion of observations in bin $i$ for the reference window and $q_i$ is the proportion for the recent window. PSI is simple, interpretable, and widely used in credit and insurance model monitoring. Its main limitation is sensitivity to binning choices -- different bin boundaries can produce materially different PSI values for the same underlying shift.
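Given the binning caveat, PSI is straightforward to compute. A minimal pure-Python sketch, using reference-quantile bin edges and a small floor to avoid $\log 0$ (both common conventions, assumed here rather than mandated by the definition):

```python
import math
import random

def psi(ref, recent, bins=10, floor=1e-6):
    """Population Stability Index between a reference and a recent sample.
    Bin edges come from the reference sample's quantiles, so each
    reference bin holds roughly equal mass."""
    ref_sorted = sorted(ref)
    edges = [ref_sorted[int(i * (len(ref_sorted) - 1) / bins)]
             for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x > e)] += 1  # bin index
        return [max(c / len(sample), floor) for c in counts]

    p, q = proportions(ref), proportions(recent)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(5000)]
stable = [rng.gauss(0, 1) for _ in range(5000)]     # same distribution
shifted = [rng.gauss(1, 1) for _ in range(5000)]    # one-sigma location shift
```

Rerunning with a different `bins` value changes the numbers, which is exactly the bin-sensitivity limitation noted above.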

Kolmogorov-Smirnov (K-S) test

A nonparametric comparison of two empirical CDFs. The test statistic is the maximum absolute difference between the two CDFs:

$$D = \sup_x |F_{\text{ref}}(x) - F_{\text{recent}}(x)|$$

K-S makes no distributional assumptions and produces well-calibrated p-values for continuous data. Its main limitation is low power against changes concentrated in the tails -- it is most sensitive to location and scale shifts, less so to tail-shape changes.
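The statistic itself is a single merge pass over the two sorted samples. A small sketch (statistic only; for p-values one would use the asymptotic Kolmogorov distribution or an existing implementation such as `scipy.stats.ks_2samp`):

```python
import random

def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest vertical gap between
    the two empirical CDFs, found by merging the sorted samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

rng = random.Random(1)
ref = [rng.gauss(0, 1) for _ in range(4000)]
same = [rng.gauss(0, 1) for _ in range(4000)]
moved = [rng.gauss(0.5, 1) for _ in range(4000)]  # half-sigma location shift
```

For the half-sigma shift the theoretical gap is about 0.20, comfortably above the sampling noise of two same-distribution windows of this size.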

CUSUM (Cumulative Sum)

A sequential monitoring method that accumulates deviations from a target:

$$C_t = \max(0,\; C_{t-1} + x_t - \mu_0 - k)$$

where $\mu_0$ is the target mean (e.g., the expected forecast error), $k$ is an allowance parameter (slack), and an alarm fires when $C_t > h$ (the decision threshold). CUSUM is designed for early detection of persistent shifts and is tunable: larger $k$ reduces sensitivity to small shifts, larger $h$ delays detection but reduces false alarms.
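The recursion above translates directly into code. A minimal sketch of a one-sided (upper) CUSUM that reports the first alarm time (a symmetric lower-side chart would be run in parallel to catch downward shifts):

```python
def cusum_first_alarm(xs, mu0=0.0, k=0.5, h=4.0):
    """Run a one-sided CUSUM over xs; return the index of the first
    alarm (C_t > h), or None if the chart never signals."""
    c = 0.0
    for t, x in enumerate(xs):
        c = max(0.0, c + x - mu0 - k)  # accumulate deviations above mu0 + k
        if c > h:
            return t
    return None

# A persistent +1.5 shift beginning at t = 50 accumulates 1.0 per step
# past the allowance k = 0.5, so the chart crosses h = 4 five steps in.
series = [0.0] * 50 + [1.5] * 20
```

On the shifted series the alarm fires at index 54, five observations after the shift; on a flat series it never fires.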

CUSUM-based structural-break detection appears in Lopez de Prado's financial data-structure work, and Lipton and Lopez de Prado (2020) use the COVID-19 episode to illustrate how models calibrated to stale regimes fail catastrophically when structural breaks arrive [ref:ATYWAQKL].

ADWIN (Adaptive Windowing)

ADWIN maintains a variable-length window and shrinks it when a statistically significant change is detected between the older and newer portions of the window. It adapts automatically to drift speed without requiring a fixed reference window, making it useful for non-stationary streams where the baseline itself evolves.
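The full ADWIN algorithm compresses its window into exponential buckets for efficiency, but the core idea fits in a short sketch. The simplified version below assumes inputs bounded in [0, 1] and uses a naive O(n) split search instead of ADWIN's bucket structure; it shrinks the window whenever some older/newer split fails a Hoeffding-style mean test:

```python
import math

class SimpleAdwin:
    """Pared-down sketch of ADWIN: keep a growing window and drop its
    older portion whenever an older/newer split shows a mean difference
    larger than a Hoeffding-style bound. `delta` sets the per-check
    false-alarm probability; inputs are assumed bounded in [0, 1]."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = []

    def update(self, x):
        """Add one observation; return True if a change was detected."""
        self.window.append(x)
        n = len(self.window)
        total = sum(self.window)
        head_sum = 0.0
        for split in range(1, n):            # older part = window[:split]
            head_sum += self.window[split - 1]
            n0, n1 = split, n - split
            mu0, mu1 = head_sum / n0, (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic sample size
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mu0 - mu1) > eps:
                self.window = self.window[split:]  # keep only newer data
                return True
        return False

det = SimpleAdwin()
calm = [det.update(0.0) for _ in range(200)]     # constant stream: no change
moved = [det.update(1.0) for _ in range(50)]     # level shift: change detected
```

Production implementations (e.g., the bucketed version in streaming-ML libraries) achieve logarithmic memory and amortized cost per update; the quadratic version here is only for exposition.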

Method Comparison

Method | Distributional assumptions | Detects | Strengths | Limitations
PSI | None (binned) | Location, scale, shape | Simple, interpretable | Bin-sensitive
K-S | None (continuous CDF) | Location, scale | Well-calibrated p-values | Weak on tail changes
CUSUM | Known target mean | Persistent mean shifts | Fast detection, tunable | Requires target specification
ADWIN | None (adaptive) | Any persistent change | Self-adapting window | Computationally heavier

Trigger Threshold Calibration

The fundamental tradeoff is between detection delay (how long after a genuine shift the trigger fires) and false alarm rate (how often it fires when nothing has changed).

A practical calibration procedure:

  1. Set the false-alarm budget. Decide a tolerable false-alarm rate -- for example, one false alarm per quarter for a daily monitoring system.
  2. Derive the threshold. For CUSUM, choose $h$ such that the expected run length under the null (no drift) matches the desired false-alarm interval. For K-S, set the p-value threshold to the corresponding significance level. For PSI, common thresholds are 0.1 (minor shift) and 0.25 (material shift), though these should be calibrated to the specific application.
  3. Measure detection delay on historical breaks. Apply the calibrated trigger to known regime changes in the historical data and measure how quickly the alarm fires. If the delay is too long for the strategy's risk tolerance, tighten the threshold and accept a higher false-alarm rate.
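Step 2 can be made concrete for CUSUM by Monte Carlo: simulate the monitored series under the null, measure the average run length to a false alarm (ARL0), and pick the smallest $h$ whose ARL0 meets the budget. A sketch under the simplifying assumption of i.i.d. N(0, 1) inputs; a real calibration would simulate from a null model that matches the monitored series' dependence structure:

```python
import random

def arl0(k, h, trials=200, horizon=20000, seed=0):
    """Monte Carlo average run length to a false alarm for a one-sided
    CUSUM on i.i.d. N(0,1) input (the no-drift null)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        c = 0.0
        for t in range(1, horizon + 1):
            c = max(0.0, c + rng.gauss(0.0, 1.0) - k)
            if c > h:
                total += t
                break
        else:
            total += horizon  # censored run: biases ARL0 downward
    return total / trials

# Larger h buys a longer expected time between false alarms at the
# cost of slower detection; sweep h and match ARL0 to the budget.
```

For a daily monitor with a one-false-alarm-per-quarter budget, the target ARL0 is roughly 60-65 observations, and the sweep selects the corresponding $h$.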

Zhang, Guo, and Cao (2020) provide a concrete example: they monitor a stock-selection model's Information Coefficient using rolling-window mean tests and a binomial control chart that counts the proportion of recent windows where IC falls below a threshold [ref:ZWTK9SKK]. This dual-test approach -- one for level shift, one for deterioration frequency -- reduces false alarms while maintaining sensitivity to genuine signal decay.

Worked Example: CUSUM for Volatility-Forecast Monitoring

Suppose a risk system uses a volatility forecast $\hat{\sigma}_t$ and monitors the squared forecast error $e_t = (r_t / \hat{\sigma}_t)^2 - 1$, which should average zero if the forecast is well-calibrated.

Set up a CUSUM chart with target $\mu_0 = 0$, allowance $k = 0.5$, and threshold $h = 4$. Under normal conditions, $C_t$ fluctuates near zero. When the true volatility persistently exceeds the forecast -- as happens at the onset of a crisis -- $e_t$ becomes consistently positive, $C_t$ accumulates, and the alarm fires when $C_t > 4$.

On a historical backtest covering the February-March 2020 COVID volatility spike, this configuration might fire 5-8 trading days into the episode -- fast enough to trigger a risk review but not so sensitive that it alarms during routine volatility fluctuations.
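A deterministic, stylized version of this setup shows the mechanics. The series below is idealized so the alarm time can be computed exactly; real, noisy calibration errors would fire later, consistent with the 5-8 day figure above:

```python
def first_alarm(errors, mu0=0.0, k=0.5, h=4.0):
    """Index of the first CUSUM alarm on a calibration-error series."""
    c = 0.0
    for t, e in enumerate(errors):
        c = max(0.0, c + e - mu0 - k)
        if c > h:
            return t
    return None

# Stylized series: 250 calm days where the forecast is exact
# (|r_t| = sigma_hat, so e_t = 1 - 1 = 0), then a crisis where realized
# moves are twice the forecast (e_t = 2^2 - 1 = 3).
errors = [0.0] * 250 + [3.0] * 10
alarm = first_alarm(errors)
```

Each crisis day adds $3 - 0.5 = 2.5$ to the chart, so the alarm fires on the second crisis day (index 251), while the calm period never accumulates at all.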

Actionable Escalation

Detection is useless without a decision protocol. Define escalation levels tied to trigger severity:

  • Amber: flag for review, no position change. The monitoring system has detected a possible shift but the evidence is not yet conclusive.
  • Orange: reduce risk exposure to a pre-specified fraction. The drift signal is persistent and confirmed across multiple monitors.
  • Red: pause new entries, flatten to a defined safe state. The evidence indicates a structural break that invalidates the model's operating assumptions.

Each level must specify who decides, what evidence reverses the trigger, and how long the response persists before re-evaluation. This primer covers the detection and trigger design; the organizational decision rights -- who acts, escalation ladders, reinstatement conditions -- are a separate governance problem.
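One way to make the protocol executable is a small rule table mapping monitor outputs to levels. Every threshold, fraction, and combining rule below is an illustrative placeholder, not a recommendation; a production system would encode its own governance policy:

```python
# Illustrative response table; the exposure fraction is a placeholder.
ACTIONS = {
    "amber": "flag for review; no position change",
    "orange": "scale gross exposure to a pre-specified fraction",
    "red": "pause new entries; flatten to the defined safe state",
}

def escalation_level(psi_value, cusum_alarm, n_confirming_monitors):
    """Map monitor outputs to an escalation level.
    All thresholds here are hypothetical placeholders."""
    if cusum_alarm and n_confirming_monitors >= 2:
        return "red"       # persistent break confirmed across monitors
    if psi_value > 0.25 and n_confirming_monitors >= 1:
        return "orange"    # material shift with at least one confirmation
    if psi_value > 0.1 or cusum_alarm:
        return "amber"     # possible shift, evidence not yet conclusive
    return None
```

The point of encoding the ladder is that the response to an alarm is decided in advance, not negotiated under stress.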

Common Mistakes

  • Monitoring only one type of drift. Feature drift can mask concept drift: the inputs may look stable while the input-output relationship has changed.
  • Using a fixed reference window that is never updated. If the strategy is genuinely adaptive, the reference distribution should evolve too.
  • Setting thresholds by looking at the full backtest and choosing the level that "would have caught" each historical break. This is in-sample optimization of the trigger itself.
  • Treating drift detection as a substitute for stress testing. Drift detection identifies gradual shifts; stress testing evaluates the impact of sudden shocks. Both are needed.

Connections

  • Book chapter: Chapter 19, which introduces drift monitoring and kill switches within the governance layer.
  • Related primers: Leakage-Safe Adaptive Risk Controls (Ch19/07) for ensuring the drift detector's own inputs are correctly time-indexed; Volatility Forecasting Mechanics (Ch19/02) for the forecasts being monitored; Forecast Evaluation with Noisy Volatility Proxies (Ch19/05) for evaluating whether forecast degradation is genuine or noise.
