Chapter 9: Model-Based Feature Extraction

Structural Break Diagnostics and Time-Since-Break Features

A break test is not asking whether the series is "bad." It is asking whether one stable model is still a reasonable description of the whole sample.

The Intuition

A time series can look unstable for at least three different reasons:

  • it truly has a unit root
  • it is stationary but hit by one or more structural breaks
  • it moves across recurring regimes rather than one permanent shift

Those cases are not interchangeable. Structural-break diagnostics focus on the middle one:

did the data-generating process change in a discrete way, enough that a single stable model is no longer a good summary?

For feature engineering, break outputs are not just tests. They become usable features:

  • time since last break
  • current break score
  • break probability
  • pre/post-break mean or variance contrast

That is the bridge from econometric theory, through ML aggregation, to practitioner hacks.

Breaks Versus Unit Roots

A unit-root process drifts because shocks accumulate without full mean reversion. A broken stationary process looks unstable because its mean, trend, or variance changed at one or more dates.

That distinction matters because the remedy changes:

  • unit-root story: difference or transform the series
  • break story: keep the series but condition on the break history

This is why a break-aware diagnostic layer often matters more than repeating ADF/KPSS alone.

Classical One-Break Logic

The cleanest theory case is the Chow test when the break date is known in advance. Suppose

$$ y_t = x_t^\top \beta + \varepsilon_t, $$

and you want to test whether the coefficients are the same before and after date \(\tau\). Then the null is

$$ H_0: \beta^{\text{pre}} = \beta^{\text{post}}. $$

This gives the basic regression view of a structural break: the model parameters changed.

In real trading data, the date is usually not known. That is where endogenous-break tests come in.
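The known-date case can be sketched directly. The following is a minimal illustration for the intercept-only model (a mean shift), not a full regression implementation: the F statistic compares the pooled fit against separate pre- and post-break fits. The series and break date here are made up for the example.

```python
# Chow-style F test for a known break date, sketched for the
# intercept-only model y_t = mu + eps_t.

def rss(xs):
    """Residual sum of squares around the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def chow_f_stat(y, tau, k=1):
    """F statistic comparing pooled vs split fits at known break date tau.

    k is the number of parameters per regime (1 for the mean-only model).
    """
    pre, post = y[:tau], y[tau:]
    rss_pooled = rss(y)
    rss_split = rss(pre) + rss(post)
    num = (rss_pooled - rss_split) / k
    den = rss_split / (len(y) - 2 * k)
    return num / den

# A series with an obvious mean shift at t = 50, plus tiny
# deterministic "noise" so the RSS terms are nonzero.
y = [0.0] * 50 + [3.0] * 50
y = [v + 0.01 * ((i % 7) - 3) for i, v in enumerate(y)]
print(chow_f_stat(y, 50))  # very large F: strong evidence of a break
```

Under the null of no break, this statistic follows an F distribution with \(k\) and \(n - 2k\) degrees of freedom, which is where the critical values come from.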

Zivot-Andrews: One Break Chosen from the Data

Zivot-Andrews extends the unit-root framework by allowing one break date to be selected endogenously. The practical question is:

does this series look like a unit root, or like a trend-stationary process with one major break?

The null is a unit root with no break. The alternative is a trend-stationary process with one endogenous break in level, trend, or both. That is useful when a single large dislocation is plausible. It is not a general solution for long samples with repeated instability.

Because the break date is searched over rather than fixed in advance, the critical values are not the same as in a standard ADF test.

The most important output for feature engineering is not the p-value alone. It is the candidate break date and the implied interpretation:

  • broken trend
  • broken level
  • or still plausibly unit-root-like
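The "choose the break from the data" idea can be sketched as a grid search: for each candidate date, fit a regression with a constant, a trend, and a level-shift dummy, and keep the date where the shift's t-statistic is largest. This is only an illustration of the endogenous search; the actual Zivot-Andrews test embeds the dummies in an ADF regression and uses its own critical-value tables (recent versions of statsmodels ship `statsmodels.tsa.stattools.zivot_andrews` for the real test). The simulated series below is an assumption for the example.

```python
# Endogenous break-date search, sketched: scan candidate dates and
# record the level-shift dummy's t-statistic at each one. This is NOT
# the Zivot-Andrews test itself, just the search idea behind it.
import numpy as np

def best_level_break(y, trim=0.15):
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n)
    lo, hi = int(trim * n), int((1 - trim) * n)  # trimming rule
    best = (None, 0.0)  # (candidate date, t-stat of the shift dummy)
    for tau in range(lo, hi):
        dummy = (t >= tau).astype(float)
        X = np.column_stack([np.ones(n), t, dummy])
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)
        tstat = beta[2] / np.sqrt(cov[2, 2])
        if abs(tstat) > abs(best[1]):
            best = (tau, tstat)
    return best

# Trend-stationary series with a level shift at t = 120.
rng = np.random.default_rng(0)
y = 0.05 * np.arange(200) + 2.0 * (np.arange(200) >= 120) \
    + rng.normal(0, 0.3, 200)
tau_hat, tstat = best_level_break(y)
```

Because the date maximizing the statistic was itself searched over, comparing that statistic to standard tables would overstate significance; that is exactly why the real test has its own critical values.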

Bai-Perron: Multiple Break Segmentation

When longer samples contain several changes, Bai-Perron style segmentation is more useful. The idea is to partition a linear model into segments and fit stable parameter values within each segment:

$$ y_t = x_t^\top \beta_j + \varepsilon_t, \qquad t \in (\tau_{j-1}, \tau_j]. $$

The method searches for break dates \(\tau_1, \dots, \tau_m\) that materially reduce within-segment residual variation, subject to minimum segment lengths and model penalties. In practice, the important knobs are the trimming rule for minimum segment length, the criterion for choosing the number of breaks, and whether you use a global segmentation or a simpler sequential search.
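The global segmentation above can be sketched with dynamic programming for the mean-only model: choose break dates that minimize total within-segment SSE, subject to a minimum segment length, with a flat penalty per extra segment standing in for a formal information criterion. The penalty value and data are assumptions for the example.

```python
# Bai-Perron-style segmentation, sketched for the mean-only model.
# min_len plays the role of the trimming rule; `penalty` stands in
# for the criterion that chooses the number of breaks.

def segment_means(y, max_breaks=3, min_len=10, penalty=5.0):
    n = len(y)
    s, s2 = [0.0], [0.0]          # prefix sums for O(1) segment SSE
    for v in y:
        s.append(s[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):  # SSE of y[i:j] around its own mean
        m = j - i
        mean = (s[j] - s[i]) / m
        return (s2[j] - s2[i]) - m * mean * mean

    INF = float("inf")
    # cost[k][j]: best total SSE covering y[:j] with k segments
    cost = [[INF] * (n + 1) for _ in range(max_breaks + 2)]
    back = [[0] * (n + 1) for _ in range(max_breaks + 2)]
    for j in range(min_len, n + 1):
        cost[1][j] = sse(0, j)
    for k in range(2, max_breaks + 2):
        for j in range(k * min_len, n + 1):
            for i in range(min_len * (k - 1), j - min_len + 1):
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j], back[k][j] = c, i
    # Choose the number of segments by penalized SSE.
    best_k = min(range(1, max_breaks + 2),
                 key=lambda k: cost[k][n] + penalty * (k - 1))
    breaks, j = [], n
    for k in range(best_k, 1, -1):
        j = back[k][j]
        breaks.append(j)
    return sorted(breaks)

y = [0.0] * 40 + [5.0] * 40 + [1.0] * 40
print(segment_means(y))  # → [40, 80]
```

Real implementations (e.g. the `ruptures` change-point library) use the same dynamic-programming structure with richer cost functions and principled penalties.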

Why this is useful:

  • one-break tests force a false either/or
  • multi-break segmentation acknowledges that financial history often changes more than once

Useful features from this layer:

  • most recent estimated break date
  • number of breaks in the trailing history
  • contrast between pre- and post-break parameters
  • time since last break

Online Monitoring: CUSUM and Detection Delay

Retrospective segmentation uses the full sample. Live trading cannot.

For online monitoring, the problem is:

is instability accumulating right now, enough to distrust the old model?

CUSUM-style statistics accumulate deviations from a reference level or residual process. In spirit:

$$ C_t = \sum_{s=1}^t e_s, $$

where \(e_s\) is a centered residual or forecast error. A sustained drift in one direction pushes the statistic away from zero.

Practical CUSUM implementations normalize those residuals and compare the running statistic to a boundary function rather than a naive fixed threshold, otherwise even stable noise can drift enough to create false triggers.
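A minimal monitor along those lines can be sketched as follows; the baseline parameters and the boundary constant are illustrative assumptions, not canonical choices.

```python
# Normalized CUSUM monitor, sketched: center each observation by a
# baseline mean, scale by a baseline std, accumulate, and compare the
# running statistic to a boundary that grows with sqrt(t), so stable
# noise alone does not drift into a trigger.
import math

def cusum_alarm(y, mu0, sigma0, c=3.0):
    """Return the first index where |C_t| exceeds c * sqrt(t), else None."""
    C = 0.0
    for t, v in enumerate(y, start=1):
        C += (v - mu0) / sigma0
        if abs(C) > c * math.sqrt(t):
            return t - 1  # 0-based index of the triggering observation
    return None

# Mean shifts from 0 to 1 (in sigma units) at index 50: the alarm only
# fires once enough post-shift evidence has accumulated.
y = [0.0] * 50 + [1.0] * 100
print(cusum_alarm(y, mu0=0.0, sigma0=1.0))  # → 76, i.e. ~26 steps late
```

The 26-step lag in the example is the detection delay discussed next: a smaller `c` would fire sooner but would also fire on stable noise more often.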

The key concept here is detection delay:

  • retrospective methods can place a break near where it truly happened
  • online detectors only trigger after enough evidence accumulates

In practice you choose a boundary or threshold, which creates the usual tradeoff:

  • lower threshold: faster detection, more false alarms
  • higher threshold: slower detection, fewer false alarms

That delay is not a failure. It is the price of causal detection.

ML Break Detection

When no single classical test works well, break detection can be reframed as a supervised problem.

Construct weak indicators of instability such as:

  • rolling mean shift
  • rolling variance shift
  • residual autocorrelation changes
  • CUSUM scores
  • distributional distance between recent and baseline windows

Then let a classifier aggregate them into an instability score or a calibrated break-probability feature.

This is where competition-style workflows become useful: no one statistic wins everywhere, so the operational solution is often an ensemble of weak detectors rather than a doctrinal commitment to one test.
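One simple way to aggregate weak detectors is sketched below: compute a few window-contrast statistics, squash each into [0, 1], and average them into a single instability score. In a real workflow a trained classifier would replace the fixed squashing and equal weights; the window sizes and signals here are illustrative assumptions.

```python
# Aggregating weak instability indicators into one score (a sketch;
# a trained classifier would normally replace the hand-set weights).
import math
import statistics

def mean_shift_z(recent, baseline):
    sd = statistics.pstdev(baseline) or 1e-9
    return abs(statistics.mean(recent) - statistics.mean(baseline)) / sd

def var_ratio(recent, baseline):
    b = statistics.pvariance(baseline) or 1e-9
    return statistics.pvariance(recent) / b

def instability_score(y, recent_len=20):
    recent, baseline = y[-recent_len:], y[:-recent_len]
    signals = [
        mean_shift_z(recent, baseline),                      # level change
        abs(math.log(var_ratio(recent, baseline) + 1e-9)),   # scale change
    ]
    # Squash each signal with a logistic and average them.
    squashed = [1 / (1 + math.exp(-(s - 1.0))) for s in signals]
    return sum(squashed) / len(squashed)

calm = [math.sin(i / 5) * 0.1 for i in range(200)]
shifted = calm[:180] + [v + 2.0 for v in calm[180:]]
print(instability_score(calm) < instability_score(shifted))  # → True
```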

The hard part is labels. In practice they often come from weak supervision, retrospective segmentations, synthetic breaks, or domain-event heuristics rather than a clean supervised target.

The output is not "the truth break date." It is a causal probability that recent data no longer look like the historical baseline.

A Worked Comparison

Consider three series:

  1. a stable AR(1) process
  2. the same AR(1) with a sudden mean shift
  3. the same AR(1) with a gradual volatility ramp

Expected behavior:

  • Zivot-Andrews reacts strongly to the mean-shift case
  • Bai-Perron likely segments the shifted series cleanly and may also split the volatility-ramp case
  • CUSUM rises gradually in the ramp case and triggers only after evidence accumulates
  • an ML detector that combines mean, variance, and residual features may catch both the sharp shift and the slow instability better than any single classical test

For pure variance breaks, you would usually add variance-specific diagnostics such as CUSUM of squares or other variance-change tests rather than relying on mean-shift tools alone.

This is why the methods should be viewed as complements, not rivals.

From Diagnostics to Features

This is the part people often skip.

A break layer becomes useful when it emits stable downstream features such as:

  • time_since_break
  • break_count_lookback
  • cusum_score
  • instability_score
  • mean_shift_magnitude
  • vol_shift_magnitude

These are usually better as conditioning variables than as direct trading signals. A downstream model can learn that momentum behaves differently just after a break than six months later.
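Emitting those conditioning features from break output can be sketched directly; the estimated break indices could come from any of the detectors above, and the convention for "no break seen yet" is an assumption.

```python
# From estimated break dates to per-step conditioning features.

def break_features(n, break_idx, lookback=100):
    feats = []
    for t in range(n):
        past = [b for b in break_idx if b <= t]
        # Convention (an assumption): t + 1 when no break has been seen.
        time_since = t - past[-1] if past else t + 1
        count = sum(1 for b in past if t - b < lookback)
        feats.append({"time_since_break": time_since,
                      "break_count_lookback": count})
    return feats

feats = break_features(10, break_idx=[3, 7], lookback=5)
print(feats[9])  # → {'time_since_break': 2, 'break_count_lookback': 1}
```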

Practitioner Hacks

Real desks rarely run pure textbook break econometrics and stop there. Common hacks include:

  • require a minimum effect size before honoring a break
  • combine multiple weak detectors rather than trusting one p-value
  • debounce alerts so a detector must stay elevated for several observations
  • reset or shorten lookbacks after a confirmed break
  • maintain days_since_last_alert even when the exact break date is uncertain
  • track false-alarm rates explicitly, because repeated monitoring creates alert fatigue fast

These hacks are not elegant, but they often solve the real operational problem: reacting to instability without whipsawing on noise.
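The debounce hack in particular is a few lines: only honor an alert once the raw detector score has stayed above threshold for several consecutive observations. The threshold and persistence values here are tuning assumptions.

```python
# Debounced alerting, sketched: a single spike never fires;
# a sustained run does.

def debounced_alerts(scores, threshold=1.0, persist=3):
    alerts, run = [], 0
    for s in scores:
        run = run + 1 if s > threshold else 0
        alerts.append(run >= persist)
    return alerts

scores = [0.2, 1.5, 0.3, 1.2, 1.4, 1.6, 1.1, 0.5]
print(debounced_alerts(scores))
# → [False, False, False, False, False, True, True, False]
```

The isolated spike at index 1 is ignored; the run starting at index 3 fires on its third consecutive elevated observation.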

What Not to Do

Two common errors:

  1. Treat the estimated break date as if it were exact.
  2. Use retrospective break segmentation directly as a live signal without acknowledging detection delay.

A break detector is best understood as a noisy state-of-instability sensor, not a divine timestamp oracle.

In Practice

Use the layers together:

  • one-break logic when one event is plausible
  • multi-break segmentation for long historical structure
  • CUSUM or related monitors for live detection
  • ML aggregation when instability shows up in several weak ways at once

And turn the outputs into conditioning features rather than overconfident labels.

Common Mistakes

  • Confusing a broken stationary process with a unit root.
  • Treating retrospective break dates as if they were live-safe.
  • Running repeated break tests without thinking about alert fatigue and multiple monitoring.
  • Using mean-break tools as if they automatically detect variance breaks too.
  • Assuming one test family is enough for all kinds of instability.
  • Using break detection only as a preprocessing decision instead of as a feature source.

Connections

  • Book chapters: Ch09 Model-Based Feature Extraction
  • Related primers: stationarity-tests.md, hidden-markov-models-and-regime-detection.md
  • Why it matters next: structural-break features connect directly to stationarity tests, regime models, walk-forward drift monitoring, and the broader practical question of how to notice that the old model of the world has stopped being a good one
