This notebook demonstrates the complete diagnostic workflow for financial time series: visual inspection, stationarity tests, autocorrelation analysis, and rolling diagnostic features. Uses etfs, macro data.
02 Structural Breaks
This notebook demonstrates classical and ML-based methods for detecting structural breaks in financial time series. Uses etfs data.
03 Fractional Differencing
This notebook demonstrates fractional differentiation (FFD), a technique that achieves stationarity while preserving as much memory as possible. Uses etf_data, etfs data.
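The FFD weight recursion is compact enough to sketch directly. A minimal NumPy version (function names and the 1e-4 truncation threshold are illustrative choices, not the notebook's exact code):

```python
import numpy as np

def ffd_weights(d, threshold=1e-4, max_size=1000):
    """Fixed-width fractional differencing weights, truncated when |w| < threshold."""
    w = [1.0]
    for k in range(1, max_size):
        w_k = -w[-1] * (d - k + 1) / k
        if abs(w_k) < threshold:
            break
        w.append(w_k)
    return np.array(w)

def frac_diff(series, d, threshold=1e-4):
    """Apply FFD to a 1-D array; output is shorter by len(weights) - 1."""
    w = ffd_weights(d, threshold)
    # 'valid'-mode convolution slides the reversed kernel, giving
    # out[t] = sum_k w[k] * series[t - k] at every fully-overlapping t.
    return np.convolve(series, w, mode="valid")

w = ffd_weights(0.5)  # w[0] = 1.0, w[1] = -0.5, w[2] = -0.125
```

Smaller d keeps more memory (slower weight decay) at the cost of a less stationary output; the notebook's job is to pick the smallest d that passes a stationarity test.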
04 Kalman Filter
This notebook demonstrates the Kalman filter as a production feature extractor: level estimation, trend detection, innovation (surprise) signals, and dynamic hedge ratio estimation. Uses etfs data.
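As a sketch of the idea, a one-dimensional local-level filter already yields two of the features named above: the smoothed level and the innovation (surprise) series. Noise variances q and r are assumed known here, whereas a production pipeline would estimate them:

```python
import numpy as np

def kalman_level(y, q=1e-4, r=1e-2):
    """Local-level Kalman filter: returns level estimates and innovations.

    q: process-noise variance, r: observation-noise variance (assumed known).
    """
    n = len(y)
    level = np.zeros(n)
    innov = np.zeros(n)
    x, p = y[0], 1.0          # start at the first observation, wide variance
    for t in range(n):
        p = p + q             # predict: the level follows a random walk
        innov[t] = y[t] - x   # innovation = observation minus prediction
        k = p / (p + r)       # Kalman gain
        x = x + k * innov[t]  # update state toward the observation
        p = (1 - k) * p       # update state variance
        level[t] = x
    return level, innov
```

The ratio q/r controls responsiveness: larger q tracks the price faster, smaller q smooths harder.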
05 Spectral Features
This notebook demonstrates frequency-domain feature engineering: wavelet decomposition for multi-resolution analysis, rolling FFT for production spectral features, and Welch's method for robust power spectral density estimation. Uses etfs data.
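A minimal rolling-FFT feature, assuming evenly sampled data and an illustrative window length; Welch's method and wavelets in the notebook refine the same idea:

```python
import numpy as np

def dominant_frequency(window, fs=1.0):
    """Dominant frequency of a window via the real-FFT power spectrum."""
    x = window - window.mean()             # remove DC so bin 0 doesn't dominate
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(power)]

def rolling_dominant_frequency(series, window=64, step=8, fs=1.0):
    """Slide a window over the series; record the dominant frequency in each."""
    return np.array([
        dominant_frequency(series[i:i + window], fs)
        for i in range(0, len(series) - window + 1, step)
    ])
```

Frequency resolution is fs/window, so the window length trades spectral precision against responsiveness to regime change.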
06 Path Signatures
> Docker required: This notebook uses esig, an x86-only package not included in the default environment. Run with:
>
> ```bash
> docker compose --profile compat run --rm compat python 09_model_based_features/06_path_signatures.py
> ```

Uses etfs data.
07 ARIMA Features
This notebook demonstrates ARIMA as a feature extractor rather than a standalone forecaster. The key outputs — residuals, forecast values, and forecast uncertainty — feed into downstream ML pipelines.
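As a deliberately simplified stand-in for a full ARIMA fit, an OLS-estimated AR(1) already produces the feature types named above: residuals, a point forecast, and (via the residual dispersion) a forecast-uncertainty proxy. Names are illustrative:

```python
import numpy as np

def ar1_features(y):
    """Fit y_t = c + phi * y_{t-1} + e_t by OLS.

    Returns (phi, residuals, one-step forecast); residuals are the model
    surprises that feed downstream ML, and resid.std() proxies uncertainty.
    """
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    c, phi = coef
    resid = y[1:] - (c + phi * y[:-1])
    forecast = c + phi * y[-1]            # one-step-ahead point forecast
    return phi, resid, forecast
```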
08 GARCH Volatility
This notebook extracts volatility features from GARCH family models: conditional volatility, persistence parameters, and leverage effects. Uses etfs, symbol_returns data.
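The conditional-volatility feature comes from a one-line recursion once parameters are known. This sketch assumes omega, alpha, beta are given; in practice they are fitted by maximum likelihood (e.g. with the arch package):

```python
import numpy as np

def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance path of a GARCH(1,1) with *given* parameters.

    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1},
    initialized at the unconditional variance omega / (1 - alpha - beta).
    """
    sigma2 = np.empty(len(returns))
    sigma2[0] = omega / (1.0 - alpha - beta)
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```

alpha + beta is the persistence feature mentioned above: the closer to 1, the slower volatility shocks decay.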
09 HAR Rough Volatility
This notebook covers multi-horizon volatility modeling and the Hurst exponent as features for ML trading systems. Uses etfs data.
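A minimal Hurst-exponent estimate via the scaling of lagged-difference dispersion (the lag range is an illustrative choice; rescaled-range and other estimators exist):

```python
import numpy as np

def hurst_exponent(x, lags=range(2, 50)):
    """Hurst exponent from how the spread of (x_{t+lag} - x_t) scales with lag.

    Slope of log(std of differences) vs log(lag): ~0.5 for a random walk,
    < 0.5 mean-reverting (rough), > 0.5 trending.
    """
    lags = np.asarray(list(lags))
    tau = np.array([np.std(x[lag:] - x[:-lag]) for lag in lags])
    slope, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return slope
```

Applied to log-volatility rather than prices, this is the statistic behind the "rough volatility" finding (H around 0.1).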
10 Uncertainty Features
This notebook demonstrates Bayesian and frequentist approaches to extracting uncertainty features — posterior distributions and prediction intervals become ML inputs, not just diagnostics. Uses etfs data.
11 HMM Regimes
This notebook provides a thorough introduction to hidden Markov models (HMMs) for financial regime detection, from first principles through production considerations. Uses etfs, macro data.
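The filtering step at the core of regime detection fits in a few lines. This sketch assumes the Gaussian emission and transition parameters are already known; in practice they come from Baum-Welch/EM (e.g. via hmmlearn):

```python
import numpy as np

def forward_filter(obs, means, stds, trans, pi):
    """Filtered state probabilities P(state_t | obs_{1..t}) for a Gaussian HMM.

    obs: (n,) observations; means, stds: (k,) per-state emission parameters;
    trans: (k, k) row-stochastic transition matrix; pi: (k,) initial distribution.
    """
    n, k = len(obs), len(means)
    alpha = np.zeros((n, k))
    # Gaussian emission likelihoods for every (t, state) pair
    lik = np.exp(-0.5 * ((obs[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    alpha[0] = pi * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        alpha[t] /= alpha[t].sum()        # normalize each step to avoid underflow
    return alpha
```

The rows of alpha are exactly the regime-probability features reused in notebook 13.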
12 Wasserstein Regimes
This notebook implements the methodology from "Clustering Market Regimes Using the Wasserstein Distance" (Horvath et al., 2021). Instead of clustering on moment features (mean, variance, skewness), each time window is treated as an empirical distribution and clustered using optimal transport.
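In one dimension the optimal-transport problem has a closed form, which is what makes this clustering cheap for return windows: transport matches sorted samples. A sketch assuming two equal-size samples:

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size 1-D empirical samples.

    In 1-D the optimal coupling pairs order statistics, so the distance is
    the mean p-th power gap between sorted samples (equal sizes assumed).
    """
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)
```

Clustering then proceeds k-means style, but with each "point" being a window of returns and this distance replacing the Euclidean one.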
13 Regime As Feature
This notebook demonstrates the regime-as-feature methodology: using regime probabilities as input features to ML models, rather than switching between specialized models based on detected regime. Uses etfs, macro data.
Mark B. Garman and Michael J. Klass (1980) — The Journal of Business · 1472 citations
Introduces a set of volatility estimators using Open, High, Low, and Close prices that are up to 8 times more efficient than standard close-to-close variance calculations.
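The headline estimator is a one-liner per bar; a NumPy sketch (array-argument names are illustrative):

```python
import numpy as np

def garman_klass(open_, high, low, close):
    """Per-bar Garman-Klass variance estimate from OHLC prices:
    0.5 * ln(H/L)^2 - (2 ln 2 - 1) * ln(C/O)^2.
    """
    log_hl = np.log(high / low)
    log_co = np.log(close / open_)
    return 0.5 * log_hl ** 2 - (2.0 * np.log(2.0) - 1.0) * log_co ** 2
```

Because it uses the intrabar range, a single bar carries far more information about variance than one close-to-close return does.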
Robert F. Engle (1983) — Journal of Money, Credit and Banking · 503 citations
This paper applies the Autoregressive Conditional Heteroscedasticity (ARCH) model (introduced in Engle's 1982 Econometrica paper) to estimate the time-varying conditional variance of U.S. inflation, revealing that high inflation does not necessarily imply high unpredictability.
Tim Bollerslev (1986) — Journal of Econometrics · 23212 citations
This paper introduces the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model, a significant extension of the ARCH model that allows for more flexible and parsimonious modeling of time-varying volatility by incorporating past conditional variances.
Robert F. Engle and C. W. J. Granger (1987) — Econometrica · 31736 citations
Engle and Granger (1987) formalize cointegration and prove that cointegrated I(1) variables must admit an error-correction representation, then provide practical two-step estimation and simulation-based cointegration tests with empirical macro/finance examples.
James D. Hamilton (1989) — Econometrica · 9717 citations
Hamilton (1989) introduces a maximum-likelihood Markov-switching autoregressive framework and nonlinear filter to infer unobserved regime changes, and shows U.S. real GNP growth is well-described by recurrent expansion/recession regimes with recessions implying an ~3% permanent level loss.
Søren Johansen and Katarina Juselius (1990) — Oxford Bulletin of Economics and Statistics · 11542 citations
This paper presents a maximum likelihood approach for estimating and testing cointegration relationships in vector autoregressive (VAR) models, with a focus on linear restrictions on cointegration vectors and weights, and illustrates the method using money demand data from Denmark and Finland.
Daniel B. Nelson (1991) — Econometrica · 10571 citations
This paper introduces Exponential GARCH (EGARCH) to address limitations of standard GARCH models, such as the inability to capture the negative correlation between returns and volatility, restrictive parameter constraints, and difficulties in interpreting volatility persistence.
Dennis Yang and Qiang Zhang (2000) — The Journal of Business · 470 citations
The authors introduce the 'Yang-Zhang' volatility estimator, which uses OHLC data to provide a minimum-variance estimate that is robust to both price trends (drift) and overnight gaps (opening jumps).
International Asset Allocation With Regime Shifts
Andrew Ang and Geert Bekaert (2002) — Review of Financial Studies · 1567 citations
Despite correlations rising in bear markets, international diversification remains economically valuable, particularly when investors can switch into cash (risk-free assets) during high-volatility regimes.
Fulvio Corsi (2009) — Journal of Financial Econometrics
This paper introduces the Heterogeneous Autoregressive model of Realized Volatility (HAR-RV), a simple additive cascade model using volatility components across different time horizons, which effectively replicates key empirical features of financial returns like long memory and fat tails, while also demonstrating strong forecasting performance.
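The HAR design matrix is just three moving averages of realized variance; a minimal sketch with the conventional 1/5/22-day horizons:

```python
import numpy as np

def har_features(rv):
    """Daily, weekly (5-day), and monthly (22-day) mean realized-variance features.

    Row t holds the regressors available at time t for forecasting rv[t + 1].
    Early rows use a shorter history rather than NaN, for simplicity.
    """
    n = len(rv)
    daily = rv.copy()
    weekly = np.array([rv[max(0, t - 4):t + 1].mean() for t in range(n)])
    monthly = np.array([rv[max(0, t - 21):t + 1].mean() for t in range(n)])
    return np.column_stack([daily, weekly, monthly])
```

Regressing next-day RV on these three columns (OLS suffices) reproduces the slow, multi-horizon decay that mimics long memory.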
Andrew Ang and Allan Timmermann (2011) · 454 citations
This paper reviews how regime-switching models (HMMs) capture the abrupt, persistent changes in financial data (volatility clustering, skewness) that linear models miss, demonstrating that optimal portfolios must dynamically adjust to 'bull' and 'bear' states.
Matthew D. Hoffman and Andrew Gelman (2011) — arXiv:1111.4246 [cs, stat] · 4962 citations
This paper introduces NUTS, an extension of Hamiltonian Monte Carlo that automatically chooses trajectory length (and adaptively tunes step size), delivering HMC-level efficiency without hand-tuning.
Jim Gatheral, Thibault Jaisson, and Mathieu Rosenbaum (2018) — Quantitative Finance
This paper demonstrates that log-volatility behaves like a fractional Brownian motion with a Hurst exponent around 0.1, leading to the Rough FSV model, which aligns well with financial data and improves volatility forecasting.
Alan Moreira and Tyler Muir (2017) — The Journal of Finance · 381 citations
Scaling factor exposures each month by the inverse of last month’s realized variance produces large alphas and materially higher Sharpe ratios across many factors because volatility forecasts risk much more than it forecasts expected returns.
Advances in Financial Machine Learning
Marcos Lopez de Prado (2018) — John Wiley & Sons · 106 citations
This book adapts machine learning to the specifics of financial data, introducing techniques such as fractional differentiation, triple-barrier labeling, and purged cross-validation.
Michael Betancourt (2018) — arXiv:1701.02434 [stat] · 1393 citations
This tutorial explains Hamiltonian Monte Carlo (HMC) through the geometry of the “typical set,” showing why gradient-informed, energy-conserving trajectories can explore high-dimensional posteriors far more efficiently—and how tuning/diagnostics (mass matrix, step size, trajectory length, divergences) make or break performance.
Blanka Horvath et al. (2021) — arXiv
This paper introduces the Wasserstein k-means (WK-means) algorithm, a robust, non-parametric method for clustering financial time series into distinct market regimes by treating segments as probability distributions and using the p-Wasserstein distance, outperforming traditional moment-based k-means and HMMs, especially for non-Gaussian data.
A. Sinem Uysal and John M. Mulvey (2021) — The Journal of Financial Data Science · 20 citations
The paper uses supervised ML (especially random forests) to predict recessions and equity “crash” regimes from macro data and then uses these probabilities to improve risk parity portfolios via regime-aware covariance estimation and overlay trades.
Stephen Marra (2023) — The Journal of Portfolio Management
A practitioner-focused survey comparing common volatility forecasting models (historical, ARMA/GARCH, and option-implied) and showing why relatively simple, well-designed historical models can be robust inputs for volatility targeting and risk-parity allocation.
This paper introduces Conformal Prediction for Time-series with Change points (CPTC), an algorithm that couples a model of the underlying states with online conformal prediction to quantify uncertainty for time series subject to change points, demonstrating improved validity and adaptivity over state-of-the-art baselines.
Yizhan Shu and John M. Mulvey (2025) — The Journal of Portfolio Management · 2 citations
A dynamic factor allocation strategy using Sparse Jump Models (SJM) to identify active return regimes improves the Information Ratio from 0.05 to ~0.45 compared to an equal-weighted benchmark.
Ilya Chevyrev and Andrey Kormilitzin (2016) — arXiv
This paper introduces the signature method, a way to transform time-ordered data into a set of features using iterated integrals, and discusses its theoretical properties and machine learning applications, including handwritten digit classification.
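For piecewise-linear paths the depth-2 signature has a closed form; a small NumPy sketch (the esig package used in notebook 06 computes arbitrary depths):

```python
import numpy as np

def signature_level2(path):
    """Signature terms up to depth 2 of a piecewise-linear d-dimensional path.

    path: (n_points, d) array. Returns (level1, level2) where
    level1[i] is the total increment of coordinate i and
    level2[i, j] is the iterated integral of dX^i dX^j (exact for linear segments).
    """
    inc = np.diff(path, axis=0)               # per-segment increments, (n-1, d)
    level1 = inc.sum(axis=0)
    # cumulative increment of each coordinate *before* each segment starts
    before = np.vstack([np.zeros(path.shape[1]), np.cumsum(inc, axis=0)[:-1]])
    # exact level-2 term for linear interpolation between sample points
    level2 = before.T @ inc + 0.5 * inc.T @ inc
    return level1, level2
```

The antisymmetric part of level2 is the Lévy area, the first genuinely order-sensitive feature; the symmetric part is determined by level1 (Chen's identity: S^{ij} + S^{ji} = S^i S^j).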
Michael Parkinson (1980) — The Journal of Business · 1938 citations
This paper introduces the extreme value method for estimating the variance of the rate of return of a common stock, demonstrating its superior efficiency compared to the traditional method using closing prices.