Learning Objectives
- Formalize a backtest as an explicit trading protocol covering signal timing, execution, rebalancing, sizing, costs, constraints, data availability, and benchmark choice
- Distinguish vectorized and event-driven backtesting in terms of protocol semantics, state dependence, and appropriate use cases rather than treating one style as universally superior
- Build and interpret a transparent non-ML baseline strategy that provides a stable reference point for later model comparisons
- Evaluate a strategy using a core reporting stack that includes gross and net performance, drawdowns, turnover, baseline comparison, cost sensitivity, and regime-sliced diagnostics
- Assess whether a reported Sharpe ratio is credible by separating fixed-strategy estimation error from search-aware inference and applying tools such as confidence intervals, Reality Check logic, and the Deflated Sharpe Ratio
- Explain why prediction quality and trading quality can diverge, and why IC alone is insufficient for selecting deployable strategies
This section reframes backtesting from verification (does it look good?) to falsification (what would disprove it?). Drawing on Bailey et al. (2015) and Lopez de Prado (2018), it catalogs six recurrent failure modes -- lookahead bias, survivorship bias, data snooping, unrealistic execution, cost underestimation, and regime fragility -- and maps each to a specific diagnostic test. The section also distinguishes three complementary simulation frameworks (walk-forward, resampling, Monte Carlo) and argues that the default posture toward any backtest result should be disbelief until evidence survives multiple documented rejection attempts.
1 notebook
A backtest is only as informative as the trading protocol it enforces. This section defines six required protocol components -- signal timing, rebalancing frequency, position sizing, order types, cost model, and constraints -- and shows through a momentum strategy example how two researchers can report dramatically different Sharpe ratios (2.0 vs 0.5) from identical data simply by making different protocol choices. It also introduces cost sensitivity analysis as a protocol requirement: varying spread, slippage, and impact across plausible ranges to identify break-even levels, so that a strategy is never accepted based on a single optimistic cost assumption.
1 notebook
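The break-even idea above can be sketched as a sweep over per-trade cost assumptions. Everything below is illustrative: the simulated gross returns, the flat 50% monthly turnover, and the `net_sharpe` helper are hypothetical, not the book's implementation.

```python
import numpy as np

def net_sharpe(gross_returns, turnover, cost_bps, periods_per_year=12):
    """Annualized net Sharpe after charging cost_bps per unit of turnover."""
    costs = turnover * cost_bps / 1e4
    net = gross_returns - costs
    return net.mean() / net.std(ddof=1) * np.sqrt(periods_per_year)

rng = np.random.default_rng(0)
gross = rng.normal(0.004, 0.02, 120)   # 10 years of simulated monthly gross returns
turnover = np.full(120, 0.5)           # assume 50% one-way turnover per month

# Sweep cost assumptions instead of committing to one optimistic number;
# the bps level where net Sharpe crosses zero is the break-even cost.
for bps in [0, 5, 10, 25, 50]:
    print(f"{bps:>3} bps -> net Sharpe {net_sharpe(gross, turnover, bps):+.2f}")
```

A real protocol would vary spread, slippage, and impact separately; collapsing them into a single bps-per-turnover charge is a simplification for the sketch.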
This section distinguishes two simulation representations: vectorized backtests that express the trading protocol as precomputed arrays, and event-driven backtests that track positions, cash, and orders bar by bar. Rather than declaring one approach superior, it identifies when each is appropriate -- vectorized for strategies whose protocol can be specified ex ante, event-driven when later actions depend on evolving state like realized P&L, margin, or contingent orders. A detailed table of ten simulation semantics (fill timing, order sequencing, fractional shares, cash handling, etc.) shows how even simple strategies produce different results across engines when these behaviors differ.
3 notebooks
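A minimal illustration of the vectorized representation: the whole protocol is expressed ex ante as arrays, with the one-bar shift encoding signal timing. The one-bar momentum rule and simulated prices are toy assumptions, not a strategy from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
prices = 100 * np.cumprod(1 + rng.normal(0.0005, 0.01, 500))
returns = np.diff(prices) / prices[:-1]

# Vectorized protocol: signal observed at bar t is traded at bar t+1,
# so the shift below is what prevents lookahead.
signal = (prices[1:] > prices[:-1]).astype(float)   # toy momentum rule
position = np.concatenate(([0.0], signal[:-1]))     # act one bar later
strategy_returns = position * returns

# An event-driven engine would instead loop bar by bar, updating cash,
# positions, and contingent orders -- necessary once actions depend on
# evolving state such as realized P&L or margin.
print("cumulative return:", np.prod(1 + strategy_returns) - 1)
```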
Before evaluating ML models, this section constructs a transparent non-ML baseline: a momentum strategy with a yield-curve regime filter on ten liquid ETFs. The baseline is intentionally unsophisticated (6-month risk-adjusted momentum, top-3 equal-weight, monthly rebalancing at 5 bps per trade) and its reported Sharpe of 0.76 fails to beat a static 60/40 benchmark on total return. That deliberately unimpressive result is the point -- it provides an honest yardstick that every subsequent ML strategy must beat under identical protocol conditions, preventing researchers from comparing complex models against unrealistically weak straw men.
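The baseline's selection step can be sketched roughly as follows. The simulated returns and `ETF0`-style ticker names are placeholders, and the sketch omits the yield-curve regime filter and the 5 bps cost accounting described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
tickers = [f"ETF{i}" for i in range(10)]              # hypothetical 10-ETF universe
monthly = pd.DataFrame(rng.normal(0.005, 0.04, (60, 10)),
                       columns=tickers)               # 5y of simulated monthly returns

# 6-month risk-adjusted momentum: trailing mean return / trailing volatility
mom = monthly.rolling(6).mean() / monthly.rolling(6).std()

# Each month, hold the top-3 names equal-weight (rebalanced monthly)
latest = mom.iloc[-1]
top3 = latest.nlargest(3).index.tolist()
weights = pd.Series(1 / 3, index=top3)
print("selected:", top3)
```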
This section defines the standardized metric set used throughout the book, organized into return metrics (CAGR, cumulative return), risk metrics (volatility, maximum drawdown, Calmar ratio), risk-adjusted metrics (Sharpe, Sortino), and trading metrics (turnover, gross/net exposure). It devotes particular attention to Sharpe ratio inference, showing that autocorrelation can widen confidence intervals by 35% and that naive annualization fails under serial dependence. A NASDAQ-100 intraday case study illustrates the stakes: gross Sharpe of +1.76 collapses to -62.61 net, demonstrating that cost and timing assumptions dominate high-turnover strategies.
- [`14_cross_dataset_signal_quality`](code/14_cross_dataset_signal_quality.ipynb)
1 notebook
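One way to see why naive `sqrt(periods)` annualization fails under serial dependence is the autocorrelation-adjusted scaling of Lo (2002), sketched below on simulated AR(1) returns. The helper and parameters are assumptions for illustration, not the book's exact procedure.

```python
import numpy as np

def annualized_sharpe(returns, periods=12, adjust_autocorr=False):
    """Annualize a per-period Sharpe; optionally apply Lo's (2002)
    autocorrelation correction instead of the naive sqrt(periods) rule."""
    sr = returns.mean() / returns.std(ddof=1)
    if not adjust_autocorr:
        return sr * np.sqrt(periods)
    r = returns - returns.mean()
    rho = [np.dot(r[:-k], r[k:]) / np.dot(r, r) for k in range(1, periods)]
    scale = periods / np.sqrt(periods + 2 * sum((periods - k) * rho[k - 1]
                                                for k in range(1, periods)))
    return sr * scale

rng = np.random.default_rng(3)
eps = rng.normal(0, 0.01, 600)
r = np.zeros(600)
for t in range(1, 600):               # AR(1) with positive autocorrelation:
    r[t] = 0.3 * r[t - 1] + eps[t]    # the naive rule overstates |Sharpe|
r += 0.001

print("naive   :", round(annualized_sharpe(r), 2))
print("adjusted:", round(annualized_sharpe(r, adjust_autocorr=True), 2))
```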
Aggregate metrics can hide dangerous state dependence -- a strategy with a respectable overall Sharpe may profit only in bull markets while losing heavily in crises. This section presents a systematic regime diagnostic workflow using volatility and trend state variables to slice performance into four regimes. A simulation confirms the pattern: the tested strategy posts Sharpe +1.19 in bull markets but -0.27 in bear markets with 59% drawdowns. The section also warns against regime snooping (testing many regime definitions and selecting favorable ones) and requires pre-specified regime definitions reported in every backtest.
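The four-regime slice can be sketched with two pre-specified state variables, trailing trend and trailing volatility. The simulated series, 12-month windows, and median volatility threshold are hypothetical choices; the point is that the definitions are fixed before results are inspected.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
market = pd.Series(rng.normal(0.006, 0.04, 240))  # simulated monthly market returns
strat = pd.Series(rng.normal(0.004, 0.03, 240))   # simulated strategy returns

# Pre-specified state variables -- defined before looking at performance,
# to avoid regime snooping.
trend = market.rolling(12).mean() > 0
high_vol = market.rolling(12).std() > market.rolling(12).std().median()

regime = pd.Series("bear/calm", index=market.index)
regime[trend & ~high_vol] = "bull/calm"
regime[trend & high_vol] = "bull/vol"
regime[~trend & high_vol] = "bear/vol"

# Slice the strategy's annualized Sharpe by regime
sharpe_by_regime = strat.groupby(regime).apply(
    lambda r: r.mean() / r.std(ddof=1) * np.sqrt(12))
print(sharpe_by_regime)
```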
Strategy-level overfitting occurs when researchers test many parameterizations and report only the best, inflating observed Sharpe by the expected maximum of noise. This section presents three complementary defenses: White's Reality Check for family-wide inference; the Deflated Sharpe Ratio (DSR), which adjusts for non-normality, sample length, and search breadth; and Rademacher Anti-Serum (RAS) for correlated strategy families. It derives the minimum backtest length condition and shows that broad searches require materially higher observed Sharpe to achieve the same significance level. Across the book's case study sweeps, DSR materially changes conclusions for several candidates.
- [`14_cross_dataset_signal_quality`](code/14_cross_dataset_signal_quality.ipynb)
1 notebook
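A sketch of the DSR mechanics in the standard Bailey and Lopez de Prado formulation: deflate the Sharpe benchmark to the expected maximum of N noise trials, then evaluate the observed per-period Sharpe against it with skewness and kurtosis corrections. The inputs below (trial count, trial variance, sample length) are hypothetical.

```python
import numpy as np
from scipy import stats

def deflated_sharpe(sr, n_trials, var_trials, T, skew=0.0, kurt=3.0):
    """Deflated Sharpe Ratio: probability that the observed per-period
    Sharpe sr exceeds the expected max Sharpe of n_trials noise trials."""
    gamma = 0.5772156649  # Euler-Mascheroni constant
    # Expected maximum Sharpe across n_trials under pure noise
    sr_star = np.sqrt(var_trials) * (
        (1 - gamma) * stats.norm.ppf(1 - 1 / n_trials)
        + gamma * stats.norm.ppf(1 - 1 / (n_trials * np.e)))
    # Probabilistic Sharpe Ratio test statistic against sr_star
    z = ((sr - sr_star) * np.sqrt(T - 1)
         / np.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2))
    return stats.norm.cdf(z)

# Identical observed Sharpe, very different credibility after a broad search
print("10 trials  :", deflated_sharpe(sr=0.1, n_trials=10, var_trials=0.01, T=252))
print("1000 trials:", deflated_sharpe(sr=0.1, n_trials=1000, var_trials=0.01, T=252))
```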
Applying the chapter's falsification discipline to all nine case studies simultaneously, this section establishes three findings. First, the IC champion and the signal-stage Sharpe champion are different model families in most studies -- IC measures average rank accuracy while Sharpe depends on tail accuracy of the top-K selections. Second, no single model family dominates: GBM wins the most Sharpe races through implementation robustness, but most configurations within any family lose money. Third, rebalancing cadence mediates the IC-to-Sharpe translation more than any model choice, with monthly strategies converting modest IC into positive Sharpe while high-frequency strategies succeed only with classification labels that reduce unnecessary position changes.
- [`14_cross_dataset_signal_quality`](code/14_cross_dataset_signal_quality.ipynb)
1 notebook
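The IC-versus-Sharpe gap can be illustrated on synthetic cross-sections: average rank correlation and top-K portfolio Sharpe are computed from the same predictions yet reward different error profiles. All data and parameters below are simulated assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_months, n_assets, k = 120, 100, 10

# Simulated predictions with weak but real rank accuracy
true_signal = rng.normal(0, 1, (n_months, n_assets))
realized = 0.02 * true_signal + rng.normal(0, 1, (n_months, n_assets))
pred = true_signal + rng.normal(0, 2, (n_months, n_assets))

# IC: average cross-sectional rank correlation of predictions vs outcomes
ic = np.mean([stats.spearmanr(pred[t], realized[t])[0]
              for t in range(n_months)])

# Top-K portfolio: equal-weight the K highest-ranked names each month,
# so only tail accuracy of the top picks matters
top = np.argsort(pred, axis=1)[:, -k:]
port = np.array([realized[t, top[t]].mean() for t in range(n_months)])
sharpe = port.mean() / port.std(ddof=1) * np.sqrt(12)

print(f"mean IC {ic:+.3f}, top-{k} Sharpe {sharpe:+.2f}")
```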
Related Case Studies
See where these chapter concepts get applied in end-to-end trading workflows.
- **ETF Cross-Asset Exposures** -- all six model families compared across 100 ETFs spanning 9 asset classes
- **Crypto Perpetuals Funding** -- alternative data and non-standard frequencies in 24/7 crypto markets
- **NASDAQ-100 Microstructure** -- intraday microstructure signals across 114 stocks at 15-minute frequency
- **S&P 500 Equity + Option Analytics** -- combining options-derived features with equity data for multi-source prediction
- **US Firm Characteristics** -- classic factor investing with ML on monthly fundamental data
- **FX Spot Pairs** -- momentum and carry factors in the world's most liquid market
- **CME Futures** -- carry signals across 30 products, with data quality as the critical variable
- **S&P 500 Options (Straddles)** -- direct options trading and why equity-style cost models fail for options
- **US Equities Panel** -- large-scale cross-sectional prediction across 3,200 stocks with 16 walk-forward folds