Chapter 16

Strategy Simulation

8 sections · 14 notebooks · 16 references

Learning Objectives

  • Formalize a backtest as an explicit trading protocol covering signal timing, execution, rebalancing, sizing, costs, constraints, data availability, and benchmark choice
  • Distinguish vectorized and event-driven backtesting in terms of protocol semantics, state dependence, and appropriate use cases rather than treating one style as universally superior
  • Build and interpret a transparent non-ML baseline strategy that provides a stable reference point for later model comparisons
  • Evaluate a strategy using a core reporting stack that includes gross and net performance, drawdowns, turnover, baseline comparison, cost sensitivity, and regime-sliced diagnostics
  • Assess whether a reported Sharpe ratio is credible by separating fixed-strategy estimation error from search-aware inference and applying tools such as confidence intervals, Reality Check logic, and the Deflated Sharpe Ratio
  • Explain why prediction quality and trading quality can diverge, and why IC alone is insufficient for selecting deployable strategies
Figure 16.1
16.1

This section reframes backtesting from verification (does it look good?) to falsification (what would disprove it?). Drawing on Bailey et al. (2015) and Lopez de Prado (2018), it catalogs six recurrent failure modes -- lookahead bias, survivorship bias, data snooping, unrealistic execution, cost underestimation, and regime fragility -- and maps each to a specific diagnostic test. The section also distinguishes three complementary simulation frameworks (walk-forward, resampling, Monte Carlo) and argues that the default posture toward any backtest result should be disbelief until evidence survives multiple documented rejection attempts.
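
Of the six failure modes, lookahead bias has the most mechanical diagnostic: lag the signal by one bar and see whether the result survives. A minimal sketch on synthetic iid returns (the sign-based "signal" and all parameters are illustrative, not the chapter's code):

```python
import numpy as np

rng = np.random.default_rng(0)
ret = rng.normal(0.0, 0.01, 1000)          # synthetic iid daily returns

leaky_signal = np.sign(ret)                # peeks at the bar it trades on
honest_signal = np.sign(np.roll(ret, 1))   # yesterday's sign, tradable today
honest_signal[0] = 0.0                     # no history on the first bar

def ann_sharpe(pnl):
    return pnl.mean() / pnl.std() * np.sqrt(252)

leaky_sr = ann_sharpe(leaky_signal * ret)    # spuriously huge: pure lookahead
honest_sr = ann_sharpe(honest_signal * ret)  # ~0, as iid noise deserves
```

A strategy whose Sharpe collapses under the one-bar lag is using information it would not have had at trade time.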

1 notebook

16.2

A backtest is only as informative as the trading protocol it enforces. This section defines six required protocol components -- signal timing, rebalancing frequency, position sizing, order types, cost model, and constraints -- and shows through a momentum strategy example how two researchers can report dramatically different Sharpe ratios (2.0 vs 0.5) from identical data simply by making different protocol choices. It also introduces cost sensitivity analysis as a protocol requirement: varying spread, slippage, and impact across plausible ranges to identify break-even levels, so that a strategy is never accepted based on a single optimistic cost assumption.
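
The cost sensitivity requirement can be sketched in a few lines. Everything below is illustrative (synthetic returns, a placeholder random signal, a flat per-unit-turnover cost), not the chapter's actual protocol:

```python
import numpy as np

rng = np.random.default_rng(1)
ret = rng.normal(0.0, 0.01, 1500)                # synthetic daily returns
pos = np.sign(rng.normal(size=1500))             # placeholder daily positions
turnover = np.abs(np.diff(pos, prepend=0.0))     # units traded each bar

def net_sharpe(cost_bps):
    """Annualized Sharpe after charging cost_bps per unit of turnover."""
    pnl = pos * ret - turnover * cost_bps * 1e-4
    return pnl.mean() / pnl.std() * np.sqrt(252)

# Sweep plausible cost levels instead of trusting one optimistic number;
# the break-even is the level at which net Sharpe crosses zero.
grid = np.arange(0.0, 55.0, 5.0)                 # 0 to 50 bps
curve = {c: net_sharpe(c) for c in grid}
```

Reporting the whole curve, rather than one point on it, is what prevents acceptance on a single optimistic cost assumption.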

1 notebook

16.3

This section distinguishes two simulation representations: vectorized backtests that express the trading protocol as precomputed arrays, and event-driven backtests that track positions, cash, and orders bar by bar. Rather than declaring one approach superior, it identifies when each is appropriate -- vectorized for strategies whose protocol can be specified ex ante, event-driven when later actions depend on evolving state like realized P&L, margin, or contingent orders. A detailed table of ten simulation semantics (fill timing, order sequencing, fractional shares, cash handling, etc.) shows how even simple strategies produce different results across engines when these behaviors differ.
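
The equivalence, and its fragility, can be shown on a toy one-asset strategy. In the sketch below the event-driven loop reproduces the vectorized result only because the fill timing matches; moving the position update after the P&L accrual line would silently change the semantics. All names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
ret = rng.normal(0.0, 0.01, 500)                      # bar returns
signal = np.sign(np.concatenate(([0.0], ret[:-1])))   # yesterday's sign, lagged

# Vectorized: the whole protocol as one array expression.
vec_pnl = float((signal * ret).sum())

# Event-driven: walk bar by bar, tracking position state explicitly.
pos, evt_pnl = 0.0, 0.0
for t in range(len(ret)):
    pos = signal[t]          # order filled at the open of bar t
    evt_pnl += pos * ret[t]  # P&L accrues over bar t

# The two agree only because both engines assume the same fill timing.
```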

3 notebooks

16.4

Before evaluating ML models, this section constructs a transparent non-ML baseline: a momentum strategy with a yield-curve regime filter on ten liquid ETFs. The baseline is intentionally unsophisticated (6-month risk-adjusted momentum, top-3 equal-weight, monthly rebalancing at 5 bps per trade) and its reported Sharpe of 0.76 fails to beat a static 60/40 benchmark on total return. That deliberately unimpressive result is the point -- it provides an honest yardstick that every subsequent ML strategy must beat under identical protocol conditions, preventing researchers from comparing complex models against unrealistically weak straw men.
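
The baseline's mechanics fit in a short loop. The sketch below runs the stated rules (6-month risk-adjusted momentum, top-3 equal weight, monthly rebalancing, 5 bps per unit of turnover) on synthetic monthly returns; the universe and data are placeholders, not the chapter's ten ETFs:

```python
import numpy as np

rng = np.random.default_rng(3)
n_assets, n_months = 10, 120
monthly = rng.normal(0.005, 0.04, (n_months, n_assets))  # synthetic monthly returns

LOOKBACK, TOP_K, COST = 6, 3, 5e-4                       # 6m momentum, top-3, 5 bps

w = np.zeros(n_assets)
pnl = []
for t in range(LOOKBACK, n_months):
    window = monthly[t - LOOKBACK:t]                     # trailing window only
    score = window.mean(axis=0) / (window.std(axis=0) + 1e-12)  # risk-adjusted
    new_w = np.zeros(n_assets)
    new_w[np.argsort(score)[-TOP_K:]] = 1.0 / TOP_K      # top-3 equal weight
    cost = COST * np.abs(new_w - w).sum()                # 5 bps on turnover
    pnl.append(new_w @ monthly[t] - cost)
    w = new_w
pnl = np.array(pnl)
sharpe = pnl.mean() / pnl.std() * np.sqrt(12)
```

Holding these protocol choices fixed is what makes later ML comparisons like-for-like.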

16.5

This section defines the standardized metric set used throughout the book, organized into return metrics (CAGR, cumulative return), risk metrics (volatility, maximum drawdown, Calmar ratio), risk-adjusted metrics (Sharpe, Sortino), and trading metrics (turnover, gross/net exposure). It devotes particular attention to Sharpe ratio inference, showing that autocorrelation can widen confidence intervals by 35% and that naive annualization fails under serial dependence. A NASDAQ-100 intraday case study illustrates the stakes: gross Sharpe of +1.76 collapses to -62.61 net, demonstrating that cost and timing assumptions dominate high-turnover strategies.

- [`14_cross_dataset_signal_quality`](code/14_cross_dataset_signal_quality.ipynb)
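
The autocorrelation effect on Sharpe inference can be sketched with a Lo (2002)-style standard error plus an AR(1) inflation factor; the AR(1) approximation and the parameters below are illustrative assumptions, not the book's exact procedure:

```python
import numpy as np

def sharpe_ci(ret, z=1.96):
    """95% CI for the per-period Sharpe: iid Lo (2002) standard error,
    then an AR(1)-based widening for lag-1 autocorrelation (an approximation)."""
    T = len(ret)
    sr = ret.mean() / ret.std(ddof=1)
    se_iid = np.sqrt((1 + 0.5 * sr**2) / T)
    rho = np.corrcoef(ret[:-1], ret[1:])[0, 1]      # lag-1 autocorrelation
    se_adj = se_iid * np.sqrt((1 + rho) / (1 - rho))
    return (sr - z * se_iid, sr + z * se_iid), (sr - z * se_adj, sr + z * se_adj)

# AR(1) returns with positive serial dependence widen the interval.
rng = np.random.default_rng(4)
r = np.zeros(2000)
for t in range(1, 2000):
    r[t] = 0.3 * r[t - 1] + rng.normal(0.0002, 0.01)
ci_iid, ci_adj = sharpe_ci(r)
```

With lag-1 autocorrelation near 0.3 the inflation factor is about 1.36, the order of magnitude behind the 35% widening cited above.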

1 notebook

16.6

Aggregate metrics can hide dangerous state dependence -- a strategy with a respectable overall Sharpe may profit only in bull markets while losing heavily in crises. This section presents a systematic regime diagnostic workflow using volatility and trend state variables to slice performance into four regimes. A simulation confirms the pattern: the tested strategy posts Sharpe +1.19 in bull markets but -0.27 in bear markets with 59% drawdowns. The section also warns against regime snooping (testing many regime definitions and selecting favorable ones) and requires pre-specified regime definitions reported in every backtest.
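
A minimal version of the regime-slicing workflow, with definitions pre-specified up front (trailing 63-day trend sign crossed with a median split of trailing volatility), run on synthetic data; the window and thresholds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
mkt = rng.normal(0.0003, 0.01, 2000)                  # market returns
strat = 0.5 * mkt + rng.normal(0.0, 0.005, 2000)      # toy strategy with beta

W = 63  # pre-specified lookback (~3 trading months), fixed before testing
trend = np.array([mkt[t - W:t].mean() if t >= W else np.nan for t in range(len(mkt))])
vol = np.array([mkt[t - W:t].std() if t >= W else np.nan for t in range(len(mkt))])

valid = ~np.isnan(trend)
up = trend > 0
hi_vol = vol > np.nanmedian(vol)

regimes = {
    "bull/calm": valid & up & ~hi_vol,
    "bull/vol":  valid & up & hi_vol,
    "bear/calm": valid & ~up & ~hi_vol,
    "bear/vol":  valid & ~up & hi_vol,
}
regime_sharpe = {name: strat[mask].mean() / strat[mask].std() * np.sqrt(252)
                 for name, mask in regimes.items() if mask.sum() > 20}
```

Fixing `W` and the median split before looking at results is exactly the discipline that guards against regime snooping.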

16.7

Strategy-level overfitting occurs when researchers test many parameterizations and report only the best, inflating observed Sharpe by the expected maximum of noise. This section presents three complementary defenses: White's Reality Check for family-wide inference; the Deflated Sharpe Ratio (DSR), which adjusts for non-normality, sample length, and search breadth; and Rademacher Anti-Serum (RAS) for correlated strategy families. It derives the minimum backtest length condition and shows that broad searches require materially higher observed Sharpe to achieve the same significance level. Across the book's case study sweeps, DSR materially changes conclusions for several candidates.

- [`14_cross_dataset_signal_quality`](code/14_cross_dataset_signal_quality.ipynb)
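
The DSR computation itself is short. The sketch below follows the Bailey and López de Prado (2014) formulas using the stdlib normal distribution; the default moments and the treatment of trials as independent are simplifications:

```python
import math
from statistics import NormalDist

def deflated_sharpe(sr, T, n_trials, sr_var, skew=0.0, kurt=3.0):
    """Probability that the observed per-period Sharpe `sr` exceeds the expected
    maximum Sharpe of `n_trials` (> 1) pure-noise strategies whose Sharpe
    estimates have cross-trial variance `sr_var`."""
    N = NormalDist()
    gamma = 0.5772156649015329                   # Euler-Mascheroni constant
    # Expected maximum Sharpe under the search (the deflation benchmark):
    sr0 = math.sqrt(sr_var) * ((1 - gamma) * N.inv_cdf(1 - 1 / n_trials)
                               + gamma * N.inv_cdf(1 - 1 / (n_trials * math.e)))
    # Probabilistic Sharpe Ratio of sr against sr0, with moment corrections:
    z = (sr - sr0) * math.sqrt(T - 1) / math.sqrt(
        1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return N.cdf(z)

# The same observed Sharpe is far less convincing after a broad search:
few = deflated_sharpe(sr=0.1, T=1000, n_trials=10, sr_var=0.01)
many = deflated_sharpe(sr=0.1, T=1000, n_trials=1000, sr_var=0.01)
```

Holding `sr` fixed while `n_trials` grows drives the DSR toward zero, which is the "broad searches require higher observed Sharpe" condition in reverse.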

1 notebook

16.8

Applying the chapter's falsification discipline to all nine case studies simultaneously, this section establishes three findings. First, the IC champion and the signal-stage Sharpe champion are different model families in most studies -- IC measures average rank accuracy while Sharpe depends on tail accuracy of the top-K selections. Second, no single model family dominates: GBM wins the most Sharpe races through implementation robustness, but most configurations within any family lose money. Third, rebalancing cadence mediates the IC-to-Sharpe translation more than any model choice, with monthly strategies converting modest IC into positive Sharpe while high-frequency strategies succeed only with classification labels that reduce unnecessary position changes.

- [`14_cross_dataset_signal_quality`](code/14_cross_dataset_signal_quality.ipynb)
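
The IC-versus-Sharpe gap is easy to make concrete: IC averages rank agreement over the whole cross-section, while portfolio P&L depends only on the top-K names actually held. The simulation below (a synthetic monthly cross-section with illustrative noise levels) computes both quantities side by side:

```python
import numpy as np

rng = np.random.default_rng(7)
n_periods, n_assets, K = 250, 100, 10          # periods treated as months

def rank_ic(pred, real):
    """Spearman-style IC: Pearson correlation of double-argsort ranks."""
    rp, rr = pred.argsort().argsort(), real.argsort().argsort()
    return np.corrcoef(rp, rr)[0, 1]

ics, topk_ret = [], []
for _ in range(n_periods):
    mu = rng.normal(0.0, 0.01, n_assets)                 # true expected returns
    realized = mu + rng.normal(0.0, 0.03, n_assets)      # realized returns
    pred = mu + rng.normal(0.0, 0.02, n_assets)          # noisy forecast
    ics.append(rank_ic(pred, realized))
    topk_ret.append(realized[np.argsort(pred)[-K:]].mean())  # top-K equal weight

ic_mean = float(np.mean(ics))                            # average rank accuracy
topk_sharpe = np.mean(topk_ret) / np.std(topk_ret) * np.sqrt(12)  # tail accuracy
```

Because the two numbers are computed on different slices of the forecast, ranking model families by `ic_mean` and by `topk_sharpe` can, and in the case studies does, produce different champions.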

1 notebook