Chapter 20

Strategy Synthesis

9 sections · 6 notebooks · 11 references

Learning Objectives

  • Explain why the information coefficient is a useful entry metric for financial signals but does not translate directly into portfolio performance
  • Distinguish signal quality, portfolio translation, cost survival, and temporal stability as separate stages in the research pipeline
  • Compare how major model families perform after the full pipeline, and identify when robustness matters more than peak signal quality
  • Diagnose holdout disappointment using distinct failure modes, including prediction decay, translation decay, and structural breaks
  • Evaluate trading strategies under realistic implementation constraints, including instrument-appropriate cost models, capacity limits, and execution assumptions
  • Identify the highest-return next steps after a first research pass, including label redesign, ensembling, feature engineering, and further hypothesis-driven iteration
  • Apply a practitioner workflow that moves from data and diagnostics through signal generation, strategy construction, and holdout validation
Figure 20.1
20.1

The Nine Case Studies: From Signal to Verdict

Each of the nine case studies receives a verdict -- advance, iterate, or reframe -- based on where the pipeline first imposes a binding constraint. US firm characteristics produced the strongest integrated result (validation Sharpe +3.03, holdout +2.52) but is capacity-constrained in small caps. FX is the only study where holdout exceeds validation. CME futures is the data quality teaching case where a single back-adjustment choice cascades through three failures. S&P 500 options has a real ML signal destroyed by 1,091 basis point median round-trip spreads. Crypto shows complete holdout failure as a bull run reversed learned funding patterns. The verdicts demonstrate that raw signal quality never settles the case on its own.

2 notebooks

20.2

When IC Lies: Signal Quality Beyond the Information Coefficient

IC is dangerously incomplete as a verdict metric: NASDAQ-100 has the weakest IC in the book (0.008) but the highest Sharpe (4.22), while the ETF study sees IC improve fivefold in holdout as Sharpe decays 55%. This section introduces a richer diagnostic bundle -- ICIR for consistency across folds, positive-fold share for regime dependence, and checkpoint sensitivity for deep learning selection risk. It also argues that label engineering (classification vs. regression, horizon choice, winsorization) may have higher return on investment than model architecture research, citing US firm characteristics where winsorizing extreme returns lifts GBM IC ninefold -- a larger effect than any model family difference.

1 notebook
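The diagnostic bundle described above can be sketched in a few lines. This is a minimal illustration on synthetic walk-forward folds, not the book's notebook code: `rank_ic` is a hypothetical helper (Spearman correlation via double argsort, assuming no ties), and the fold count and noise levels are made up.

```python
import numpy as np

def rank_ic(preds, rets):
    """Spearman rank correlation via double argsort (assumes no ties)."""
    rp = np.argsort(np.argsort(preds))
    rr = np.argsort(np.argsort(rets))
    return np.corrcoef(rp, rr)[0, 1]

# Hypothetical walk-forward folds: a weak signal buried in noise.
rng = np.random.default_rng(0)
fold_ics = []
for _ in range(8):
    preds = rng.normal(size=500)
    rets = 0.05 * preds + rng.normal(size=500)
    fold_ics.append(rank_ic(preds, rets))

fold_ics = np.array(fold_ics)
mean_ic = fold_ics.mean()
icir = mean_ic / fold_ics.std(ddof=1)   # consistency across folds
pos_share = (fold_ics > 0).mean()       # regime-dependence proxy
print(f"IC={mean_ic:.3f}  ICIR={icir:.2f}  positive-fold share={pos_share:.0%}")
```

The point of reporting all three numbers together is that a respectable mean IC can hide an ICIR near zero or a positive-fold share of 50%, either of which signals regime dependence rather than a reliable edge.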

20.3

How Predictions Become Profits: The IC-to-Sharpe Translation

The Fundamental Law of Active Management provides the framework for understanding why IC alone does not determine strategy performance: breadth -- the number of independent bets per period -- mediates the translation. NASDAQ-100 converts near-zero IC into the highest Sharpe through enormous intraday breadth, while FX is capped by a 20-pair universe regardless of model quality. The section shows that rebalancing cadence is the hidden multiplier (the same signal can be worthless or profitable at different frequencies), that moderate selectivity raises median Sharpe but aggressive concentration gives back gains, and that portfolio construction is the neglected middle where lower-IC models can outperform higher-IC models through better score-to-weight translation.

1 notebook
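The breadth mechanism above follows directly from the Fundamental Law, IR ≈ IC × √breadth. The sketch below uses illustrative numbers (the universe sizes and bet frequencies are assumptions, not the book's exact figures) to show how a near-zero IC with enormous intraday breadth can out-earn a larger IC on a 20-pair universe.

```python
import math

def expected_ir(ic, breadth):
    """Fundamental Law of Active Management: IR ≈ IC * sqrt(breadth)."""
    return ic * math.sqrt(breadth)

# Illustrative only: ~100 names with many intraday bets per year
# versus weekly bets on 20 FX pairs.
high_breadth = expected_ir(0.008, 100 * 252 * 8)
low_breadth = expected_ir(0.05, 20 * 52)
print(f"high-breadth IR ~{high_breadth:.2f} vs low-breadth IR ~{low_breadth:.2f}")
```

The same arithmetic explains why rebalancing cadence acts as a hidden multiplier: changing cadence changes the number of independent bets per year, and hence breadth, without touching IC.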

20.4

Robustness Beats Peak Signal: The Model Family Synthesis

GBM is the downstream champion in six of nine case studies even when it does not lead the IC table, because its lower variance, absence of checkpoint selection, and implementation robustness compound through the multi-stage pipeline from prediction to deployment. Deep learning finds its niche where signals live in temporal structure or nonlinear interactions -- CME futures shows the strongest nonlinearity diagnostic (negative linear IC, positive GBM IC) -- but an important coverage caveat applies: deep learning was not tested on the two strongest results. The practical default is to start with GBM everywhere and invest in deep learning only when positive evidence of structural nonlinearity exists.
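The nonlinearity diagnostic mentioned above (negative linear IC alongside positive GBM IC) can be illustrated on synthetic data. This is a toy sketch, not the CME analysis: the target depends on the feature only through x², a quadratic polynomial stands in for GBM, and `rank_ic` is a hypothetical helper.

```python
import numpy as np

def rank_ic(a, b):
    """Spearman rank correlation via double argsort (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Purely nonlinear relationship: a linear fit sees almost nothing,
# while a flexible fit (standing in for GBM) recovers it.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = x**2 + 0.5 * rng.normal(size=2000)

lin_pred = np.polyval(np.polyfit(x, y, 1), x)    # linear model stand-in
flex_pred = np.polyval(np.polyfit(x, y, 2), x)   # nonlinear stand-in
lin_ic, flex_ic = rank_ic(lin_pred, y), rank_ic(flex_pred, y)
print(f"linear IC {lin_ic:+.3f}, nonlinear IC {flex_ic:+.3f}")
```

When this gap appears in real diagnostics, it is the positive evidence of structural nonlinearity that the section says should precede any investment in deep learning.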

20.5

Trading Realism: Costs, Capacity, and Execution

Five cost-survival tiers emerge from breakeven analysis, ranging from extremely robust (US firm characteristics survives above 100 bps) to fatal (S&P 500 options is negative at zero assumed friction). Cost fragility is largely predictable from two inputs: rebalancing cadence and universe liquidity. The section also demonstrates that blanket basis-point cost models are structurally wrong for options (where costs scale with premium), micro-cap equities (where spreads vary by orders of magnitude), and high-frequency strategies (where per-share models are needed). The strongest paper signals tend to concentrate where capacity is most constrained, creating a fundamental tension between signal strength and deployability.

1 notebook
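The breakeven analysis behind the five tiers reduces to one ratio, sketched below. The function and its inputs are illustrative assumptions (a flat per-unit-turnover cost), exactly the kind of blanket model the section warns is wrong for options, micro-caps, and high-frequency strategies.

```python
def breakeven_cost_bps(gross_annual_return, annual_turnover):
    """Cost per unit of turnover (in bps) that drives net return to zero.

    net = gross - turnover * cost, so breakeven cost = gross / turnover.
    A crude flat-cost sketch; real models must be instrument-specific.
    """
    return 1e4 * gross_annual_return / annual_turnover

# Illustrative: the same gross return survives very different cost levels
# depending on turnover.
print(breakeven_cost_bps(0.12, 6.0))     # low turnover: ~200 bps breakeven
print(breakeven_cost_bps(0.12, 500.0))   # high turnover: ~2.4 bps breakeven
```

This is why cost fragility is largely predictable from cadence and liquidity alone: turnover sits in the denominator, so a high-cadence strategy needs a proportionally larger gross edge to survive.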

20.6

Stability Across Time and Regimes

Holdout Sharpe decay ranges from +45% (FX improves) to -247% (crypto reverses completely), with a median near 50% across the nine studies. Three distinct failure modes are identified: prediction decay (the signal genuinely weakens, as in US equities), translation decay (IC improves but portfolio construction loses value, as in ETFs), and structural break (regime shift invalidates learned patterns, as in crypto). Risk overlays are shown to be conditional rather than universal -- they help when drawdowns correlate with observable risk signals (US firm characteristics: managed Sharpe +3.92) but hurt when the signal reversal is structural (crypto). The taxonomy directs practitioners to the right lever rather than applying a generic fix.

2 notebooks

20.7

Causal Credibility: What Can We Actually Claim?

Chapter 15's double machine learning analysis produces a mixed causal scorecard: two studies achieve robust causal evidence (ETFs at 30% confounding bias, FX at 60%), four are suggestive with substantial bias, and three are inconclusive. The section argues that confounding bias should be reported alongside Sharpe as a complementary fragility indicator -- high bias means the signal is heavily exposed to shifts in momentum, volatility, and market factors, making it more likely to break when those relationships change. Predictive signals without causal identification are usable with appropriate risk management, but high-bias strategies deserve tighter risk budgets and more frequent re-evaluation.
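Reporting confounding bias alongside Sharpe only needs a one-line summary statistic. The definition below (share of the naive effect that disappears after DML adjustment) is one plausible formalization I am assuming for illustration; the book's exact definition may differ.

```python
def confounding_bias_pct(naive_effect, dml_effect):
    """Share of the naive effect explained away once confounders are
    controlled for. One plausible definition, assumed for illustration."""
    return 100.0 * abs(naive_effect - dml_effect) / abs(naive_effect)

# Illustrative effects: a signal whose naive estimate shrinks by ~30%
# after double ML adjusts for momentum/volatility/market exposure.
print(confounding_bias_pct(naive_effect=0.010, dml_effect=0.007))
```

Under this reading, a 60% figure means most of the apparent edge rides on relationships with the confounders, which is precisely why high-bias strategies deserve tighter risk budgets.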

20.8

What We Deliberately Left on the Table

Every result in the book is a deliberately constrained baseline: single pipeline pass, standard features, single hyperparameter sweep, no ensembles, simplest allocators. This section inventories these constraints and assesses their likely impact, identifying iteration (5-10 hypothesis-driven cycles) as the highest-leverage improvement, followed by feature engineering (domain-specific and alternative data), ensemble methods (model, horizon, and checkpoint averaging), and label engineering. The ensemble opportunity is particularly concrete: checkpoint ensembling directly addresses the S&P 500 equity-plus-options problem where CAE IC swings 0.14 across epochs. The constraints are the feature, not the bug -- they make comparisons informative while leaving substantial room for improvement.
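Checkpoint ensembling, the concrete opportunity named above, is simple to sketch. This toy example assumes each checkpoint's predictions equal a common signal plus checkpoint-specific noise (the source of IC swings across epochs); averaging the checkpoints shrinks that noise. The setup and `rank_ic` helper are illustrative assumptions.

```python
import numpy as np

def rank_ic(a, b):
    """Spearman rank correlation via double argsort (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical setup: future returns share a common predictable signal;
# each checkpoint recovers that signal plus its own training noise.
rng = np.random.default_rng(2)
signal = rng.normal(size=1000)
future_returns = signal + rng.normal(size=1000)
checkpoints = [signal + rng.normal(scale=2.0, size=1000) for _ in range(10)]

single_ics = [rank_ic(c, future_returns) for c in checkpoints]
ensemble_ic = rank_ic(np.mean(checkpoints, axis=0), future_returns)
print(f"single-checkpoint ICs: {min(single_ics):.3f}..{max(single_ics):.3f}")
print(f"ensemble IC: {ensemble_ic:.3f}")
```

Because the averaged prediction keeps the shared signal while the checkpoint noise partially cancels, the ensemble also removes the need to pick a single "best" epoch, which is the selection risk flagged in Section 20.2.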

20.9

The Practitioner's Playbook

The synthesis is distilled into a four-phase development sequence: data and diagnostics (validate quality, build cost models, choose labels deliberately), signal generation (run all model families, use the four-metric diagnostic bundle, check nonlinearity), strategy construction (test cadences, compare allocators, compute cost sensitivity, apply risk overlays), and validation with iteration (frozen holdout, Deflated Sharpe Ratio, failure-mode decomposition, 5-10 hypothesis-driven cycles). Per-case-study recommendations specify the next concrete step for each of the nine studies. The closing argument is that the pipeline is transferable but the results are not -- the reader's competitive advantage is iteration with their own data, features, and domain knowledge.