Chapter 6

Strategy Research Framework

7 sections · 4 notebooks · 13 references

Learning Objectives

  • Place a strategy idea on the strategy map by linking it to a strategy family, a plausible source of edge, and the dominant feasibility constraints and failure modes.
  • Define a versioned trading setup in decision-time terms: what is tradable, when decisions are made, what information is admissible, how scores become positions, and which constraints and costs are treated as material.
  • Define "better" economically and keep model diagnostics, signal diagnostics, and strategy outcomes in distinct roles during research and evaluation.
  • Design a time-series evaluation protocol that preserves chronology, prevents overlap leakage, and separates model selection from final performance estimation.
  • Establish a narrow baseline checkpoint with timing, coverage, and trading-intensity sanity checks before expanding the search space.
  • Keep search auditable, reproducible, and countable using a simple trial taxonomy and automatic run logging.
Figure 6.1

6.1

From Idea to Evidence with the ML4T Workflow

The live trading loop is defined as a five-step cycle (observe, score, map to positions, execute, monitor) alongside its research counterpart, which builds evidence about live behavior without shifting assumptions. Nine case studies spanning ETFs, equities, crypto, futures, FX, and options serve as scaffolds for the remainder of the book. The reader takes away a concrete mental model of how research iteration mirrors deployment and why timing discipline is the single most important guardrail.
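The five-step cycle can be sketched as plain functions wired into one decision step. This is a minimal illustration, not an API from the book: the function names, the toy scoring rule, and the frictionless execution are all assumptions made for the sketch.

```python
# Hypothetical sketch of the five-step live loop:
# observe -> score -> map to positions -> execute -> monitor.
from dataclasses import dataclass, field

@dataclass
class LoopState:
    prices: dict = field(default_factory=dict)     # latest observed prices
    positions: dict = field(default_factory=dict)  # current target weights
    log: list = field(default_factory=list)        # monitoring trail

def observe(state, market_data):
    state.prices.update(market_data)

def score(state):
    # Toy scoring rule for illustration only: +1 above 100, -1 below.
    return {sym: (1.0 if px > 100 else -1.0) for sym, px in state.prices.items()}

def map_to_positions(scores):
    # Equal-weight mapping on the sign of each score.
    n = max(len(scores), 1)
    return {sym: s / n for sym, s in scores.items()}

def execute(state, targets):
    state.positions = targets  # frictionless fills assumed in this sketch

def monitor(state):
    state.log.append(dict(state.positions))

def run_cycle(state, market_data):
    observe(state, market_data)
    targets = map_to_positions(score(state))
    execute(state, targets)
    monitor(state)
    return state
```

The research counterpart replays the same cycle over historical data; keeping both paths on the identical function boundary is what prevents assumptions from shifting between research and deployment.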

6.2

Mapping Strategies and Sources of Edge

A two-lens framework for evaluating strategy ideas before building models: strategy families (price-based, fundamental, microstructure, market mechanics) classify ideas by their dominant constraints, while sources of edge (risk compensation, liquidity provision, flow predictability, informational advantage) explain why returns might persist. The section documents why most published anomalies fail in practice — post-publication decay, implementation gaps, and definitional sensitivity — and teaches readers to answer four questions about any idea before committing to experimentation.
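One lightweight way to make the two-lens classification concrete is to record each idea against the two category sets before any modeling starts. The category names below come from the text; the `IdeaCard` structure itself is purely illustrative.

```python
# Hypothetical record for placing an idea on the strategy map.
from dataclasses import dataclass

FAMILIES = {"price-based", "fundamental", "microstructure", "market mechanics"}
EDGES = {"risk compensation", "liquidity provision",
         "flow predictability", "informational advantage"}

@dataclass(frozen=True)
class IdeaCard:
    name: str
    family: str        # strategy family: classifies by dominant constraints
    edge: str          # source of edge: why returns might persist
    constraints: tuple # dominant feasibility constraints / failure modes

    def __post_init__(self):
        if self.family not in FAMILIES:
            raise ValueError(f"unknown strategy family: {self.family}")
        if self.edge not in EDGES:
            raise ValueError(f"unknown source of edge: {self.edge}")

# Example entry (illustrative values):
idea = IdeaCard("ETF momentum", "price-based", "risk compensation",
                ("post-publication decay", "turnover costs"))
```

Forcing every idea through both lenses before experimentation is a cheap filter: an idea with no plausible source of edge is not worth a backtest.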

6.3

Defining the Rules of the Trading Game

The trading setup is specified as the fixed evaluation environment: universe rules, decision schedule, score-to-trade mapping, constraints, and cost model class. The section distinguishes mechanics changes (which require a new setup version) from parameter tuning (which stays within a version), using a detailed ETF momentum example. Comparability across experiments depends on keeping the trading setup invariant and versioned.
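The mechanics-versus-tuning distinction can be encoded directly in a versioned setup record: any change to a mechanics field produces a new version, while tuning parameters live outside the record. Field names and example values below are illustrative assumptions, not the book's schema.

```python
# Hypothetical versioned trading-setup record.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TradingSetup:
    version: str
    universe: str            # e.g. "US sector ETFs"
    decision_schedule: str   # e.g. "monthly, at close"
    score_to_trade: str      # e.g. "rank top/bottom quintile, equal weight"
    constraints: tuple       # e.g. ("long-only", "max 20% per name")
    cost_model: str          # cost model class, e.g. "proportional, 10 bps"

def new_version(setup, bumped_to, **mechanics_changes):
    """Any mechanics change requires a new, explicitly bumped version."""
    return replace(setup, version=bumped_to, **mechanics_changes)

v1 = TradingSetup("v1", "US sector ETFs", "monthly, at close",
                  "rank top/bottom quintile, equal weight",
                  ("long-only",), "proportional, 10 bps")
v2 = new_version(v1, "v2", decision_schedule="weekly, at close")
```

Because the record is frozen, a mechanics change cannot silently mutate an existing version; results logged against `v1` stay comparable to each other and incomparable to `v2` by construction.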

1 notebook

6.4

Setting Objectives and Evaluation Metrics

A three-layer metric framework separates model diagnostics (can the model learn the label?), signal diagnostics (does the output behave like a tradable signal?), and strategy outcomes (does the process produce economic value under costs?). Using strategy-level outcomes to drive every micro-decision during development invites overfitting to simulator details. The reader learns to keep metric roles separate and reserve strategy-level evaluation for late-stage confirmation.
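The three layers can be kept in distinct roles by computing them in separate functions that never feed each other. The specific statistics below (information coefficient, autocorrelation, annualized net Sharpe) are common illustrative choices, not a prescription from the text.

```python
# Illustrative separation of the three metric layers on toy arrays.
import numpy as np

def model_diagnostics(pred, label):
    # Layer 1: can the model learn the label?
    return {"ic": float(np.corrcoef(pred, label)[0, 1])}

def signal_diagnostics(signal):
    # Layer 2: does the output behave like a tradable signal?
    # Here: persistence of the signal from one period to the next.
    auto = float(np.corrcoef(signal[:-1], signal[1:])[0, 1])
    return {"autocorr": auto}

def strategy_outcomes(returns, costs):
    # Layer 3: does the process produce economic value under costs?
    net = returns - costs
    sharpe = float(net.mean() / net.std(ddof=1)) * np.sqrt(252)
    return {"net_sharpe": sharpe}
```

Day-to-day modeling decisions are made on layer 1 and 2 outputs; layer 3 is computed rarely and late, so that simulator details cannot steer every micro-decision.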

6.5

Evaluation Protocol for Time Series

Five forms of data leakage (label, standardization, threshold, survivorship, point-in-time) explain why standard k-fold cross-validation fails on financial data. The section covers walk-forward CV with expanding and rolling windows, temporal buffers to prevent overlap leakage, sealed holdout test sets, nested walk-forward for rolling retuning, and combinatorial methods like CPCV. Evaluation design commitments — window lengths, step size, buffer sizes, test periods — are non-negotiable protocol choices rather than tuning parameters.
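A walk-forward split with an expanding window and a temporal buffer can be sketched in a few lines. This is a minimal illustration of the protocol shape, assuming index-based samples in chronological order; window, step, and buffer sizes are placeholders that would be fixed up front as protocol choices.

```python
# Minimal expanding-window walk-forward splitter with a temporal buffer
# between train and test to prevent overlap leakage.
def walk_forward_splits(n_samples, initial_train, test_size, buffer, step=None):
    step = step or test_size
    splits = []
    train_end = initial_train
    while train_end + buffer + test_size <= n_samples:
        train_idx = list(range(0, train_end))          # expanding window
        test_start = train_end + buffer                # buffer skips overlapping labels
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
        train_end += step
    return splits
```

A rolling-window variant would start `train_idx` at `train_end - fixed_length` instead of 0; either way, chronology is preserved because every test index strictly follows every train index plus the buffer.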

1 notebook

6.6

Establishing a Baseline Checkpoint

The baseline checkpoint is the smallest runnable specification that answers whether the trading setup supports enough stable structure to justify deeper work. Three preflight checks (timing sanity, coverage sanity, trading-intensity sanity) and a narrow first reference run earn the right to expand models and features in later chapters. A failed baseline usually calls for revising the setup rather than optimizing around a brittle definition.
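The three preflight checks can each be reduced to a one-line predicate. The thresholds below (95% coverage, a turnover band of 1% to 50% per rebalance) are illustrative assumptions, not values from the text.

```python
# Hypothetical preflight predicates for the baseline checkpoint.

def timing_sanity(decision_times, data_times):
    # Every piece of information must be available at or before its decision time.
    return all(d >= t for d, t in zip(decision_times, data_times))

def coverage_sanity(n_valid, n_expected, min_frac=0.95):
    # Enough of the universe has usable data at each decision point.
    return n_valid / n_expected >= min_frac

def trading_intensity_sanity(turnover, lo=0.01, hi=0.50):
    # Turnover per rebalance sits in a plausible band: not dead, not churning.
    return lo <= turnover <= hi
```

A baseline run that fails any predicate points back at the setup definition itself, which is exactly the "revise the setup, don't optimize around it" lesson of the section.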

6.7

Search Accounting and Run Logging

Every research iteration must be logged with provenance, configuration, artifact pointers, and decision gates to support comparability and reproducibility. A four-level trial taxonomy (strategy, trial family, trial, run) connects run logging to the Deflated Sharpe Ratio and pre-registration as defenses against selection bias. Without countable search, iteration becomes untraceable and selection bias becomes invisible.
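A run-log record following the four-level taxonomy can be as simple as a dictionary with a stable configuration fingerprint. All field names below are illustrative; the only commitments taken from the text are the four levels and the requirement that provenance, configuration, artifact pointers, and the decision gate all be captured.

```python
# Hypothetical run-log record for countable, reproducible search.
import datetime
import hashlib
import json

def config_hash(config):
    """Stable fingerprint so identical configurations count as one trial."""
    return hashlib.sha1(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]

def log_run(strategy, family, trial, config, artifacts, decision):
    return {
        "strategy": strategy,        # level 1: the overall strategy
        "trial_family": family,      # level 2: a group of related trials
        "trial": trial,              # level 3: one configuration under test
        "config_hash": config_hash(config),
        "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),  # level 4: this run
        "config": config,            # full configuration for reproducibility
        "artifacts": artifacts,      # pointers to saved models, predictions, reports
        "decision": decision,        # gate outcome, e.g. "promote" or "revise"
    }
```

Counting distinct `config_hash` values per trial family is what makes the number of trials auditable, which is the input a Deflated Sharpe Ratio calculation needs to correct for selection bias.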