ML4T Diagnostic
ML4T Diagnostic Documentation
Feature validation, strategy diagnostics, and Deflated Sharpe Ratio

Validate signals, models, and backtest results so you can tell whether strong performance is robust or just an artifact of leakage, overfitting, or multiple testing.

ml4t.diagnostic is the validation layer in the ML4T stack. Use it after feature engineering and before or alongside backtesting to answer practical questions such as "is this Sharpe real?", "are these features actually predictive?", and "what is driving my worst trades?" If you are new to the library, start with the Quickstart. If you are coming from Machine Learning for Trading, Third Edition, use the Book Guide to jump from notebooks to the production API.

Chapters 6-9 develop validation techniques manually. This library implements the same CPCV, DSR, HAC-adjusted IC, and feature triage workflows as tested, reusable functions. Chapters 16-19 add reporting, attribution, and trade-level SHAP. See the Book Guide for the exact notebook-to-API map.

  • Is Your Sharpe Real? Deflated Sharpe Ratio corrects for multiple testing. Check whether your best backtest survived selection bias. See Statistical Tests.

  • Purged Cross-Validation: CPCV and purged walk-forward with embargo and label-horizon handling. Validate without leakage between train and test sets. See Cross-Validation.

  • Feature And Trade Diagnostics: HAC-adjusted IC, importance analysis, drift checks, and SHAP-based trade diagnostics. Find out what is actually predictive and what is failing. See Feature Diagnostics.

  • From Book To API: The book develops these methods manually. The library packages them into reusable workflows for research and production reporting. See the Book Guide.

Quick Example

If you already have a model that looks good in a backtest, ValidatedCrossValidation is the fastest way to check whether it still looks credible after leakage-safe cross-validation and multiple-testing correction. The snippet assumes X, y, model, and times are already defined.

from ml4t.diagnostic import ValidatedCrossValidation
from ml4t.diagnostic.config import ValidatedCrossValidationConfig

config = ValidatedCrossValidationConfig(n_groups=10, n_test_groups=2, label_horizon=5)
vcv = ValidatedCrossValidation(config=config)
result = vcv.fit_evaluate(X, y, model, times=times)

print(f"Mean Sharpe: {result.mean_sharpe:.2f}")
print(f"DSR probability: {result.dsr:.4f}")
print(f"Significant: {result.is_significant}")
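
To give a feel for what n_groups=10 and n_test_groups=2 imply, here is a minimal standard-library sketch of the combinatorics behind CPCV in general. It does not call ml4t.diagnostic and makes no claim about how the library enumerates splits internally; the helper name cpcv_test_groups is hypothetical.

```python
from itertools import combinations
from math import comb


def cpcv_test_groups(n_groups, n_test_groups):
    """Enumerate every choice of test groups for a combinatorial CV scheme."""
    return list(combinations(range(n_groups), n_test_groups))


n_groups, n_test_groups = 10, 2
splits = cpcv_test_groups(n_groups, n_test_groups)
n_splits = len(splits)

# Every group lands in the test set the same number of times, and that count
# is the number of complete backtest paths that can be stitched together.
n_paths = n_splits * n_test_groups // n_groups

print(n_splits, n_paths)  # 45 9
```

With these settings a model is refit 45 times, which is why CPCV costs more than a single walk-forward pass but yields a distribution of out-of-sample paths rather than one.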

What You Can Validate Right Now

Your Model Looks Good. Is It Overfit?

Use Deflated Sharpe Ratio when you tested many variants and need to know whether the best result still looks real after selection bias.

from ml4t.diagnostic.evaluation.stats import deflated_sharpe_ratio

result = deflated_sharpe_ratio([strategy_a, strategy_b, strategy_c], frequency="daily")
print(f"Probability of skill: {result.probability:.3f}")
print(f"Expected max from noise: {result.expected_max_sharpe:.3f}")
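
The idea behind the correction can be sketched with the standard library alone. The block below is an illustrative implementation of the published Deflated Sharpe Ratio formula (Bailey and López de Prado), not the library's code: the function names, argument names, and the example inputs are assumptions, and all Sharpe ratios are per-period (not annualized).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected maximum Sharpe across n_trials pure-noise strategies."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    a = _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    b = _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def dsr_sketch(observed_sharpe, n_obs, n_trials, sharpe_variance,
               skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe exceeds the noise benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    # Standard error term adjusts for non-normal returns (skew, fat tails).
    denom = math.sqrt(1 - skew * observed_sharpe
                      + (kurtosis - 1) / 4 * observed_sharpe ** 2)
    z = (observed_sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


# Hypothetical inputs: daily Sharpe of 0.1 over 1250 days, with a
# cross-trial Sharpe standard deviation of 0.02.
variance = 0.02 ** 2
p_few = dsr_sketch(0.1, 1250, n_trials=10, sharpe_variance=variance)
p_many = dsr_sketch(0.1, 1250, n_trials=1000, sharpe_variance=variance)
print(f"10 trials: {p_few:.3f}, 1000 trials: {p_many:.3f}")
```

The same observed Sharpe earns a lower probability as the number of trials grows, because the benchmark it must beat rises with the size of the search.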

Are Your Features Actually Predictive?

Use HAC-adjusted Information Coefficient statistics when naive IC t-stats are too optimistic because the signal is autocorrelated across time.

from ml4t.diagnostic.evaluation.metrics import compute_ic_hac_stats

stats = compute_ic_hac_stats(ic_series, ic_col="ic")
print(f"Mean IC: {stats['mean_ic']:.4f}")
print(f"HAC t-stat: {stats['t_stat']:.2f}")
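
To see why the adjustment matters, the sketch below computes a Newey-West style t-statistic for the mean of a series using a Bartlett kernel. It is a generic standard-library illustration, not the implementation behind compute_ic_hac_stats; the function name hac_t_stat, the lag choice, and the synthetic IC series are assumptions.

```python
import math
from statistics import fmean


def hac_t_stat(series, max_lag):
    """t-statistic of the mean with a Bartlett-kernel long-run variance."""
    n = len(series)
    if n < 2 or not 0 <= max_lag < n:
        raise ValueError("need n >= 2 and 0 <= max_lag < n")
    mean = fmean(series)
    dev = [x - mean for x in series]

    def autocov(k):
        # Biased (divide by n) autocovariance keeps the estimate non-negative.
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    long_run_var = autocov(0)
    for k in range(1, max_lag + 1):
        weight = 1 - k / (max_lag + 1)  # Bartlett weights decay linearly
        long_run_var += 2 * weight * autocov(k)
    return mean / math.sqrt(long_run_var / n)


# Hypothetical slowly varying IC series: strongly autocorrelated by design.
ic = [0.02 + 0.05 * math.sin(2 * math.pi * t / 50) for t in range(500)]
naive_t = hac_t_stat(ic, max_lag=0)   # ignores autocorrelation
robust_t = hac_t_stat(ic, max_lag=5)  # accounts for the first five lags
print(f"naive t: {naive_t:.2f}, HAC t: {robust_t:.2f}")
```

With max_lag=0 the statistic reduces to the naive t-stat; on a positively autocorrelated series the HAC version is smaller because the effective sample size is lower than the raw count.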

Is Your Cross-Validation Leaking?

Use purged walk-forward or CPCV when forward labels and temporal dependence make standard KFold results unreliable.

from ml4t.diagnostic.splitters import WalkForwardCV

cv = WalkForwardCV(n_splits=5, train_size=252, test_size=63, label_horizon=5)
for train_idx, test_idx in cv.split(X):
    # Fit on the training window and evaluate on the held-out window here.
    pass
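
The purging step can be illustrated without the library. The sketch below builds rolling train/test windows and drops the last label_horizon samples before each test window, so that no training label overlaps the test period. It is one plausible reading of the parameters above, not WalkForwardCV's actual behavior; the function name and the window placement (test windows packed at the end of the sample) are assumptions.

```python
def purged_walk_forward(n_samples, n_splits, train_size, test_size, label_horizon):
    """Yield (train_idx, test_idx) lists with a purge gap before each test window."""
    first_test_start = n_samples - n_splits * test_size
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        test_end = test_start + test_size
        # Purge: a label at time t looks ahead label_horizon steps, so training
        # must stop label_horizon samples before the test window begins.
        train_end = test_start - label_horizon
        train_start = train_end - train_size
        if train_start < 0:
            raise ValueError("not enough samples for the requested windows")
        yield list(range(train_start, train_end)), list(range(test_start, test_end))


folds = list(purged_walk_forward(
    n_samples=1000, n_splits=5, train_size=252, test_size=63, label_horizon=5
))
for train_idx, test_idx in folds:
    print(train_idx[0], train_idx[-1], test_idx[0], test_idx[-1])
```

A standard KFold would place samples directly adjacent to the test window in the training set, which is exactly where forward-looking labels leak.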

What Is Driving Your Worst Trades?

Use trade-level SHAP diagnostics when summary metrics are not enough and you need to understand recurring failure modes in losing trades.

from ml4t.diagnostic.evaluation import TradeAnalysis, TradeShapAnalyzer

worst_trades = TradeAnalysis(trade_records).worst_trades(n=20)
result = TradeShapAnalyzer(model, features_df, shap_values).explain_worst_trades(worst_trades)
print(result.error_patterns[0].hypothesis)
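
Conceptually, this kind of diagnostic starts by selecting the largest losers and asking which features contributed most to the predictions behind them. The sketch below shows that aggregation step on toy data with precomputed SHAP values. It does not use TradeAnalysis or TradeShapAnalyzer, and the data layout, function name, and numbers are hypothetical.

```python
from collections import defaultdict


def mean_shap_for_worst_trades(trades, shap_rows, n):
    """Rank features by mean SHAP value across the n lowest-PnL trades."""
    if n < 1 or not trades:
        raise ValueError("need at least one trade and n >= 1")
    worst = sorted(trades, key=lambda t: t["pnl"])[:n]
    totals = defaultdict(float)
    for trade in worst:
        for feature, value in shap_rows[trade["id"]].items():
            totals[feature] += value
    means = {feature: total / len(worst) for feature, total in totals.items()}
    # Largest absolute average contribution first.
    return sorted(means.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy trade log and per-trade SHAP values (illustrative only).
trades = [
    {"id": "t1", "pnl": -120.0},
    {"id": "t2", "pnl": 45.0},
    {"id": "t3", "pnl": -80.0},
    {"id": "t4", "pnl": 10.0},
]
shap_rows = {
    "t1": {"momentum": 0.30, "volatility": -0.05},
    "t2": {"momentum": 0.10, "volatility": 0.02},
    "t3": {"momentum": 0.20, "volatility": 0.01},
    "t4": {"momentum": -0.05, "volatility": 0.03},
}
ranking = mean_shap_for_worst_trades(trades, shap_rows, n=2)
print(ranking[0][0])  # momentum
```

A feature that dominates the ranking for losing trades but not for winners is a candidate failure mode worth investigating.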

For full HTML reporting from normalized surfaces, BacktestResult, or saved run artifacts, see Backtest Tearsheets.

Four-Tier Validation Framework

This is the organizing structure behind the library. It keeps feature triage, signal validation, backtest credibility, and portfolio analysis in one coherent path.

| Tier | Stage | Focus | Example Problem Caught |
|------|-------|-------|------------------------|
| 1 | Pre-modeling | Feature importance, interactions, drift | A feature looks predictive in-sample but is unstable across regimes |
| 2 | During modeling | Predictions, calibration, stability | A model ranks signals inconsistently or loses IC after HAC adjustment |
| 3 | Post-modeling | Performance metrics, statistical validity | A strong Sharpe disappears after CPCV or DSR multiple-testing correction |
| 4 | Production | Portfolio composition, risk, attribution | Returns are concentrated in one exposure bucket or one recurring trade error mode |

Statistical Methods

These are the core methods the library uses to turn "looks good" into "survives scrutiny."

| Test | Purpose |
|------|---------|
| DSR | Deflated Sharpe Ratio for multiple-testing correction |
| RAS | Rademacher Anti-Serum for backtest overfitting detection |
| FDR | Benjamini-Hochberg adjustment for many simultaneous tests |
| HAC | Autocorrelation-robust IC significance testing |
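
As an illustration of the FDR row, the following standard-library sketch computes Benjamini-Hochberg adjusted p-values. It is a textbook version of the procedure, not the library's function; the name benjamini_hochberg and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from testing four features at once.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
print([round(p, 4) for p in adjusted], significant)
```

Each adjusted value is at least as large as its raw counterpart, so a feature that only just clears the threshold in isolation can fail once the number of simultaneous tests is taken into account.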

Installation

pip install ml4t-diagnostic

For SHAP workflows, Plotly reporting, and the ml4t-backtest bridge, see the Installation Guide for optional extras.

Where To Start

See It In The Book

ml4t.diagnostic is used throughout Machine Learning for Trading, Third Edition:

  • Ch06 for purged walk-forward CV and CPCV
  • Ch07 for HAC-adjusted IC, FDR, DSR, and PBO
  • Ch08-Ch09 for feature triage, robustness checks, and diagnostics
  • Ch16-Ch19 for performance reporting, allocator analysis, factor attribution, and trade-SHAP
  • Nine case studies under third_edition/code/case_studies/

Use the Book Guide when you want the exact notebook and case-study entry points.

Part of the ML4T Library Suite

ml4t-data -> ml4t-engineer -> ml4t-diagnostic -> ml4t-backtest -> ml4t-live

ml4t.diagnostic is the point in that workflow where you decide whether a signal, model, or backtest result is credible enough to carry forward.