ML4T Diagnostic Documentation
Feature validation, strategy diagnostics, and Deflated Sharpe Ratio

Four-Tier Validation Framework

ML4T Diagnostic implements a comprehensive Four-Tier Validation Framework to prevent overfitting, data leakage, and false discoveries in quantitative trading strategies.

See it in the book

Ch06 covers leakage-aware cross-validation, Ch07 covers statistical inference, Ch16-Ch17 cover backtest and portfolio diagnostics, and Ch19 covers trade and risk analysis. Use the Book Guide for the exact chapter and case-study map.

Overview

┌─────────────────────────────────────────────────────────────────┐
│                    FOUR-TIER VALIDATION                         │
├─────────────────────────────────────────────────────────────────┤
│  Tier 1: Feature Analysis     │  Pre-modeling diagnostics       │
│  Tier 2: Model Diagnostics    │  During/after modeling          │
│  Tier 3: Backtest Analysis    │  Post-modeling validation       │
│  Tier 4: Portfolio Analysis   │  Production monitoring          │
└─────────────────────────────────────────────────────────────────┘

Tier 1: Feature Analysis

When: Before training any model
Goal: Ensure features have predictive power and are suitable for modeling

Key Analyses

| Analysis | Purpose |
| --- | --- |
| Information Coefficient (IC) | Correlation between features and future returns |
| IC Decay | How quickly predictive power decays over time |
| Feature Importance | MDI, PFI, MDA, SHAP rankings |
| Feature Interactions | H-statistic for detecting interactions |
| Stationarity Tests | ADF, KPSS, Phillips-Perron |
| Distribution Analysis | Jarque-Bera, tail risk |
| Drift Detection | PSI, Wasserstein distance |
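The Population Stability Index in the last row can be sketched from first principles: bin a reference sample, compare live bin frequencies against it, and sum the weighted log-ratio terms. This is an illustrative, from-scratch sketch, not the library's drift-detection API; the `psi` function, toy data, and the conventional 0.1/0.25 thresholds are all assumptions here.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample.

    Common rule of thumb: PSI < 0.1 suggests no drift, 0.1-0.25 moderate
    drift, and > 0.25 significant drift.
    """
    # Bin edges come from the reference distribution's quantiles, with
    # open outer bins so no live observation falls outside the grid.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p = np.histogram(expected, edges)[0] / len(expected)
    q = np.histogram(actual, edges)[0] / len(actual)
    # Clip away empty bins to avoid division by zero in the log-ratio.
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(4)
reference = rng.normal(size=5000)
shifted = rng.normal(loc=1.0, size=5000)  # feature whose mean has drifted
print(f"PSI (same distribution): {psi(reference, rng.normal(size=5000)):.3f}")
print(f"PSI (mean shift):        {psi(reference, shifted):.3f}")
```

A one-standard-deviation mean shift pushes PSI well past the 0.25 "significant drift" threshold, while a fresh draw from the same distribution stays near zero.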

Example

from ml4t.diagnostic.evaluation import FeatureDiagnostics

fd = FeatureDiagnostics()
result = fd.run_diagnostics(features_df["feature_1"], name="feature_1")

# Check feature quality
print(result.summary())
print(result.health_score)
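For intuition, the Information Coefficient that `FeatureDiagnostics` reports can be sketched by hand as a Spearman rank correlation between a feature and forward returns. The toy data below is synthetic and the computation is illustrative, not the library's implementation:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy data: one feature and next-period returns with a planted signal.
feature = pd.Series(rng.normal(size=500), name="feature_1")
forward_returns = 0.3 * feature + rng.normal(scale=1.0, size=500)

# Information Coefficient: rank correlation between the feature and
# the returns it is supposed to predict.
ic, p_value = spearmanr(feature, forward_returns)
print(f"IC: {ic:.3f} (p = {p_value:.3g})")
```

In live research the IC is usually computed per period and then summarized (mean IC, IC t-stat, IC decay across horizons) rather than pooled as in this sketch.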

Tier 2: Signal and Model Diagnostics

When: During and after model training
Goal: Validate predictions, factor behavior, and stability

Key Analyses

| Analysis | Purpose |
| --- | --- |
| Signal IC | Cross-sectional predictive power |
| Quantile Spreads | Economic monotonicity and long-short separation |
| Turnover / Half-Life | Stability and trading cost pressure |
| Feature Importance | MDI, PFI, SHAP, interaction structure |
| Event / Barrier Analysis | Targeted model-behavior diagnostics |

Example

from ml4t.diagnostic import analyze_signal

signal_result = analyze_signal(
    factor=factor_df,
    prices=prices_df,
    periods=(1, 5, 21),
)

print(signal_result.summary())
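The quantile-spread diagnostic from the table above can be sketched from scratch: bucket assets by factor score, average forward returns within each bucket, and take the top-minus-bottom spread. Everything below (data, bucketing scheme, quintile count) is an illustrative assumption, not the `analyze_signal` implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_assets = 200

# Toy cross-section: factor scores and next-period returns in which the
# factor carries a weak positive signal.
factor = pd.Series(rng.normal(size=n_assets))
fwd_ret = pd.Series(0.01 * factor + rng.normal(scale=0.02, size=n_assets))

# Bucket assets into quintiles by factor score and average forward
# returns within each bucket.
quintile = pd.qcut(factor, 5, labels=False)
mean_ret = fwd_ret.groupby(quintile).mean()

# Long-short spread: top quintile minus bottom quintile. A monotonic
# increase across buckets supports the factor's economic value.
spread = mean_ret.iloc[-1] - mean_ret.iloc[0]
print(mean_ret)
print(f"Q5-Q1 spread: {spread:.4f}")
```

A positive spread alone is not enough; monotonicity across all buckets and stability of the spread over time are what distinguish a usable factor from noise.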

Tier 3: Backtest Analysis

When: After backtesting
Goal: Validate that performance is not due to overfitting

Key Analyses

| Analysis | Purpose |
| --- | --- |
| Deflated Sharpe Ratio (DSR) | Adjust for multiple testing |
| Probability of Backtest Overfitting (PBO) | Likelihood of overfitting |
| Rademacher Anti-Serum (RAS) | Complexity-adjusted returns |
| Minimum Track Record Length | Required history for significance |
| Trade Error Analysis | SHAP-based failure patterns |

Example

from ml4t.diagnostic.evaluation.stats import deflated_sharpe_ratio

result = deflated_sharpe_ratio(
    returns=strategy_returns,
    n_trials=100,
    frequency="daily",
)

print(f"Probability of skill: {result.probability:.1%}")
print(f"Deflated Sharpe: {result.deflated_sharpe:.4f}")
print(f"p-value: {result.p_value:.4f}")
print(f"Is significant: {result.is_significant}")
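To see what the library call is doing, here is a minimal from-scratch sketch of the Deflated Sharpe Ratio following Bailey and López de Prado: estimate the expected maximum Sharpe ratio under the number of trials, then evaluate the Probabilistic Sharpe Ratio at that deflation threshold. The function name, the `sr_variance` input (variance of Sharpe ratios across the trials), and the toy returns are all assumptions for illustration, not the `deflated_sharpe_ratio` implementation:

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def deflated_sharpe_sketch(returns, n_trials, sr_variance):
    """Illustrative Deflated Sharpe Ratio (Bailey & Lopez de Prado)."""
    returns = np.asarray(returns)
    T = len(returns)
    sr = returns.mean() / returns.std(ddof=1)  # per-period Sharpe

    # Expected maximum Sharpe across n_trials independent trials,
    # using the Euler-Mascheroni constant (~0.5772).
    gamma = 0.5772156649
    sr0 = np.sqrt(sr_variance) * (
        (1 - gamma) * norm.ppf(1 - 1 / n_trials)
        + gamma * norm.ppf(1 - 1 / (n_trials * np.e))
    )

    # Probabilistic Sharpe Ratio at the deflation threshold sr0,
    # adjusting for skewness and (Pearson) kurtosis of the returns.
    g3 = skew(returns)
    g4 = kurtosis(returns, fisher=False)  # normal distribution => 3
    return norm.cdf(
        (sr - sr0) * np.sqrt(T - 1)
        / np.sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr**2)
    )

rng = np.random.default_rng(2)
daily = rng.normal(loc=0.001, scale=0.01, size=1000)  # toy daily returns
dsr = deflated_sharpe_sketch(daily, n_trials=100, sr_variance=0.01)
print(f"DSR: {dsr:.3f}")
```

The key input is `n_trials`: the more configurations you tried before settling on a strategy, the higher the deflation threshold, and the stronger the observed Sharpe must be to count as evidence of skill.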

Tier 4: Portfolio Analysis

When: In production
Goal: Monitor live performance and risk

Key Analyses

| Analysis | Purpose |
| --- | --- |
| Rolling Metrics | Sharpe, Sortino, Calmar over time |
| Drawdown Analysis | Max drawdown, recovery time |
| Risk Attribution | Factor exposures |
| Performance Attribution | By sector, region, etc. |

Example

from ml4t.diagnostic.evaluation import PortfolioAnalysis

analysis = PortfolioAnalysis(returns=strategy_returns, benchmark=benchmark_returns)
metrics = analysis.compute_summary_stats()

print(metrics.sharpe_ratio)
print(metrics.max_drawdown)
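Two of the monitoring metrics above, a rolling annualized Sharpe ratio and the drawdown series behind max drawdown, can be sketched directly with pandas. The synthetic return series and 252-day window below are illustrative assumptions, not the `PortfolioAnalysis` internals:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
returns = pd.Series(
    rng.normal(loc=0.0005, scale=0.01, size=756),
    index=pd.bdate_range("2022-01-03", periods=756),
)

# Rolling annualized Sharpe over a one-year (252 trading day) window.
window = 252
rolling_sharpe = (
    returns.rolling(window).mean() / returns.rolling(window).std() * np.sqrt(252)
)

# Drawdown series: percentage decline from the running peak of the
# cumulative return curve; its minimum is the max drawdown.
equity = (1 + returns).cumprod()
drawdown = equity / equity.cummax() - 1

print(f"Latest rolling Sharpe: {rolling_sharpe.iloc[-1]:.2f}")
print(f"Max drawdown: {drawdown.min():.1%}")
```

In production these series are typically recomputed on every new observation and alerted on, rather than inspected once as in this sketch.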

How This Relates to the Evaluator

The library-level research workflow above is four-tiered. The Evaluator class is a separate, lower-level validation orchestrator that implements a three-tier statistical evaluation framework around splitters, metrics, and tests.

Use Evaluator when you want a configurable validation engine for model evaluation:

from ml4t.diagnostic import Evaluator

evaluator = Evaluator(tier=2)
result = evaluator.evaluate(X, y, model)

print(result.summary())

Use the higher-level workflows like analyze_signal(), FeatureDiagnostics, ValidatedCrossValidation, and PortfolioAnalysis when you want task-specific APIs instead of a generic evaluation engine.