ML4T Diagnostic
ML4T Diagnostic Documentation
Feature validation, strategy diagnostics, and Deflated Sharpe Ratio

Statistical Tests

ML4T Diagnostic implements statistical tests that reduce the risk of false discoveries and correct for multiple-testing bias.

Deflated Sharpe Ratio (DSR)

The DSR corrects the observed Sharpe ratio for the selection bias introduced by the number of backtests tried:

from ml4t.diagnostic.evaluation.stats import deflated_sharpe_ratio

result = deflated_sharpe_ratio(
    returns=strategy_returns,
    n_trials=100,             # How many strategies tested
    frequency='daily',        # Return frequency
    periods_per_year=252
)

print(f"Observed Sharpe: {result.sharpe_ratio:.2f}")
print(f"Deflated Sharpe: {result.deflated_sharpe:.2f}")
print(f"p-value: {result.p_value:.4f}")

When to Use DSR

  • After trying multiple strategy variations
  • When selecting among several candidate strategies
  • To report statistically honest performance

DSR Formula (López de Prado et al. 2025)

\[DSR = \Phi^{-1}\left(1 - e^{-\frac{1}{2}\gamma}\right)\]

where \(\gamma\) accounts for:

  • Number of trials
  • Expected maximum Sharpe under null hypothesis
  • Autocorrelation in returns

See Deflated Sharpe Ratio for details.
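
For orientation, the 2014 formulation by Bailey & López de Prado (see References) expresses the DSR as a probability: the observed per-period Sharpe ratio \(\widehat{SR}\) is compared against the expected maximum Sharpe ratio \(SR_0\) that \(N\) unskilled trials would produce, with a correction for skewness \(\hat\gamma_3\) and kurtosis \(\hat\gamma_4\) over \(T\) observations:

\[DSR = \Phi\left(\frac{(\widehat{SR} - SR_0)\sqrt{T - 1}}{\sqrt{1 - \hat\gamma_3 \widehat{SR} + \frac{\hat\gamma_4 - 1}{4}\widehat{SR}^2}}\right)\]

The sketch below implements that formulation with the Python standard library only. The function names and the `var_sr_trials` input (the variance of Sharpe ratios across trials) are illustrative assumptions, not the library's API, and `deflated_sharpe_ratio` may differ in its details.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(var_sr_trials: float, n_trials: int) -> float:
    """Expected maximum per-period Sharpe ratio of n_trials unskilled strategies."""
    if n_trials < 2:
        return 0.0  # a single trial carries no selection bias
    z = NormalDist()
    return math.sqrt(var_sr_trials) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe_sketch(sr, n_obs, n_trials, var_sr_trials, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe ratio exceeds the expected maximum under the null.

    sr and var_sr_trials are in per-period (non-annualized) units;
    kurt is the raw kurtosis (3.0 for normal returns).
    """
    sr0 = expected_max_sharpe(var_sr_trials, n_trials)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Hypothetical inputs: daily Sharpe of 0.1 over 1250 days, selected from 100 trials.
dsr_single = deflated_sharpe_sketch(0.1, 1250, n_trials=1, var_sr_trials=0.001)
dsr_multi = deflated_sharpe_sketch(0.1, 1250, n_trials=100, var_sr_trials=0.001)
print(f"DSR with 1 trial: {dsr_single:.3f}, with 100 trials: {dsr_multi:.3f}")
```

With a single trial the benchmark \(SR_0\) is zero, so the statistic reduces to the Probabilistic Sharpe Ratio against a zero benchmark. Raising the trial count raises \(SR_0\) and lowers the DSR.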

Rademacher Anti-Serum (RAS)

RAS detects backtest overfitting using Rademacher complexity, which measures how well a set of strategies can fit pure noise:

from ml4t.diagnostic.evaluation.stats import rademacher_complexity, ras_sharpe_adjustment

# returns_matrix shape: (n_periods, n_strategies)
complexity = rademacher_complexity(returns_matrix)
observed_sharpes = returns_matrix.mean(axis=0) / returns_matrix.std(axis=0)
result = ras_sharpe_adjustment(
    observed_sharpe=observed_sharpes,
    complexity=complexity,
    n_samples=returns_matrix.shape[0],
    n_strategies=returns_matrix.shape[1],
    return_result=True,
)

print(f"Number significant after RAS: {result.n_significant}")
print(f"Complexity penalty: {result.complexity:.4f}")

Interpretation

RAS Result | Interpretation
--- | ---
High RAS | Strategy is likely robust, not overfit
Low RAS | Strategy may be overfit
Negative RAS | Strategy is likely spurious
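
To make the complexity term concrete, the sketch below estimates the empirical Rademacher complexity by Monte Carlo: it draws random ±1 sign vectors and records how well the best strategy in the set correlates with each one. The function name, the simulated noise matrix, and the haircut of twice the complexity are illustrative assumptions. The full RAS bound in Paleologo (2024) subtracts a further estimation-error term that is omitted here, so this is not a reimplementation of `rademacher_complexity` or `ras_sharpe_adjustment`.

```python
import random


def empirical_rademacher(matrix, n_draws=200, seed=0):
    """Monte Carlo estimate of E_sigma[max_n (1/T) * sum_t sigma_t * x[t][n]].

    matrix is a list of T rows, each holding one value per strategy.
    """
    rng = random.Random(seed)
    n_periods = len(matrix)
    n_strategies = len(matrix[0])
    total = 0.0
    for _ in range(n_draws):
        # One random sign per period, shared by all strategies.
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n_periods)]
        total += max(
            sum(s * row[j] for s, row in zip(signs, matrix)) / n_periods
            for j in range(n_strategies)
        )
    return total / n_draws


# Hypothetical data: 100 periods of pure noise for 1 and for 20 strategies.
data_rng = random.Random(42)
noise = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(100)]
one_strategy = [[row[0]] for row in noise]

complexity_1 = empirical_rademacher(one_strategy)
complexity_20 = empirical_rademacher(noise)

# Data-snooping haircut only: observed mean minus twice the complexity.
observed_means = [sum(row[j] for row in noise) / len(noise) for j in range(20)]
adjusted = [m - 2 * complexity_20 for m in observed_means]
print(f"Complexity, 1 strategy: {complexity_1:.4f}; 20 strategies: {complexity_20:.4f}")
```

A larger or more diverse strategy set can match random signs more closely, so its complexity and the resulting haircut grow.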

Minimum Track Record Length (MinTRL)

Calculate how many observations a track record needs before its Sharpe ratio is statistically significant:

from ml4t.diagnostic.evaluation.stats import compute_min_trl

result = compute_min_trl(
    sharpe_ratio=1.5,
    target_pvalue=0.05,
    frequency='daily'
)

print(f"Minimum observations: {result.min_observations}")
print(f"Minimum years: {result.min_years:.1f}")
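
As a sanity check on the output, the closed-form MinTRL from Bailey & López de Prado (2012) can be evaluated by hand. The sketch below assumes that formula, converts the annualized Sharpe ratio to per-period units using 252 periods per year, and defaults to normal returns. The function name and parameters are illustrative and are not the signature of `compute_min_trl`.

```python
import math
from statistics import NormalDist


def min_trl_sketch(sr_annual, periods_per_year=252, target_pvalue=0.05,
                   benchmark_sr_annual=0.0, skew=0.0, kurt=3.0):
    """Minimum number of observations for SR to exceed the benchmark at the given level."""
    sr = sr_annual / math.sqrt(periods_per_year)  # per-period Sharpe ratio
    sr_star = benchmark_sr_annual / math.sqrt(periods_per_year)
    if sr <= sr_star:
        raise ValueError("observed Sharpe ratio must exceed the benchmark")
    z = NormalDist().inv_cdf(1 - target_pvalue)  # one-sided critical value
    # Variance factor of the Sharpe estimator; kurt is raw kurtosis (3.0 if normal).
    variance_factor = 1 - skew * sr + (kurt - 1) / 4 * sr**2
    return 1 + variance_factor * (z / (sr - sr_star)) ** 2


n_obs = min_trl_sketch(1.5)
print(f"Minimum observations: {math.ceil(n_obs)}")
print(f"Minimum years: {n_obs / 252:.1f}")
```

Because the required length scales with the inverse square of the Sharpe ratio, halving the Sharpe ratio roughly quadruples the track record needed.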

MinTRL with Multiple Testing

For significance that controls the family-wise error rate (FWER) across multiple strategies:

from ml4t.diagnostic.evaluation.stats import min_trl_fwer

result = min_trl_fwer(
    sharpe_ratio=1.5,
    num_trials=50,
    alpha=0.05
)
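
One simple way to see how multiple testing lengthens the required track record is a Bonferroni correction, where each of the trials is tested at `alpha / num_trials`. The sketch below assumes that correction, normal returns, and a zero benchmark. `min_trl_fwer` may use a different adjustment, so treat this as an illustration of the effect rather than of the library's method.

```python
import math
from statistics import NormalDist


def min_trl_bonferroni_sketch(sr_annual, num_trials, alpha=0.05, periods_per_year=252):
    """MinTRL under normal returns with a Bonferroni-adjusted significance level."""
    if sr_annual <= 0:
        raise ValueError("Sharpe ratio must be positive")
    sr = sr_annual / math.sqrt(periods_per_year)  # per-period Sharpe ratio
    z = NormalDist().inv_cdf(1 - alpha / num_trials)  # stricter per-test threshold
    return 1 + (1 + 0.5 * sr**2) * (z / sr) ** 2


single = min_trl_bonferroni_sketch(1.5, num_trials=1)
multi = min_trl_bonferroni_sketch(1.5, num_trials=50)
print(f"1 trial: {single / 252:.1f} years; 50 trials: {multi / 252:.1f} years")
```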

False Discovery Rate (FDR)

Control the expected proportion of false positives among the hypotheses you reject:

from ml4t.diagnostic.evaluation.stats import benjamini_hochberg_fdr

pvalues = [0.01, 0.03, 0.05, 0.08, 0.12]
rejected = benjamini_hochberg_fdr(p_values=pvalues, alpha=0.05)

# Identify discoveries
discoveries = rejected

Methods

Method | Description
--- | ---
bh | Benjamini-Hochberg (controls FDR)
by | Benjamini-Yekutieli (controls FDR under arbitrary dependence; more conservative)
holm | Holm-Bonferroni (controls FWER)
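
The Benjamini-Hochberg step-up rule itself is short: sort the p-values, find the largest rank k whose p-value is at most alpha * k / m, and reject the k smallest. The sketch below is a standalone illustration of that rule. Its name and return type (a list of booleans in the input order) are assumptions and may not match what `benjamini_hochberg_fdr` returns.

```python
def benjamini_hochberg_sketch(p_values, alpha=0.05):
    """Return one boolean per input p-value: True if the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank that passes its threshold.
        if p_values[idx] <= alpha * rank / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


pvalues = [0.01, 0.03, 0.05, 0.08, 0.12]
flags = benjamini_hochberg_sketch(pvalues, alpha=0.05)
discoveries = [p for p, keep in zip(pvalues, flags) if keep]
print(f"Discoveries: {discoveries}")
```

Note that a p-value below alpha is not automatically a discovery: 0.03 fails here because the threshold at rank 2 of 5 is 0.02.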

HAC-Adjusted Statistics

Account for heteroskedasticity and autocorrelation (HAC) when testing the information coefficient (IC):

from ml4t.diagnostic.evaluation.stats import hac_adjusted_ic

result = hac_adjusted_ic(
    predictions=predictions,
    returns=forward_returns,
    return_details=True,
)

print(f"HAC t-stat: {result['t_stat']:.2f}")
print(f"HAC std error: {result['bootstrap_std']:.4f}")
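
To show what a HAC adjustment does, the sketch below computes a Newey-West t-statistic for the mean of a per-period IC series using Bartlett kernel weights. This is a generic textbook estimator under an assumed, user-chosen lag length. The function name and the sample series are hypothetical, and `hac_adjusted_ic` may select lags or estimate the standard error differently.

```python
import math


def newey_west_tstat(series, max_lag):
    """Return (t_stat, std_err) for the mean of series with Newey-West errors."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]

    def autocov(lag):
        # Sample autocovariance at the given lag (divides by n, not n - lag).
        return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n

    long_run_var = autocov(0)
    for lag in range(1, max_lag + 1):
        weight = 1 - lag / (max_lag + 1)  # Bartlett kernel weight
        long_run_var += 2 * weight * autocov(lag)
    std_err = math.sqrt(long_run_var / n)
    return mean / std_err, std_err


# Hypothetical per-period IC values with visible positive autocorrelation.
ic_series = [0.05, 0.06, 0.07, 0.06, 0.05, 0.02, 0.00, -0.01, 0.00, 0.02, 0.04, 0.06]
t_naive, se_naive = newey_west_tstat(ic_series, max_lag=0)
t_hac, se_hac = newey_west_tstat(ic_series, max_lag=1)
print(f"Naive t-stat: {t_naive:.2f}, HAC t-stat: {t_hac:.2f}")
```

With `max_lag=0` the estimator collapses to the usual standard error of the mean. Positive autocorrelation inflates the HAC standard error and shrinks the t-statistic.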

Probability of Backtest Overfitting (PBO)

Estimate the probability that the strategy selected as best in-sample ranks below the median out-of-sample:

from ml4t.diagnostic.evaluation.stats import compute_pbo

result = compute_pbo(
    is_performance=is_returns_matrix,
    oos_performance=oos_returns_matrix,
)

print(f"PBO: {result.pbo:.1%}")  # e.g., "32.5%"
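
For intuition about what the PBO number measures, the sketch below follows the combinatorially symmetric cross-validation (CSCV) idea: split the history into blocks, use every half of the blocks as in-sample, pick the in-sample winner, and record where it ranks out-of-sample. PBO is the share of splits where the winner lands at or below the median. The function name, the use of mean return as the performance measure, and the single returns-matrix input are assumptions. `compute_pbo` takes separate in-sample and out-of-sample inputs and may work differently.

```python
import math
from itertools import combinations


def pbo_cscv_sketch(returns, n_blocks=4):
    """Estimate PBO from a T x N list of per-period returns (n_blocks must be even)."""
    n_strategies = len(returns[0])
    block_size = len(returns) // n_blocks
    blocks = [range(b * block_size, (b + 1) * block_size) for b in range(n_blocks)]

    def mean_perf(rows, j):
        return sum(returns[t][j] for t in rows) / len(rows)

    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [t for b in is_ids for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]
        is_perf = [mean_perf(is_rows, j) for j in range(n_strategies)]
        oos_perf = [mean_perf(oos_rows, j) for j in range(n_strategies)]
        best = max(range(n_strategies), key=is_perf.__getitem__)
        # OOS rank of the IS winner: 1 = worst, n_strategies = best.
        rank = sum(p <= oos_perf[best] for p in oos_perf)
        omega = rank / (n_strategies + 1)  # relative rank, strictly inside (0, 1)
        logits.append(math.log(omega / (1 - omega)))
    # Share of splits where the IS winner is at or below the OOS median.
    return sum(value <= 0 for value in logits) / len(logits)


# Hypothetical data: strategy 0 beats the others in every period.
dominant = [[0.02, 0.0, -0.01] for _ in range(8)]
print(f"PBO: {pbo_cscv_sketch(dominant):.1%}")
```

A strategy that wins in every block keeps its top rank out-of-sample, so its PBO is zero. Strategies whose in-sample edge is noise drift toward the middle of the out-of-sample ranking and push the PBO up.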

Interpretation

PBO | Interpretation
--- | ---
< 10% | Low overfitting risk
10-30% | Moderate risk
> 30% | High overfitting risk

See It In The Book

These statistical tests appear repeatedly in the book:

  • FDR, DSR, MinTRL, and PBO: code/07_defining_learning_task/07_multiple_testing.py
  • HAC-adjusted IC in causal and robustness checks: code/07_defining_learning_task/08_causal_sanity_checks.py
  • HAC-adjusted IC plus FDR in the case studies: code/case_studies/*/05_evaluation.py
  • DSR on real backtest returns: code/16_strategy_simulation/12_dsr_validation.py
  • Sharpe inference and RAS workflow: code/16_strategy_simulation/11_sharpe_ratio_inference.py, code/16_strategy_simulation/13_ras_protocol.py

For the chapter-level map, see the Book Guide.

References

  • López de Prado et al. (2025). "How to Use the Sharpe Ratio"
  • Bailey & López de Prado (2014). "The Deflated Sharpe Ratio"
  • Paleologo, G. (2024). Elements of Quantitative Investing