ML4T Diagnostic
ML4T Diagnostic Documentation
Feature validation, strategy diagnostics, and Deflated Sharpe Ratio

Statistical Tests

ML4T Diagnostic implements statistical tests that limit false discoveries and correct for multiple-testing bias when many features or strategies are evaluated on the same data.

Deflated Sharpe Ratio (DSR)

DSR asks whether the selected leader still looks credible after accounting for the fact that you searched across multiple candidates.

from ml4t.diagnostic.evaluation.stats import deflated_sharpe_ratio, effective_number_of_trials

# returns_matrix shape: (n_periods, n_strategies)
k_eff = effective_number_of_trials(returns_matrix, method="effective_rank")

result = deflated_sharpe_ratio(
    returns=returns_matrix,
    frequency="daily",
    periods_per_year=252,
    correlation_method="effective_rank",
    min_k_eff=2.0,
)

print(f"Raw trials: {result.n_trials_raw}")
print(f"Effective trials: {result.n_trials_effective:.2f}")
print(f"Observed Sharpe: {result.sharpe_ratio:.2f}")
print(f"Deflated Sharpe: {result.deflated_sharpe:.2f}")
print(f"p-value: {result.p_value:.4f}")

When to Use DSR

  • After trying multiple strategy variations
  • When those variants are strongly correlated, so that the raw trial count K would over-penalize the leader
  • When selecting among several candidate strategies
  • To report statistically honest performance

DSR Formula (López de Prado et al. 2025)

For multiple strategies, DSR replaces the naive null SR_0 = 0 with an expected-maximum Sharpe threshold:

\[ SR_0 = E[\max\{\widehat{SR}_k\}] \]

and then evaluates the leader's Sharpe ratio relative to that harder hurdle. When you provide correlation_method=..., the library uses K_eff instead of raw K in the False Strategy Theorem adjustment.
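The hurdle above can be sketched with the closed-form approximation from Bailey & López de Prado (2014). This is a minimal illustration, not the library's implementation: the helper names `expected_max_sharpe` and `deflated_sharpe_probability` are hypothetical, Sharpe ratios are assumed to be per-period (not annualized), and `n_trials` may be either raw K or K_eff.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: float, sharpe_variance: float) -> float:
    """Approximate E[max SR_k] over n_trials zero-skill strategies.

    sharpe_variance is the cross-sectional variance of the trial Sharpe ratios.
    """
    if n_trials <= 1:
        return 0.0  # a single trial involves no selection, so the hurdle is zero
    norm = NormalDist()
    a = norm.inv_cdf(1.0 - 1.0 / n_trials)
    b = norm.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe_probability(sharpe, sr0, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe ratio exceeds the hurdle sr0."""
    denom = math.sqrt(1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe**2)
    z = (sharpe - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Illustrative inputs (assumed, not taken from the library or the book)
sr0 = expected_max_sharpe(n_trials=100, sharpe_variance=0.05**2)
prob = deflated_sharpe_probability(sharpe=0.10, sr0=sr0, n_obs=1250)
print(f"Hurdle SR_0: {sr0:.4f}")
print(f"Probability the leader beats the hurdle: {prob:.3f}")
```

Passing a smaller effective trial count lowers the hurdle, which is why K_eff matters when the candidates are highly correlated.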

effective_rank is the recommended default. marchenko_pastur uses the iterative denoising variant, and clustering estimates the number of independent strategy families.
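To make the idea of an effective number of trials concrete, here is a rough sketch of the entropy-based effective rank of the strategy correlation matrix. It assumes this is the kind of quantity the effective_rank option refers to; the library's actual computation may differ, and the function below is a hypothetical standalone helper that uses NumPy.

```python
import numpy as np


def effective_rank(returns: np.ndarray) -> float:
    """Entropy-based effective rank of the column correlation matrix.

    returns has shape (n_periods, n_strategies) with at least two columns.
    """
    corr = np.corrcoef(returns, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    weights = eigvals / eigvals.sum()
    weights = weights[weights > 0]  # 0 * log(0) is treated as 0
    entropy = -np.sum(weights * np.log(weights))
    return float(np.exp(entropy))


rng = np.random.default_rng(0)
independent = rng.normal(size=(5000, 4))
base = rng.normal(size=5000)
duplicated = np.column_stack([base, base, base])

print(f"Independent strategies: {effective_rank(independent):.2f}")
print(f"Duplicated strategies:  {effective_rank(duplicated):.2f}")
```

Independent columns give a value close to the raw count, while copies of one strategy collapse toward a single effective trial.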

See Deflated Sharpe Ratio for details.

Rademacher Anti-Serum (RAS)

RAS detects backtest overfitting by penalizing observed performance with the Rademacher complexity of the full set of candidate strategies:

from ml4t.diagnostic.evaluation.stats import rademacher_complexity, ras_sharpe_adjustment

# returns_matrix shape: (n_periods, n_strategies)
complexity = rademacher_complexity(returns_matrix)
# Per-period (non-annualized) Sharpe ratio of each strategy column
observed_sharpes = returns_matrix.mean(axis=0) / returns_matrix.std(axis=0)
result = ras_sharpe_adjustment(
    observed_sharpe=observed_sharpes,
    complexity=complexity,
    n_samples=returns_matrix.shape[0],
    n_strategies=returns_matrix.shape[1],
    return_result=True,
)

print(f"Number significant after RAS: {result.n_significant}")
print(f"Complexity penalty: {result.complexity:.4f}")

Interpretation

RAS Result | Interpretation
High RAS | Strategy is robust, not overfit
Low RAS | Strategy may be overfit
Negative RAS | Strategy is likely spurious
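For intuition about the complexity term, the empirical Rademacher complexity of a strategy set can be estimated by Monte Carlo: draw random ±1 signs for each period and record how well the best strategy correlates with that noise. The sketch below is a plain-Python illustration with a hypothetical function name; it does not reproduce any standardization or penalty terms that `rademacher_complexity` and `ras_sharpe_adjustment` may apply.

```python
import random


def empirical_rademacher_complexity(returns, n_draws=500, seed=0):
    """Monte Carlo estimate of E[max_s (1/T) * sum_t eps_t * x_{t,s}].

    returns is a list of rows, one row per period and one column per strategy.
    """
    rng = random.Random(seed)
    n_periods = len(returns)
    n_strategies = len(returns[0])
    total = 0.0
    for _ in range(n_draws):
        # One random sign per period, shared by all strategies
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n_periods)]
        best = max(
            sum(s * row[j] for s, row in zip(signs, returns)) / n_periods
            for j in range(n_strategies)
        )
        total += best
    return total / n_draws


# Synthetic data (assumed for illustration): 120 periods, 3 strategies
data_rng = random.Random(42)
returns = [[data_rng.gauss(0.0, 0.01) for _ in range(3)] for _ in range(120)]
print(f"Estimated complexity: {empirical_rademacher_complexity(returns):.5f}")
```

A larger or more diverse strategy set can fit random signs better, so its complexity estimate rises; exact duplicates of an existing strategy add nothing.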

Minimum Track Record Length (MinTRL)

Calculate how long a track record must be before the observed Sharpe ratio is statistically significant:

from ml4t.diagnostic.evaluation.stats import compute_min_trl

result = compute_min_trl(
    sharpe_ratio=1.5,
    target_pvalue=0.05,
    frequency="daily",
)

print(f"Minimum observations: {result.min_observations}")
print(f"Minimum years: {result.min_years:.1f}")
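The calculation behind a minimum track record length can be sketched from the published formula, MinTRL = 1 + [1 - γ3·SR + (γ4 - 1)/4·SR²]·(z_α / (SR - SR*))², where SR is the per-period Sharpe ratio. The helper below is a hypothetical standalone version; it assumes that the 1.5 in the example above is an annualized Sharpe ratio, that there are 252 daily observations per year, and that returns are normal (skew 0, kurtosis 3). `compute_min_trl` may use different conventions.

```python
import math
from statistics import NormalDist


def min_track_record_length(sharpe, benchmark=0.0, alpha=0.05, skew=0.0, kurt=3.0):
    """Minimum number of observations; sharpe and benchmark are per-period."""
    if sharpe <= benchmark:
        raise ValueError("observed Sharpe must exceed the benchmark Sharpe")
    z = NormalDist().inv_cdf(1.0 - alpha)
    shape = 1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe**2
    return 1.0 + shape * (z / (sharpe - benchmark)) ** 2


periods_per_year = 252  # assumed daily frequency
daily_sharpe = 1.5 / math.sqrt(periods_per_year)
n_obs = min_track_record_length(daily_sharpe, alpha=0.05)
print(f"Minimum observations: {math.ceil(n_obs)}")
print(f"Minimum years: {n_obs / periods_per_year:.1f}")
```

Negative skew or fat tails enlarge the bracketed term, so non-normal returns require a longer track record for the same Sharpe ratio.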

MinTRL with Multiple Testing

To control the family-wise error rate (FWER), the probability of at least one false positive, when the strategy was selected from multiple trials:

from ml4t.diagnostic.evaluation.stats import min_trl_fwer

result = min_trl_fwer(
    sharpe_ratio=1.5,
    num_trials=50,
    alpha=0.05
)
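One simple way to see the effect of multiple testing on track record length is a Bonferroni correction, which tests each strategy at alpha / num_trials. The sketch below is an assumption for illustration only: `min_trl_fwer` may use a different correction, and the helper name, the normal-returns simplification, and the treatment of 1.5 as an annualized Sharpe ratio on daily data are all hypothetical.

```python
import math
from statistics import NormalDist


def min_trl_normal(sharpe_per_period, alpha):
    """MinTRL under normal returns and a zero benchmark Sharpe ratio."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return 1.0 + (1.0 + 0.5 * sharpe_per_period**2) * (z / sharpe_per_period) ** 2


daily_sharpe = 1.5 / math.sqrt(252)  # assumed annualized 1.5, daily data
num_trials = 50
alpha = 0.05

single = min_trl_normal(daily_sharpe, alpha)
bonferroni = min_trl_normal(daily_sharpe, alpha / num_trials)
print(f"Single test:          {math.ceil(single)} observations")
print(f"Bonferroni, 50 tests: {math.ceil(bonferroni)} observations")
```

Under these assumptions the required history grows severalfold, because the significance threshold for each individual test tightens from 0.05 to 0.001.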

False Discovery Rate (FDR)

Control the expected proportion of false positives among the hypotheses you reject:

from ml4t.diagnostic.evaluation.stats import benjamini_hochberg_fdr

pvalues = [0.01, 0.03, 0.05, 0.08, 0.12]
rejected = benjamini_hochberg_fdr(p_values=pvalues, alpha=0.05)

# Identify discoveries
discoveries = rejected

Methods

Method | Description
bh | Benjamini-Hochberg (controls FDR under independence or positive dependence)
by | Benjamini-Yekutieli (controls FDR under arbitrary dependence, more conservative)
holm | Holm-Bonferroni (controls FWER)
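The Benjamini-Hochberg step-up rule itself is short enough to sketch by hand: sort the p-values, find the largest rank k whose p-value is at most alpha·k/m, and reject every hypothesis up to that rank. The function below is a standalone illustration with a hypothetical name and illustrative p-values; it is not the library's `benjamini_hochberg_fdr`, whose return type may differ.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value, True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank that passes its threshold
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# 0.03 misses its own threshold (0.025) but is still rejected because
# the next p-value, 0.031, passes its threshold (0.0375).
print(benjamini_hochberg([0.001, 0.03, 0.031, 0.5], alpha=0.05))
# -> [True, True, True, False]
```

The second p-value shows why the procedure is called step-up: a hypothesis is rejected whenever any later-ranked p-value clears its threshold.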

HAC-Adjusted Statistics

Account for heteroskedasticity and autocorrelation (HAC) when testing whether the information coefficient (IC) differs from zero:

from ml4t.diagnostic.evaluation.stats import hac_adjusted_ic

result = hac_adjusted_ic(
    predictions=predictions,
    returns=forward_returns,
    return_details=True,
)

print(f"HAC t-stat: {result['t_stat']:.2f}")
print(f"Bootstrap std error: {result['bootstrap_std']:.4f}")
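As background on what a HAC adjustment does, the sketch below computes a Newey-West t-statistic for the mean of a per-period IC series using Bartlett kernel weights. It is a generic textbook estimator written as a hypothetical helper; `hac_adjusted_ic` may choose lags differently or, as its `bootstrap_std` key suggests, rely on a bootstrap instead.

```python
import math


def newey_west_tstat(series, max_lag):
    """Return (t_stat, std_error) for the mean of series with HAC variance."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    long_run_var = sum(d * d for d in dev) / n  # lag-0 autocovariance
    for lag in range(1, max_lag + 1):
        gamma = sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n
        weight = 1.0 - lag / (max_lag + 1)  # Bartlett kernel weight
        long_run_var += 2.0 * weight * gamma
    std_error = math.sqrt(long_run_var / n)
    return mean / std_error, std_error


# Synthetic, positively autocorrelated IC series (assumed for illustration)
ic_series = ([0.06] * 3 + [-0.02] * 3) * 5
t_naive, se_naive = newey_west_tstat(ic_series, max_lag=0)
t_hac, se_hac = newey_west_tstat(ic_series, max_lag=2)
print(f"Naive t-stat: {t_naive:.2f} (std error {se_naive:.4f})")
print(f"HAC t-stat:   {t_hac:.2f} (std error {se_hac:.4f})")
```

With positive autocorrelation the HAC standard error is larger than the naive one, so the t-statistic shrinks.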

Probability of Backtest Overfitting (PBO)

Estimate the probability that the strategy chosen as best in-sample (IS) performs below the median out-of-sample (OOS):

from ml4t.diagnostic.evaluation.stats import compute_pbo

result = compute_pbo(
    is_performance=is_returns_matrix,
    oos_performance=oos_returns_matrix,
)

print(f"PBO: {result.pbo:.1%}")  # e.g., "32.5%"

Interpretation

PBO | Interpretation
< 10% | Low overfitting risk
10-30% | Moderate risk
> 30% | High overfitting risk
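The PBO statistic can be sketched as follows: for each data split, take the strategy that ranks best in-sample, compute its relative rank out-of-sample, and count the splits where the logit of that rank is at or below zero. The input layout here (one row per split, one column per strategy, each entry a performance metric) and the function name are assumptions for illustration; `compute_pbo` may expect different inputs.

```python
import math


def pbo_from_split_performance(is_perf, oos_perf):
    """Fraction of splits where the IS-best strategy ranks at or below the OOS median."""
    n_strategies = len(is_perf[0])
    n_overfit = 0
    for is_row, oos_row in zip(is_perf, oos_perf):
        best = max(range(n_strategies), key=lambda j: is_row[j])
        # OOS rank of the IS winner: 1 = worst, n_strategies = best
        rank = 1 + sum(1 for v in oos_row if v < oos_row[best])
        omega = rank / (n_strategies + 1)
        logit = math.log(omega / (1.0 - omega))
        if logit <= 0.0:
            n_overfit += 1
    return n_overfit / len(is_perf)


# Illustrative Sharpe ratios for 2 splits and 3 strategies (made-up numbers)
is_perf = [[0.1, 0.5, 0.3], [0.4, 0.2, 0.1]]
oos_perf = [[0.0, 0.4, 0.2], [-0.3, 0.1, 0.2]]
print(f"PBO: {pbo_from_split_performance(is_perf, oos_perf):.1%}")
```

In the first split the in-sample winner also wins out-of-sample, while in the second it comes last, so half of the splits count as overfit.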

See It In The Book

These statistical tests appear repeatedly in the book:

  • FDR, DSR, MinTRL, and PBO: code/07_defining_learning_task/07_multiple_testing.py
  • HAC-adjusted IC in causal and robustness checks: code/07_defining_learning_task/08_causal_sanity_checks.py
  • HAC-adjusted IC plus FDR in the case studies: code/case_studies/*/05_evaluation.py
  • DSR on real backtest returns: code/16_strategy_simulation/12_dsr_validation.py
  • Sharpe inference and RAS workflow: code/16_strategy_simulation/11_sharpe_ratio_inference.py, code/16_strategy_simulation/13_ras_protocol.py

For the chapter-level map, see the Book Guide.

References

  • López de Prado et al. (2025). "How to Use the Sharpe Ratio"
  • Bailey & López de Prado (2014). "The Deflated Sharpe Ratio"
  • Paleologo, G. (2024). Elements of Quantitative Investing