Deflated Sharpe Ratio (DSR)¶
Use DSR when you have evaluated multiple strategy variants and need to know whether the best Sharpe ratio still looks credible after accounting for selection bias.
The Problem¶
You tested 50 parameter combinations for a momentum strategy and selected the one with the highest Sharpe ratio of 1.5. Is this a genuine edge, or is it the inevitable result of picking the maximum from 50 random draws?
Even if every strategy has zero expected return, the best one will show a positive Sharpe ratio simply due to chance. The expected "spurious" Sharpe ratio grows with the number of strategies tested, asymptotically in proportion to \(\sqrt{2 \ln K}\).
For K=50 strategies whose Sharpe ratios have a cross-trial standard deviation of roughly 0.2-0.35, this spurious maximum is around 0.4-0.8 -- enough to look like a tradeable signal. This is selection bias, and it is the most common source of backtest overfitting.
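A small simulation can make the selection effect concrete. The sketch below assumes 50 independent zero-mean strategies, ten years of daily returns, and 1% daily volatility; all of these numbers are illustrative assumptions, not values taken from ml4t-diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_days, n_sims = 50, 2520, 200  # illustrative assumptions

all_sharpes = np.empty((n_sims, n_trials))
for i in range(n_sims):
    # Zero-mean daily returns: no trial has any real edge.
    returns = rng.normal(0.0, 0.01, size=(n_trials, n_days))
    # Annualized Sharpe ratio of each trial in this simulated search.
    all_sharpes[i] = (
        returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)
    )

# The Sharpe ratio a researcher would report after picking the winner.
best_sharpes = all_sharpes.max(axis=1)

print(f"Mean Sharpe across all trials:  {all_sharpes.mean():.2f}")
print(f"Mean best-of-{n_trials} Sharpe: {best_sharpes.mean():.2f}")
```

The average trial has a Sharpe ratio near zero, while the average selected winner does not, even though no strategy has any skill.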
When the trial cohort is highly correlated, raw K overstates the number of independent trials, so the correction can be too conservative.
ml4t-diagnostic therefore also supports correlation-adjusted
K_eff via effective_rank, marchenko_pastur, and clustering.
The Solution¶
The Deflated Sharpe Ratio adjusts an observed Sharpe ratio downward to account for the number of strategies tested. It answers: "What is the probability that the true Sharpe ratio exceeds zero, given that we selected the best of K strategies?"
Instead of testing \(H_0: SR = 0\), DSR tests \(H_0: SR = E[\max\{SR\}]\) -- a much harder threshold that accounts for the selection process.
Mathematical Foundation¶
DSR extends the Probabilistic Sharpe Ratio (PSR) to multiple testing using extreme value theory.
Step 1: Expected Maximum Under Null¶
For K independent strategies, the expected maximum of K standard normals:
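Bailey and Lopez de Prado (2014) approximate this expectation as follows, writing \(Z_k\) for the K standard normal draws, \(\Phi^{-1}\) for the standard normal quantile function, and \(e\) for Euler's number (the notation is chosen here for illustration):

\[
E\left[\max_{k} Z_k\right] \approx (1 - \gamma)\,\Phi^{-1}\!\left(1 - \frac{1}{K}\right) + \gamma\,\Phi^{-1}\!\left(1 - \frac{1}{K e}\right)
\]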
where \(\gamma \approx 0.5772\) is the Euler-Mascheroni constant.
Step 2: Expected Maximum Sharpe Ratio¶
Scale by the empirical standard deviation of Sharpe ratios across strategies:
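Writing \(\operatorname{Var}[\{SR_k\}]\) for the cross-trial variance of the K estimated Sharpe ratios (the variance_trials input) and \(E[\max_k Z_k]\) for the Step 1 expected maximum of K standard normals, the scaling can be written as:

\[
E[\max\{SR\}] \approx \sqrt{\operatorname{Var}[\{SR_k\}]}\; E\left[\max_{k} Z_k\right]
\]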
Step 3: Variance of Sharpe Ratio Estimator¶
Accounting for non-normality (skewness \(\gamma_3\), Pearson kurtosis \(\gamma_4\)):
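A sketch of the usual form, following Bailey and Lopez de Prado (2014), for T return observations and a per-period (non-annualized) Sharpe ratio estimate \(\widehat{SR}\). Whether the library divides by \(T\) or \(T - 1\) is not stated on this page, so the denominator below is an assumption:

\[
\operatorname{Var}[\widehat{SR}] \approx \frac{1}{T - 1}\left(1 - \gamma_3\,\widehat{SR} + \frac{\gamma_4 - 1}{4}\,\widehat{SR}^2\right)
\]

For normally distributed returns (\(\gamma_3 = 0\), \(\gamma_4 = 3\)) this reduces to \((1 + \widehat{SR}^2 / 2) / (T - 1)\).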
Step 4: Deflated Test Statistic¶
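Combining the previous steps, the statistic in Bailey and Lopez de Prado (2014) takes the form below, with \(\Phi\) the standard normal CDF, \(\widehat{SR}\) the observed per-period Sharpe ratio of the selected strategy, \(E[\max\{SR\}]\) the Step 2 threshold, and \(\operatorname{Var}[\widehat{SR}]\) the Step 3 estimator variance:

\[
DSR = \Phi\left(\frac{\widehat{SR} - E[\max\{SR\}]}{\sqrt{\operatorname{Var}[\widehat{SR}]}}\right)
\]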
The output is a probability in [0, 1]. Values above 0.95 indicate the strategy survives multiple testing correction at the 5% level.
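The four steps can be sketched in a few lines of NumPy and SciPy. This is a minimal illustration of the formulas in Bailey and Lopez de Prado (2014), not the ml4t-diagnostic implementation: the function names, the \(T - 1\) denominator, and the input values are assumptions, and the Sharpe ratio is per-period rather than annualized.

```python
import numpy as np
from scipy.stats import norm

EULER_GAMMA = 0.5772156649015329


def expected_max_standard_normal(n_trials: float) -> float:
    """Step 1: approximate E[max] of n_trials independent standard normals."""
    return float(
        (1 - EULER_GAMMA) * norm.ppf(1 - 1 / n_trials)
        + EULER_GAMMA * norm.ppf(1 - 1 / (n_trials * np.e))
    )


def dsr_sketch(sharpe, n_samples, n_trials, variance_trials,
               skewness=0.0, pearson_kurtosis=3.0):
    """Steps 2-4 for a per-period Sharpe ratio (hypothetical helper)."""
    # Step 2: scale the standard-normal maximum by the cross-trial std.
    expected_max = np.sqrt(variance_trials) * expected_max_standard_normal(n_trials)
    # Step 3: estimator variance with skewness and Pearson kurtosis.
    var_sharpe = (
        1 - skewness * sharpe + (pearson_kurtosis - 1) / 4 * sharpe**2
    ) / (n_samples - 1)
    # Step 4: probability that the observed Sharpe clears the hurdle.
    probability = norm.cdf((sharpe - expected_max) / np.sqrt(var_sharpe))
    return float(probability), float(expected_max)


# Same observed Sharpe, judged against raw K and a smaller effective K.
prob_raw, hurdle_raw = dsr_sketch(0.12, 252, 50, 0.03)
prob_eff, hurdle_eff = dsr_sketch(0.12, 252, 8.5, 0.03)
print(f"K=50:  hurdle={hurdle_raw:.3f}, DSR={prob_raw:.4f}")
print(f"K=8.5: hurdle={hurdle_eff:.3f}, DSR={prob_eff:.4f}")
```

A smaller effective trial count lowers the hurdle and raises the resulting probability, which is why the choice of K or K_eff matters as much as the observed Sharpe ratio.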
Minimal Working Example¶
import numpy as np

from ml4t.diagnostic.evaluation.stats import (
    deflated_sharpe_ratio,
    deflated_sharpe_ratio_from_statistics,
    effective_number_of_trials,
)

# Recommended path: pass raw returns for each trial
np.random.seed(42)
trial_returns = [
    np.random.normal(0.0005, 0.01, 252),
    np.random.normal(0.0008, 0.01, 252),
    np.random.normal(0.0002, 0.012, 252),
]

# Standalone estimate of the correlation-adjusted trial count
k_eff = effective_number_of_trials(
    np.column_stack(trial_returns),
    method="effective_rank",
)

result = deflated_sharpe_ratio(
    trial_returns,
    frequency="daily",
    correlation_method="effective_rank",
    min_k_eff=2.0,
)
print(f"Probability of skill: {result.probability:.3f}")
print(f"P-value: {result.p_value:.3f}")
print(f"Expected max from noise: {result.expected_max_sharpe:.3f}")
print(f"Deflated Sharpe: {result.deflated_sharpe:.3f}")
print(f"K_eff: {result.n_trials_effective:.2f} (raw K={result.n_trials_raw})")

# Secondary path: use pre-computed statistics if your pipeline already has them
stats_result = deflated_sharpe_ratio_from_statistics(
    observed_sharpe=0.12,
    n_samples=252,
    n_trials=50,
    variance_trials=0.03,
    frequency="daily",
    effective_trials=8.5,
    correlation_method="effective_rank",
    min_k_eff=2.0,
)
deflated_sharpe_ratio() is the recommended entry point for most users because
it derives the required moments directly from raw returns. Use
deflated_sharpe_ratio_from_statistics() when your pipeline already computes the
Sharpe moments and trial variance upstream.
Key Parameters¶
| Parameter | Description | Guidance |
|---|---|---|
| returns | Single return series or sequence of trial return series | Pass multiple trials for DSR, one series for PSR |
| frequency | Return frequency | "daily" by default; affects annualized display values |
| benchmark_sharpe | Null-hypothesis Sharpe threshold | Leave at 0.0 unless you need a stricter hurdle |
| n_trials | Total strategies tested | Relevant for the statistics-based helper; include all trials |
| correlation_method | Correlation-aware K_eff estimator | Use "effective_rank" first; compare against "marchenko_pastur" or "clustering" when needed |
| min_k_eff | Conservative floor for correlation-adjusted K_eff | Leave at 1.0 by default; raise only when you want an explicit residual multiplicity penalty |
| variance_trials | Var[{SR_1, ..., SR_K}] across all strategies | Must be computed, not assumed, when using the statistics-based helper |
| n_samples | Number of return observations | T >= 50 minimum, >= 252 recommended |
Correlation-Adjusted Trial Counts¶
When your candidate strategies are closely related, raw K can overstate the
true breadth of the search. The library supports three estimators for
correlation-adjusted K_eff:
| Estimator | What it measures | Notes |
|---|---|---|
| effective_rank | entropy-weighted eigenvalue breadth | Recommended default; smooth and cheap |
| marchenko_pastur | non-noise eigenvalue count | Uses the iterative denoising variant |
| clustering | number of independent strategy families | Also swaps in cluster-level Sharpe variance |
The deflated_sharpe_ratio() result reports both n_trials_raw and
n_trials_effective. If min_k_eff > 1, n_trials_effective reflects the
post-floor value actually used in the DSR adjustment.
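To give a feel for what an entropy-based estimator measures, the sketch below computes a generic effective rank (the exponential of the Shannon entropy of the normalized correlation eigenvalues). It is an illustration of the idea only: the function name is hypothetical and the library's effective_rank method may differ in its details.

```python
import numpy as np


def effective_rank_sketch(returns: np.ndarray) -> float:
    """Entropy-based effective rank of the trial correlation matrix.

    returns: array of shape (n_observations, n_trials).
    """
    corr = np.corrcoef(returns, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negatives
    weights = eigvals / eigvals.sum()
    weights = weights[weights > 0]          # treat 0 * log(0) as 0
    entropy = -np.sum(weights * np.log(weights))
    return float(np.exp(entropy))


rng = np.random.default_rng(1)
n_obs, n_trials = 1000, 10

# Ten unrelated trials versus ten near-copies of one common signal.
independent = rng.normal(size=(n_obs, n_trials))
common = rng.normal(size=(n_obs, 1))
correlated = 0.9 * common + 0.1 * rng.normal(size=(n_obs, n_trials))

k_independent = effective_rank_sketch(independent)
k_correlated = effective_rank_sketch(correlated)
print(f"Independent trials: K_eff ~ {k_independent:.2f} of {n_trials}")
print(f"Correlated trials:  K_eff ~ {k_correlated:.2f} of {n_trials}")
```

Unrelated trials keep an effective count close to raw K, while near-duplicate trials collapse toward a single effective trial.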
Interpreting Results¶
| DSR Value | Interpretation | Action |
|---|---|---|
| >= 0.95 | Strategy survives multiple testing at 5% level | Proceed to out-of-sample validation |
| 0.50 - 0.95 | Inconclusive -- may be overfit | Gather more data or reduce strategy space |
| < 0.50 | Strategy is likely explained by selection bias | Reject -- do not deploy |
Critical: DSR is a necessary but not sufficient condition. A high DSR means the strategy might be real. You still need out-of-sample testing, walk-forward validation, and realistic transaction cost modeling.
Common Pitfalls¶
- Fabricating variance_trials

  Using variance_trials=1.0 as a "reasonable default" defeats the purpose. You must compute the actual variance from all K strategies tested. If you don't have access to all K Sharpe ratios, DSR cannot be meaningfully calculated.

- Undercounting trials

  Every parameter variation, feature combination, and lookback period counts as a trial. If you tested 10 signals x 5 lookbacks x 3 thresholds, you ran 150 trials, not 10.

- Mixing return cadence and annualization

  periods_per_year should reflect the cadence of the return series passed to DSR, not the rebalance cadence that generated those returns. Daily crypto return series usually need periods_per_year=365, not 252.

- Misinterpreting the output

  DSR = 0.42 does not mean "the strategy has 42% of its claimed Sharpe." It means "there is a 42% probability the true Sharpe exceeds zero after accounting for multiple testing."

- Ignoring non-normality

  Strategies with negative skewness and fat tails (common in short-volatility and carry strategies) have higher Sharpe ratio estimation variance. Always provide skewness and excess_kurtosis from your actual returns.

- Using DSR without enough data

  With T < 50 observations, the standard error of the Sharpe ratio estimator is so large that DSR becomes unreliable regardless of the observed value.
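Several of these pitfalls come down to computing inputs from the full trial set rather than assuming them. The sketch below shows one way to do that with NumPy and SciPy on simulated data; the grid sizes, return parameters, and variable names are hypothetical, and the results are meant as inputs you would then pass to the statistics-based helper.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(7)

# Hypothetical search: 10 signals x 5 lookbacks x 3 thresholds = 150 trials,
# each represented here by 504 simulated daily returns.
n_trials = 10 * 5 * 3
trial_returns = rng.normal(0.0003, 0.01, size=(n_trials, 504))

# Per-period Sharpe ratio of every trial, not just the winner.
sharpes = trial_returns.mean(axis=1) / trial_returns.std(axis=1, ddof=1)

# Cross-trial variance of Sharpe ratios: computed, never assumed.
variance_trials = float(sharpes.var(ddof=1))

# Moments of the selected strategy's own return series.
best = int(np.argmax(sharpes))
best_skew = float(skew(trial_returns[best]))
best_excess_kurt = float(kurtosis(trial_returns[best], fisher=True))  # normal -> 0
best_pearson_kurt = best_excess_kurt + 3.0                            # normal -> 3

print(f"Trials counted:  {n_trials}")
print(f"variance_trials: {variance_trials:.5f}")
print(f"Best trial skewness: {best_skew:.3f}, excess kurtosis: {best_excess_kurt:.3f}")
```

Note the two kurtosis conventions: excess kurtosis is zero for a normal distribution, while Pearson kurtosis is three, so check which one a given function expects before passing the value along.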
See It In The Book¶
DSR appears in the validation chapters and in reporting workflows that compare many model or parameter variants:
- Ch07 for the statistical foundations
- Ch16-Ch19 for backtest evaluation and reporting pipelines
Use the Book Guide for exact notebook and case-study paths.
References¶
- Lopez de Prado, M., Lipton, A., & Zoonekynd, V. (2025). "How to use the Sharpe Ratio: A multivariate case study." ADIA Lab Research Paper Series, No. 19. Reference implementation: github.com/zoonek/2025-sharpe-ratio
- Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management, 40(5), 94-107.
Relationship to Other Methods¶
| Method | Advantages over DSR | Disadvantages |
|---|---|---|
| RAS | Handles correlated strategies, non-asymptotic bounds | Computationally expensive |
| FDR | Controls false discovery proportion, not just best strategy | No correlation handling |
| PSR | Simpler (single strategy) | No multiple testing correction |
Next Steps¶
- Statistical Tests - Place DSR alongside FDR, RAS, and related checks
- Cross-Validation - Pair DSR with leakage-safe model evaluation
- Validation Tiers - See where DSR fits in the full validation framework
- Book Guide - Jump to the notebook and case-study usage