Combinatorial Purged Cross-Validation (CPCV)¶
Use CPCV when a single walk-forward path is not enough and you need a distribution of out-of-sample outcomes to judge robustness, path dependence, and backtest overfitting.
The Problem¶
You backtested a strategy on 5 years of daily data and got a Sharpe ratio of 1.8. Is this a robust result, or did you overfit to a single historical path?
Standard backtesting gives you one number from one path through history. You have no way to assess the variability of that result. Standard k-fold cross-validation doesn't help either -- it assumes observations are independent, but financial time series have:
- Serial correlation: adjacent returns are dependent
- Overlapping labels: forward-looking targets create information leakage
If your label is "5-day forward return," then a training sample at day 95 has a label computed from prices on days 95-100. If day 98 is in the test set, training on sample 95 leaks test information.
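The day-95 example can be checked mechanically. The sketch below is illustrative only: `label_span` and `leaks_into_test` are hypothetical helper names, and the test block starting at day 98 is an assumed example.

```python
def label_span(sample_idx: int, horizon: int) -> range:
    """Days whose prices feed the label of a sample (start and end day inclusive)."""
    return range(sample_idx, sample_idx + horizon + 1)


def leaks_into_test(sample_idx: int, horizon: int, test_days: set) -> bool:
    """True if any day used by the sample's label falls inside the test set."""
    return any(day in test_days for day in label_span(sample_idx, horizon))


test_days = set(range(98, 110))  # hypothetical test block starting at day 98

print(leaks_into_test(95, 5, test_days))  # label uses days 95-100 -> True
print(leaks_into_test(92, 5, test_days))  # label uses days 92-97  -> False
```

Any training sample for which this check returns `True` has to be removed before fitting, which is what purging does.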
The Solution¶
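CPCV generates a distribution of backtest results instead of a single path. It partitions the time series into N groups, then evaluates the strategy on all \(\binom{N}{k}\) ways to choose k groups as test sets. Each combination produces its own out-of-sample evaluation (called a path on this page) with proper train/test separation. Because different combinations reuse much of the same data, the paths are correlated rather than statistically independent.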
The key innovations over standard cross-validation:
- Purging: removes training samples whose labels overlap with test data
- Embargo: adds buffer zones after test periods to handle autocorrelation
- Combinatorial paths: generates dozens to hundreds of evaluation paths
With a distribution of results, you can ask: "What fraction of backtest paths are profitable?" If less than 50%, the strategy is likely overfit.
Mathematical Foundation¶
Partition and Combination¶
Given T observations, divide into N contiguous groups of approximately T/N samples each. Choose k groups for testing, giving \(\binom{N}{k}\) total combinations:
| Configuration | Combinations | Test Fraction |
|---|---|---|
| N=6, k=2 | 15 | 33% |
| N=8, k=2 | 28 | 25% |
| N=10, k=3 | 120 | 30% |
| N=12, k=4 | 495 | 33% |
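The table values can be reproduced with the standard library. This is a small sketch; `cpcv_counts` is a hypothetical helper, not part of any library API.

```python
from itertools import combinations
from math import comb


def cpcv_counts(n_groups: int, n_test_groups: int) -> dict:
    """Count splits and test coverage for a given (N, k) configuration."""
    return {
        "splits": comb(n_groups, n_test_groups),
        "test_fraction": n_test_groups / n_groups,
        # Each group lands in the test set in C(N-1, k-1) of the splits.
        "tests_per_group": comb(n_groups - 1, n_test_groups - 1),
    }


for n, k in [(6, 2), (8, 2), (10, 3), (12, 4)]:
    print(n, k, cpcv_counts(n, k))

# Enumerate the actual test-group index tuples for a small configuration.
test_group_sets = list(combinations(range(6), 2))
print(len(test_group_sets), test_group_sets[:3])  # 15 [(0, 1), (0, 2), (0, 3)]
```

The `tests_per_group` figure shows how often each sample is evaluated out of sample. It grows quickly with N and k, and this is the main driver of CPCV's computational cost.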
Purging¶
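For a test group spanning indices \([t_s, t_e]\) and label horizon \(h\), the training samples removed by purging are:

\[
\text{Purged} = \{\, i : t_s - h \le i < t_s \,\}
\]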
This eliminates training samples whose forward-looking labels extend into the test period.
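A minimal NumPy sketch of purging for a single test group follows. It assumes the label of sample `i` depends on samples `i` through `i + h` inclusive, as in the day-95 example earlier. `purge_train_indices` is a hypothetical name and not the library's implementation.

```python
import numpy as np


def purge_train_indices(train_idx, test_start, test_end, label_horizon):
    """Drop training samples whose label window [i, i + h] overlaps [test_start, test_end]."""
    train_idx = np.asarray(train_idx)
    label_end = train_idx + label_horizon
    overlaps = (label_end >= test_start) & (train_idx <= test_end)
    return train_idx[~overlaps]


n_samples = 20
test_start, test_end = 10, 14
all_idx = np.arange(n_samples)
train_idx = all_idx[(all_idx < test_start) | (all_idx > test_end)]

purged = purge_train_indices(train_idx, test_start, test_end, label_horizon=3)
print(purged.tolist())  # [0, 1, 2, 3, 4, 5, 6, 15, 16, 17, 18, 19]
```

With `label_horizon=3`, samples 7, 8, and 9 are dropped because their labels reach index 10 or later.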
Embargo¶
After each test group, exclude an additional buffer of \(e\) samples from training:

\[
\text{Embargo} = \{\, i : t_e < i \le t_e + e \,\}
\]
This handles autocorrelation -- samples immediately after a test period may carry correlated information from within the test window.
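The embargo step can be sketched the same way. `apply_embargo` is a hypothetical helper that assumes integer sample positions and a single test group.

```python
import numpy as np


def apply_embargo(train_idx, test_end, embargo_size):
    """Drop the `embargo_size` training samples that immediately follow the test group."""
    train_idx = np.asarray(train_idx)
    in_embargo = (train_idx > test_end) & (train_idx <= test_end + embargo_size)
    return train_idx[~in_embargo]


all_idx = np.arange(20)
test_start, test_end = 10, 14
train_idx = all_idx[(all_idx < test_start) | (all_idx > test_end)]

kept = apply_embargo(train_idx, test_end, embargo_size=2)
print(kept.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 18, 19]
```

Purging removes samples before the test group, while the embargo removes samples after it, so the two are applied together.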
Backtest Overfitting Probability¶
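The probability of backtest overfitting (PBO) is estimated here as the fraction of paths with negative out-of-sample performance:

\[
\text{PBO} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{1}[s_p < 0]
\]

where \(P\) is the number of evaluated paths and \(s_p\) is the score on path \(p\).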
A PBO > 0.50 indicates the strategy is more likely overfit than genuine.
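The estimate used on this page is the share of paths with a negative score, and it takes only a few lines to compute. The scores below are made-up illustrative numbers and `fraction_negative_paths` is a hypothetical helper. Note that Bailey et al. define PBO through a rank-based procedure across many strategy configurations, so treat this as a simplified proxy, not that estimator.

```python
import numpy as np


def fraction_negative_paths(scores) -> float:
    """Share of evaluation paths whose out-of-sample score is below zero."""
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0:
        raise ValueError("at least one path score is required")
    return float(np.mean(scores < 0))


# Illustrative per-path scores (e.g., out-of-sample Sharpe ratios); not real results.
scores = [0.8, -0.2, 1.1, 0.3, -0.5, 0.6, 0.9, -0.1]
print(fraction_negative_paths(scores))  # 0.375
```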
Minimal Working Example¶
```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in model; swap in your own

from ml4t.diagnostic.splitters import CombinatorialCV

# Your time-series data
X = np.random.randn(2000, 10)  # 2000 samples, 10 features
y = np.random.randn(2000)      # Target (e.g., forward returns)

# Configure CPCV
cv = CombinatorialCV(
    n_groups=8,           # Split into 8 time groups
    n_test_groups=2,      # 2 groups for testing per combination → C(8,2) = 28 paths
    label_horizon=5,      # Labels look 5 samples forward (purging)
    embargo_size=2,       # 2-sample buffer after test groups
    max_combinations=20,  # Cap at 20 paths for efficiency
)

# Evaluate your strategy across all paths
scores = []
for train_idx, test_idx in cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Train and evaluate your model
    model = Ridge().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# Analyze distribution of results
pbo = np.mean(np.array(scores) < 0)  # fraction of paths with a negative score
print(f"PBO: {pbo:.1%}")  # > 50% → likely overfit
```
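To build intuition for what a splitter of this kind does, here is a NumPy-only sketch that combines partitioning, combination enumeration, purging, and embargo. It is not the `ml4t` implementation. `cpcv_splits` is a hypothetical function that assumes integer sample positions and labels that look `label_horizon` samples forward.

```python
from itertools import combinations

import numpy as np


def cpcv_splits(n_samples, n_groups, n_test_groups, label_horizon=0, embargo_size=0):
    """Yield (train_idx, test_idx) for every choice of test groups."""
    indices = np.arange(n_samples)
    groups = np.array_split(indices, n_groups)  # contiguous, near-equal blocks
    for test_combo in combinations(range(n_groups), n_test_groups):
        is_test = np.zeros(n_samples, dtype=bool)
        excluded = np.zeros(n_samples, dtype=bool)
        for g in test_combo:
            start, end = int(groups[g][0]), int(groups[g][-1])
            is_test[start:end + 1] = True
            # Purge: labels of samples in [start - h, start) reach into the test block.
            excluded[max(start - label_horizon, 0):start] = True
            # Embargo: skip the e samples right after the test block.
            excluded[end + 1:end + 1 + embargo_size] = True
        yield indices[~is_test & ~excluded], indices[is_test]


splits = list(cpcv_splits(120, n_groups=6, n_test_groups=2, label_horizon=5, embargo_size=2))
print(len(splits))  # 15

train_idx, test_idx = splits[0]  # test groups 0 and 1 cover samples 0-39
print(len(test_idx), len(train_idx), train_idx[0])  # 40 78 42
```

In the first split the test set covers samples 0 to 39, and the two samples after it are embargoed, so training starts at index 42.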
Multi-Asset Support¶
For multi-asset strategies, CPCV handles each asset independently to prevent cross-asset information leakage:
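```python
import polars as pl

from ml4t.diagnostic.splitters import CombinatorialCV

# Panel data with asset identifiers
# (dates, symbols, feature_values, and targets are assumed to be
# equal-length sequences defined elsewhere)
df = pl.DataFrame({
    "date": dates,
    "symbol": symbols,
    "features": feature_values,
    "target": targets,
})

cv = CombinatorialCV(
    n_groups=8,
    n_test_groups=2,
    label_horizon=5,
    embargo_size=2,
)

# groups parameter enables per-asset purging
# (X is assumed to be the feature matrix aligned row-for-row with df)
for train_idx, test_idx in cv.split(X, groups=df["symbol"]):
    # Each split purges correctly within each asset
    pass
```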
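One way to picture per-asset purging is to count the label horizon and embargo in each asset's own rows, not in calendar days. The sketch below is an assumption about how such logic could work and does not describe the library's internals. `panel_train_mask` is a hypothetical helper operating on long-format arrays with integer dates.

```python
import numpy as np


def panel_train_mask(dates, symbols, test_start, test_end, label_horizon, embargo_size):
    """Boolean mask of rows usable for training, purged and embargoed per asset."""
    dates = np.asarray(dates)
    symbols = np.asarray(symbols)
    keep = (dates < test_start) | (dates > test_end)  # rows outside the test window
    for sym in np.unique(symbols):
        rows = np.flatnonzero(symbols == sym)
        rows = rows[np.argsort(dates[rows], kind="stable")]  # this asset's rows in time order
        d = dates[rows]
        n_before = int(np.searchsorted(d, test_start, side="left"))  # rows before the test window
        n_through = int(np.searchsorted(d, test_end, side="right"))  # rows up to its end
        keep[rows[max(n_before - label_horizon, 0):n_before]] = False  # purge
        keep[rows[n_through:n_through + embargo_size]] = False         # embargo
    return keep


# Asset A trades every day; asset B trades every other day.
dates = np.concatenate([np.arange(10), np.arange(0, 10, 2)])
symbols = np.array(["A"] * 10 + ["B"] * 5)

keep = panel_train_mask(dates, symbols, test_start=4, test_end=5, label_horizon=2, embargo_size=1)
print(dates[keep & (symbols == "A")].tolist())  # [0, 1, 7, 8, 9]
print(dates[keep & (symbols == "B")].tolist())  # [8]
```

For asset B the two rows before the test window are dates 0 and 2. A row-based horizon therefore purges further back in calendar time than it does for asset A.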
Key Parameters¶
| Parameter | Description | Guidance |
|---|---|---|
| `n_groups` | Number of time partitions | 6-12 typical; more = more paths but smaller test sets |
| `n_test_groups` | Groups held out for testing per split | 2-4 typical; higher = larger test sets but fewer paths |
| `label_horizon` | Forward-looking label window size | Must match your target definition (e.g., 5 for 5-day returns) |
| `embargo_size` | Buffer after test groups | 1-5 typical; higher for strongly autocorrelated data |
| `max_combinations` | Cap on number of splits | Use when C(N,k) is very large (e.g., C(12,4) = 495) |
For serialized configs and saved fold artifacts, see the CV Configuration guide.
Interpreting Results¶
Probability of Backtest Overfitting (PBO)¶
| PBO Range | Interpretation | Action |
|---|---|---|
| < 0.25 | Strong evidence of genuine strategy | Proceed to live testing |
| 0.25 - 0.50 | Some evidence, but uncertain | Increase data or simplify strategy |
| > 0.50 | More likely overfit than genuine | Reject -- do not deploy |
Distribution Analysis¶
Beyond PBO, examine the full distribution of backtest scores:
- Median performance: more robust than mean (outlier-resistant)
- Score variance: high variance suggests fragile strategy
- Worst path: if worst path is catastrophic, strategy has hidden risks
- Skewness: negative skew means occasional large losses
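The statistics listed above can be gathered in one pass. This is a small sketch with a hypothetical `summarize_scores` helper and made-up scores. Skewness is computed as the standardized third moment using the population standard deviation.

```python
import numpy as np


def summarize_scores(scores) -> dict:
    """Summary statistics for a set of per-path backtest scores."""
    s = np.asarray(scores, dtype=float)
    mean = s.mean()
    std = s.std()  # population standard deviation (ddof=0)
    skew = float(np.mean((s - mean) ** 3) / std**3) if std > 0 else 0.0
    return {
        "median": float(np.median(s)),
        "mean": float(mean),
        "variance": float(s.var(ddof=1)),  # sample variance
        "worst": float(s.min()),
        "skewness": skew,
        "fraction_negative": float(np.mean(s < 0)),
    }


# Illustrative per-path scores; not real results.
scores = [0.8, -0.2, 1.1, 0.3, -0.5, 0.6, 0.9, -0.1]
summary = summarize_scores(scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reading the median and the worst path side by side is often more informative than either alone. A healthy median with a catastrophic worst path points to hidden tail risk.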
Common Pitfalls¶
- Ignoring label horizon: Setting `label_horizon=0` when your target is 5-day forward returns creates severe data leakage. The purging mechanism only works if you accurately specify how far forward your labels look.
- Too few groups: With `n_groups=4, n_test_groups=2`, you get only C(4,2) = 6 paths -- far too few for reliable PBO estimation. Use at least N=8 for meaningful distributions.
- No embargo with intraday data: Intraday data has strong autocorrelation over short horizons. Even with purging, adjacent samples carry correlated microstructure information. Always use `embargo_size >= 1` for intraday strategies.
- Confusing CPCV with standard k-fold: Standard k-fold doesn't purge or embargo. Using `sklearn.model_selection.KFold` on financial time series produces inflated performance estimates. Always use CPCV or WalkForwardCV for temporal data.
- Treating PBO as a p-value: PBO = 0.30 does not mean "30% probability of overfitting." It means "30% of backtest paths showed negative performance." The interpretation depends on the strategy and market conditions.
See It In The Book¶
CPCV is introduced in the validation foundations material and then reused in the case studies for production-style training and evaluation:
- `code/06_strategy_definition/01_cv_foundations.py`
- case-study training workflows under `code/case_studies/*/`
Use the Book Guide for the broader chapter map.
References¶
- Lopez de Prado, M. (2018). "Advances in Financial Machine Learning." Wiley. Chapter 7: Cross-Validation in Finance. Chapter 12: Backtesting through Cross-Validation.
- Bailey, D. H., Borwein, J. M., Lopez de Prado, M., & Zhu, Q. J. (2017). "The Probability of Backtest Overfitting." Journal of Computational Finance, 20(4), 39-69.
Comparison with WalkForwardCV¶
| Property | WalkForwardCV | CombinatorialCV |
|---|---|---|
| # of paths | N (sequential) | C(N,k) (combinatorial) |
| Uses all data for testing | No (expanding window) | Yes (every sample appears in test) |
| Detects overfitting | Limited | Yes (PBO) |
| Calendar-aware | Yes (trading sessions) | Yes (with calendar config) |
| Computational cost | Low | Higher (more paths) |
Next Steps¶
- Cross-Validation - Apply CPCV and compare it to walk-forward validation
- CV Configuration - Serialize configs and persist folds for reruns
- Deflated Sharpe Ratio - Combine path distributions with multiple-testing correction
- Book Guide - Jump to the chapter and case-study implementations