API Reference¶
This reference is organized by import surface rather than by source tree alone.
Recommended Imports¶
| Use case | Import surface |
|---|---|
| Stable application code | ml4t.diagnostic.api |
| Notebook and exploratory work | ml4t.diagnostic |
| Statistical primitives | ml4t.diagnostic.evaluation.stats |
| Splitters and fold persistence | ml4t.diagnostic.splitters |
| Signal analysis | ml4t.diagnostic.signal |
| Backtest bridges | ml4t.diagnostic.integration |
| Plotly figures and dashboards | ml4t.diagnostic.visualization |
Stable API (ml4t.diagnostic.api)¶
Use this module when you want imports that are less sensitive to future re-export cleanup at the package root.
| Category | Objects |
|---|---|
| Validation workflows | ValidatedCrossValidation, ValidatedCrossValidationConfig, validated_cross_val_score, ValidationResult, ValidationFoldResult |
| Diagnostics | FeatureDiagnostics, FeatureDiagnosticsResult, TradeAnalysis, PortfolioAnalysis, BarrierAnalysis |
| Signal analysis | analyze_signal, SignalResult |
| Splitters | CombinatorialCV, WalkForwardCV |
| Metrics | compute_ic_series, compute_ic_hac_stats, compute_mdi_importance, compute_permutation_importance, compute_shap_importance, compute_h_statistic, compute_shap_interactions, analyze_ml_importance, analyze_interactions |
Package-Level Convenience API (ml4t.diagnostic)¶
The package root re-exports the most common classes and configs for interactive use:
| Category | Objects |
|---|---|
| Core workflows | ValidatedCrossValidation, FeatureSelector, analyze_signal, BarrierAnalysis |
| Result types | SignalResult, data-quality schemas |
| Configuration | DiagnosticConfig, StatisticalConfig, PortfolioConfig, TradeConfig, SignalConfig, EventConfig, BarrierConfig, ReportConfig, RuntimeConfig |
| Optional visuals | selected barrier-analysis plot functions when viz dependencies are installed |
Configuration¶
config
¶
ML4T Diagnostic Configuration System.
This module provides comprehensive Pydantic v2 configuration schemas for the ML4T Diagnostic framework, covering:
- Feature Evaluation: Diagnostics, cross-feature analysis, feature-outcome relationships
- Portfolio Evaluation: Risk/return metrics, Bayesian comparison
- Statistical Framework: PSR, MinTRL, DSR, FDR for multiple testing correction
- Reporting: HTML, JSON, visualization settings
Examples:
Quick start with defaults:
Custom configuration:
>>> config = DiagnosticConfig(
... stationarity=StationaritySettings(significance_level=0.01),
... ic=ICSettings(lag_structure=[0, 1, 5, 10]),
... )
Load from YAML:
Use presets:
DiagnosticConfig
¶
Bases: BaseConfig
Consolidated configuration for feature analysis (single-level nesting).
Provides comprehensive feature diagnostics with direct access to all settings:

- config.stationarity.enabled (not config.module_a.stationarity.enabled)
Examples¶
>>> config = DiagnosticConfig(
...     stationarity=StationaritySettings(significance_level=0.01),
...     ic=ICSettings(lag_structure=[0, 1, 5, 10, 21]),
... )
>>> config.to_yaml("diagnostic_config.yaml")
for_quick_analysis
classmethod
¶
Preset for quick exploratory analysis.
Source code in src/ml4t/diagnostic/config/feature_config.py
for_research
classmethod
¶
Preset for academic research (comprehensive).
Source code in src/ml4t/diagnostic/config/feature_config.py
for_production
classmethod
¶
Preset for production monitoring (fast, focused on drift).
Source code in src/ml4t/diagnostic/config/feature_config.py
StatisticalConfig
¶
Bases: BaseConfig
Consolidated configuration for statistical testing.
Orchestrates advanced Sharpe ratio analysis with multiple testing correction.
Examples¶
>>> config = StatisticalConfig(
...     psr=PSRSettings(target_sharpe=1.0),
...     dsr=DSRSettings(n_trials=500),
... )
Or use presets:
>>> config = StatisticalConfig.for_research()
for_quick_check
classmethod
¶
Preset for quick overfitting check (PSR + DSR only).
Source code in src/ml4t/diagnostic/config/sharpe_config.py
for_research
classmethod
¶
Preset for academic research (comprehensive analysis).
Source code in src/ml4t/diagnostic/config/sharpe_config.py
for_publication
classmethod
¶
Preset for academic publication (very conservative).
Source code in src/ml4t/diagnostic/config/sharpe_config.py
PortfolioConfig
¶
Bases: BaseConfig
Consolidated configuration for portfolio evaluation.
Orchestrates portfolio performance analysis with metrics, Bayesian comparison, time aggregation, and drawdown analysis.
Examples¶
>>> config = PortfolioConfig(
...     metrics=MetricsSettings(risk_free_rate=0.02),
...     bayesian=BayesianSettings(enabled=True),
... )
>>> config.to_yaml("portfolio_config.yaml")
for_quick_analysis
classmethod
¶
Preset for quick exploratory analysis.
Source code in src/ml4t/diagnostic/config/portfolio_config.py
for_research
classmethod
¶
Preset for academic research.
Source code in src/ml4t/diagnostic/config/portfolio_config.py
for_production
classmethod
¶
Preset for production monitoring.
Source code in src/ml4t/diagnostic/config/portfolio_config.py
TradeConfig
¶
Bases: BaseConfig
Consolidated configuration for trade analysis.
Combines trade extraction, filtering, SHAP alignment, error pattern clustering, and hypothesis generation into a single configuration.
Examples¶
>>> config = TradeConfig(
...     extraction=ExtractionSettings(n_worst=50),
...     clustering=ClusteringSettings(min_cluster_size=10),
... )
>>> config.to_yaml("trade_config.yaml")
warn_low_min_trades
classmethod
¶
Warn if min_trades is very low.
Source code in src/ml4t/diagnostic/config/trade_analysis_config.py
for_quick_diagnostics
classmethod
¶
Preset for quick diagnostics (minimal clustering).
Source code in src/ml4t/diagnostic/config/trade_analysis_config.py
for_deep_analysis
classmethod
¶
Preset for comprehensive analysis.
Source code in src/ml4t/diagnostic/config/trade_analysis_config.py
for_production
classmethod
¶
Preset for production monitoring (efficient, focused).
Source code in src/ml4t/diagnostic/config/trade_analysis_config.py
SignalConfig
¶
Bases: BaseConfig
Consolidated configuration for signal analysis.
Combines analysis settings, RAS adjustment, visualization, and multi-signal batch analysis into a single configuration class.
Examples¶
>>> config = SignalConfig(
...     analysis=AnalysisSettings(quantiles=10, periods=(1, 5)),
...     visualization=VisualizationSettings(theme="dark"),
... )
>>> config.to_yaml("signal_config.yaml")
validate_quantile_labels_count
¶
Ensure quantile_labels matches quantiles count if provided.
Source code in src/ml4t/diagnostic/config/signal_config.py
EventConfig
¶
Bases: BaseConfig
Configuration for event study analysis.
Configures the event study methodology including window parameters, abnormal return model, and statistical test.
Attributes¶
window : WindowSettings
    Window configuration (estimation and event periods)
model : str
    Model for computing normal/expected returns
test : str
    Statistical test for significance
confidence_level : float
    Confidence level for intervals
min_estimation_obs : int
    Minimum observations in estimation window
Examples¶
>>> config = EventConfig(
...     window=WindowSettings(estimation_start=-252, event_end=10),
...     model="market_model",
...     test="boehmer",
... )
BarrierConfig
¶
Bases: BaseConfig
Consolidated configuration for barrier analysis.
Combines analysis settings, column mappings, and visualization options into a single configuration class.
Examples¶
>>> config = BarrierConfig(
...     analysis=AnalysisSettings(n_quantiles=5),
...     visualization=VisualizationSettings(theme="dark"),
... )
>>> config.to_yaml("barrier_config.yaml")
min_observations_per_quantile
property
¶
Minimum observations per quantile (shortcut).
hit_rate_min_observations
property
¶
Hit rate minimum observations (shortcut).
validate_column_uniqueness
¶
Ensure column names don't conflict.
Source code in src/ml4t/diagnostic/config/barrier_config.py
ReportConfig
¶
Bases: BaseConfig
Top-level configuration for reporting (Module E).
Orchestrates report generation:

- Output formats (HTML, JSON, PDF)
- HTML settings (templates, themes, tables)
- Visualization (plots, colors, interactivity)
- JSON structure
Attributes:
| Name | Type | Description |
|---|---|---|
| output_format | OutputFormatConfig | Output format configuration |
| html | HTMLConfig | HTML report configuration |
| visualization | VisualizationConfig | Visualization configuration |
| json | VisualizationConfig | JSON output configuration |
| lazy_rendering | bool | Don't generate plots until accessed |
| cache_plots | bool | Cache generated plots |
| parallel_plotting | bool | Generate plots in parallel |
| n_jobs | int | Parallel jobs for plotting |
Examples:
>>> # Quick start with defaults
>>> config = ReportConfig()
>>> reporter = Reporter(config)
>>> reporter.generate(results, output_name="my_strategy")
>>> # Custom configuration
>>> config = ReportConfig(
... output_format=OutputFormatConfig(
... formats=[ReportFormat.HTML, ReportFormat.PDF]
... ),
... html=HTMLConfig(
... template=ReportTemplate.SUMMARY,
... theme=ReportTheme.PROFESSIONAL
... ),
... visualization=VisualizationConfig(
... plot_dpi=300,
... save_plots=True
... )
... )
for_quick_report
classmethod
¶
Preset for quick HTML-only report (minimal plots).
Returns:
| Type | Description |
|---|---|
| ReportConfig | Config optimized for speed |
Source code in src/ml4t/diagnostic/config/report_config.py
for_publication
classmethod
¶
Preset for publication-quality reports (high-res, all plots).
Returns:
| Type | Description |
|---|---|
| ReportConfig | Config optimized for publication |
Source code in src/ml4t/diagnostic/config/report_config.py
for_programmatic_access
classmethod
¶
Preset for programmatic access (JSON only, no plots).
Returns:
| Type | Description |
|---|---|
| ReportConfig | Config optimized for API/programmatic use |
Source code in src/ml4t/diagnostic/config/report_config.py
RuntimeConfig
¶
Bases: BaseConfig
Configuration for execution settings.
Centralizes computational resources, caching, and randomness across all evaluation functions. Pass as a separate parameter to analysis functions.
Attributes:
| Name | Type | Description |
|---|---|---|
| n_jobs | int | Number of parallel jobs (-1 for all cores, 1 for serial) |
| cache_enabled | bool | Enable caching of expensive computations |
| cache_dir | Path | Directory for cache storage |
| cache_ttl | int \| None | Cache time-to-live in seconds (None for no expiration) |
| verbose | bool | Enable verbose output |
| random_state | int \| None | Random seed for reproducibility |
Examples:
>>> from ml4t.diagnostic.config import RuntimeConfig, DiagnosticConfig
>>> runtime = RuntimeConfig(n_jobs=4, verbose=True)
>>> result = analyze_features(df, config=DiagnosticConfig(), runtime=runtime)
model_post_init
¶
ValidatedCrossValidationConfig
¶
Bases: BaseConfig
Configuration for ValidatedCrossValidation orchestration.
Signal Analysis¶
signal
¶
Signal analysis for factor/alpha evaluation.
This module provides tools for analyzing the predictive power of signals (factors) for future returns.
Main Entry Point¶
analyze_signal : Compute IC, quantile returns, spread, and turnover for a factor signal. This is the recommended way to use this module.
Example¶
>>> from ml4t.diagnostic.signal import analyze_signal
>>> result = analyze_signal(factor_df, prices_df)
>>> print(result.summary())
>>> result.to_json("results.json")
Building Blocks¶
For custom workflows, use the component functions:
- prepare_data : Join factor with prices and compute forward returns
- compute_ic_series : Compute IC time series
- compute_quantile_returns : Compute returns by quantile
- compute_turnover : Compute factor turnover rate
- filter_outliers : Remove cross-sectional outliers
- quantize_factor : Assign quantile labels
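As an illustration of the per-date computation behind compute_ic_series, here is a self-contained Spearman IC sketch. This is a hypothetical stand-alone helper, not the library's (Polars-based) implementation, and it assumes no ties in either series:

```python
def _ranks(xs):
    # Rank values 1..n (assumes no ties, for simplicity).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    return ranks

def spearman_ic(factor_values, forward_returns):
    # Spearman IC = Pearson correlation of the two rank series.
    rf, rr = _ranks(factor_values), _ranks(forward_returns)
    n = len(rf)
    mf, mr = sum(rf) / n, sum(rr) / n
    cov = sum((a - mf) * (b - mr) for a, b in zip(rf, rr))
    var_f = sum((a - mf) ** 2 for a in rf)
    var_r = sum((b - mr) ** 2 for b in rr)
    return cov / (var_f * var_r) ** 0.5
```

A perfectly monotonic factor gives an IC of 1.0 (or -1.0 when inverted); the library computes one such value per date and aggregates them into the IC time series.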
SignalResult
dataclass
¶
SignalResult(
ic,
ic_std,
ic_t_stat,
ic_p_value,
ic_ir=dict(),
ic_positive_pct=dict(),
ic_series=dict(),
quantile_returns=dict(),
spread=dict(),
spread_t_stat=dict(),
spread_p_value=dict(),
monotonicity=dict(),
ic_dates=dict(),
quantile_returns_std=dict(),
count_by_quantile=dict(),
spread_std=dict(),
turnover=None,
autocorrelation=None,
half_life=None,
n_assets=0,
n_dates=0,
date_range=("", ""),
periods=(),
quantiles=5,
)
Immutable result from signal analysis.
All metrics are keyed by period (e.g., "1D", "5D", "21D").
Attributes¶
ic : dict[str, float]
    Mean IC by period.
ic_std : dict[str, float]
    IC standard deviation by period.
ic_t_stat : dict[str, float]
    T-statistic for IC != 0.
ic_p_value : dict[str, float]
    P-value for IC significance.
ic_ir : dict[str, float]
    Information Ratio (IC mean / IC std) by period.
ic_positive_pct : dict[str, float]
    Percentage of periods with positive IC.
ic_series : dict[str, list[float]]
    IC time series by period.
quantile_returns : dict[str, dict[int, float]]
    Mean returns by period and quantile.
spread : dict[str, float]
    Top minus bottom quantile spread.
spread_t_stat : dict[str, float]
    T-statistic for spread.
spread_p_value : dict[str, float]
    P-value for spread significance.
monotonicity : dict[str, float]
    Rank correlation of quantile returns (how monotonic).
turnover : dict[str, float] | None
    Mean turnover rate by period.
autocorrelation : list[float] | None
    Factor autocorrelation at lags 1, 2, ...
half_life : float | None
    Estimated signal half-life in periods.
n_assets : int
    Number of unique assets.
n_dates : int
    Number of unique dates.
date_range : tuple[str, str]
    (first_date, last_date).
periods : tuple[int, ...]
    Forward return periods analyzed.
quantiles : int
    Number of quantiles used.
summary
¶
Human-readable summary of results.
Source code in src/ml4t/diagnostic/signal/result.py
to_ic_result
¶
Convert to SignalICResult for visualization functions.
Parameters¶
period : int | str | None
    Specific period (e.g. 21 or "21D"). If None, includes all periods aligned to their common date intersection.
Returns¶
SignalICResult Pydantic model compatible with plot_ic_ts, plot_ic_histogram, etc.
Raises¶
ValueError If ic_dates is empty (result created without date capture).
Examples¶
>>> result = analyze_signal(factor_df, prices_df)
>>> plot_ic_ts(result.to_ic_result())
>>> plot_ic_ts(result.to_ic_result(period=21))
Source code in src/ml4t/diagnostic/signal/result.py
to_quantile_result
¶
Convert to QuantileAnalysisResult for visualization functions.
Returns¶
QuantileAnalysisResult Pydantic model compatible with plot_quantile_returns_bar, etc.
Examples¶
>>> result = analyze_signal(factor_df, prices_df)
>>> plot_quantile_returns_bar(result.to_quantile_result())
Source code in src/ml4t/diagnostic/signal/result.py
to_tear_sheet
¶
Convert to full SignalTearSheet for dashboard display.
Bundles to_ic_result() and to_quantile_result() into a SignalTearSheet.
Parameters¶
signal_name : str Name for the signal (used in dashboard title).
Returns¶
SignalTearSheet Pydantic model with IC and quantile analysis components.
Examples¶
>>> result = analyze_signal(factor_df, prices_df)
>>> tear_sheet = result.to_tear_sheet("momentum_21d")
>>> tear_sheet.show()
Source code in src/ml4t/diagnostic/signal/result.py
to_dict
¶
to_json
¶
Export to JSON string or file.
Parameters¶
path : str | None
    If provided, write to file. Otherwise return string.
indent : int
    JSON indentation level.
Returns¶
str JSON string.
Source code in src/ml4t/diagnostic/signal/result.py
from_json
classmethod
¶
Load from JSON file.
Parameters¶
path : str Path to JSON file.
Returns¶
SignalResult Loaded result.
Source code in src/ml4t/diagnostic/signal/result.py
analyze_signal
¶
analyze_signal(
factor,
prices,
*,
periods=(1, 5, 21),
quantiles=5,
filter_zscore=3.0,
quantile_method="quantile",
ic_method="spearman",
compute_turnover_flag=True,
autocorrelation_lags=10,
min_assets=10,
factor_col="factor",
date_col="date",
asset_col="asset",
price_col="price",
)
Analyze a factor signal.
This is the main entry point for signal analysis. Computes IC, quantile returns, spread, monotonicity, and optionally turnover/autocorrelation.
Parameters¶
factor : DataFrame
    Factor data with columns: date, asset, factor. Higher factor values should predict higher returns.
prices : DataFrame
    Price data with columns: date, asset, price.
periods : tuple[int, ...]
    Forward return periods in trading days (default: 1, 5, 21 days).
quantiles : int
    Number of quantiles for grouping assets (default: 5 quintiles).
filter_zscore : float | None
    Z-score threshold for outlier filtering. None disables.
quantile_method : str
    "quantile" (equal frequency) or "uniform" (equal width).
ic_method : str
    "spearman" (rank correlation) or "pearson" (linear correlation).
compute_turnover_flag : bool
    Whether to compute turnover and autocorrelation metrics.
autocorrelation_lags : int
    Number of lags for autocorrelation analysis.
min_assets : int
    Minimum assets per date for IC computation.
factor_col, date_col, asset_col, price_col : str
    Column names.
Returns¶
SignalResult Analysis results with IC, quantile returns, spread, monotonicity, and optionally turnover metrics.
Examples¶
Basic usage:
>>> result = analyze_signal(factor_df, prices_df)
>>> print(result.summary())
>>> result.to_json("results.json")
With custom parameters:
>>> result = analyze_signal(
...     factor_df, prices_df,
...     periods=(1, 5, 21, 63),
...     quantiles=10,
...     ic_method="pearson",
... )
Source code in src/ml4t/diagnostic/signal/core.py
prepare_data
¶
prepare_data(
factor,
prices,
periods=(1, 5, 21),
quantiles=5,
filter_zscore=3.0,
quantile_method="quantile",
factor_col="factor",
date_col="date",
asset_col="asset",
price_col="price",
)
Prepare factor data for analysis.
Joins factor with prices, computes forward returns, filters outliers, and assigns quantiles.
Parameters¶
factor : DataFrame
    Factor data with columns: date, asset, factor.
prices : DataFrame
    Price data with columns: date, asset, price.
periods : tuple[int, ...]
    Forward return periods in trading days.
quantiles : int
    Number of quantiles.
filter_zscore : float | None
    Z-score threshold for outlier filtering. None disables.
quantile_method : str
    "quantile" (equal frequency) or "uniform" (equal width).
factor_col, date_col, asset_col, price_col : str
    Column names.
Returns¶
pl.DataFrame Prepared data with: date, asset, factor, quantile, {period}D_fwd_return.
Source code in src/ml4t/diagnostic/signal/core.py
compute_ic_series
¶
compute_ic_series(
data,
period,
method="spearman",
factor_col="factor",
date_col="date",
asset_col="asset",
min_obs=10,
)
Compute IC time series for a single period.
Parameters¶
data : pl.DataFrame
    Factor data with factor and forward return columns.
period : int
    Forward return period in days.
method : str, default "spearman"
    Correlation method ("spearman" or "pearson").
factor_col : str, default "factor"
    Factor column name.
date_col : str, default "date"
    Date column name.
asset_col : str, default "asset"
    Asset/entity column used for panel joins.
min_obs : int, default 10
    Minimum observations per date.
Returns¶
tuple[list[Any], list[float]] (dates, ic_values) for dates with valid IC.
Source code in src/ml4t/diagnostic/signal/signal_ic.py
compute_ic_summary
¶
Compute summary statistics for an IC series.
Parameters¶
ic_series : list[float] IC values over time.
Returns¶
dict[str, float] mean, std, t_stat, p_value, pct_positive
Source code in src/ml4t/diagnostic/signal/signal_ic.py
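The summary statistics can be sketched in plain Python. This is an illustrative stand-in, not the library's implementation; the p-value is omitted here because it requires the Student-t survival function (e.g. scipy.stats.t.sf):

```python
import math
import statistics

def ic_summary(ic_series):
    # Summary statistics for an IC time series (p-value omitted; it needs
    # the Student-t CDF, e.g. via scipy.stats).
    n = len(ic_series)
    mean = statistics.fmean(ic_series)
    std = statistics.stdev(ic_series)       # sample std (ddof=1)
    t_stat = mean / (std / math.sqrt(n))    # H0: mean IC == 0
    pct_positive = 100.0 * sum(v > 0 for v in ic_series) / n
    return {"mean": mean, "std": std, "t_stat": t_stat, "pct_positive": pct_positive}
```

A mean IC that is small but stable (low std) can still yield a large t-statistic, which is why the IC Information Ratio (mean/std) is reported alongside the mean.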
compute_quantile_returns
¶
Compute mean forward returns by quantile.
Parameters¶
data : pl.DataFrame
    Data with quantile and forward return columns.
period : int
    Forward return period in days.
n_quantiles : int
    Number of quantiles.
quantile_col : str, default "quantile"
    Quantile column name.
Returns¶
dict[int, float] Mean return by quantile (1 = lowest factor).
Source code in src/ml4t/diagnostic/signal/quantile.py
compute_spread
¶
Compute long-short spread and statistics.
Parameters¶
data : pl.DataFrame
    Data with quantile and forward return columns.
period : int
    Forward return period in days.
n_quantiles : int
    Number of quantiles.
quantile_col : str, default "quantile"
    Quantile column name.
Returns¶
dict[str, float] spread, t_stat, p_value
Source code in src/ml4t/diagnostic/signal/quantile.py
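The aggregation behind these two functions can be sketched in plain Python. The helper below is a hypothetical illustration of the point estimates only (the library works on Polars frames and also reports t-statistics and p-values for the spread):

```python
def quantile_returns_and_spread(quantiles, fwd_returns, n_quantiles=5):
    # Mean forward return per quantile (1 = lowest factor) and the
    # top-minus-bottom spread.
    by_q = {}
    for q, r in zip(quantiles, fwd_returns):
        by_q.setdefault(q, []).append(r)
    means = {q: sum(rs) / len(rs) for q, rs in sorted(by_q.items())}
    spread = means[n_quantiles] - means[1]
    return means, spread
```

A positive spread with monotonically increasing quantile means is the pattern a useful long-short signal should produce.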
compute_turnover
¶
Compute mean turnover rate across quantiles.
Turnover = fraction of assets that change quantile each period.
Parameters¶
data : pl.DataFrame
    Data with date, asset, and quantile columns.
n_quantiles : int
    Number of quantiles.
date_col, asset_col, quantile_col : str
    Column names.
Returns¶
float Mean turnover rate (0-1).
Source code in src/ml4t/diagnostic/signal/turnover.py
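A minimal pure-Python sketch of this definition (a hypothetical mean_turnover helper, not the library's Polars implementation; assets missing on either of two consecutive dates are ignored):

```python
def mean_turnover(quantiles_by_date):
    # quantiles_by_date: list (in date order) of {asset: quantile} dicts.
    # Per-step turnover = fraction of assets, present on both dates,
    # whose quantile changed; return the mean across steps.
    rates = []
    for prev, curr in zip(quantiles_by_date, quantiles_by_date[1:]):
        common = set(prev) & set(curr)
        if not common:
            continue
        changed = sum(1 for a in common if prev[a] != curr[a])
        rates.append(changed / len(common))
    return sum(rates) / len(rates) if rates else 0.0
```

High turnover means quantile membership is unstable, so trading the signal incurs more rebalancing cost.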
estimate_half_life
¶
Estimate half-life from autocorrelation decay.
Half-life is the lag where autocorrelation drops to 50% of lag-1 value.
Parameters¶
autocorrelations : list[float] Autocorrelation at lags 1, 2, 3, ...
Returns¶
float | None Half-life in periods, or None if undefined.
Source code in src/ml4t/diagnostic/signal/turnover.py
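One plausible reading of this rule, sketched in plain Python with linear interpolation between adjacent lags (the library's exact interpolation convention may differ):

```python
def estimate_half_life(autocorrelations):
    # autocorrelations[i] is the factor autocorrelation at lag i + 1.
    # Half-life = lag where AC first decays to 50% of the lag-1 value.
    if not autocorrelations or autocorrelations[0] <= 0:
        return None
    target = 0.5 * autocorrelations[0]
    for i in range(1, len(autocorrelations)):
        prev, curr = autocorrelations[i - 1], autocorrelations[i]
        if curr <= target:
            # Crossing happens between lag i (value prev) and lag i+1 (value curr).
            frac = (prev - target) / (prev - curr) if prev != curr else 0.0
            return i + frac
    return None  # never decayed to half within the observed lags
```

For an AC profile [0.8, 0.6, 0.4, 0.2] the half of the lag-1 value (0.4) is reached exactly at lag 3, so the estimate is 3.0.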
Cross-Validation¶
splitters
¶
Time-series cross-validation splitters for financial data.
This module provides cross-validation methods designed specifically for financial time-series data, addressing common issues like data leakage and backtest overfitting.
BaseSplitter
¶
Bases: ABC
Abstract base class for all ml4t-diagnostic time-series splitters.
This class defines the interface that all splitters must implement to ensure compatibility with scikit-learn's model selection tools while providing additional functionality for financial time-series validation.
All splitters should support purging (removing training data that could leak information into test data) and embargo (adding gaps between train and test sets to account for serial correlation).
Session-Aware Splitting¶
Splitters can optionally align fold boundaries to trading session boundaries
by setting align_to_sessions=True. This requires the data to have a
session column (default: 'session_date') that identifies trading sessions.
Trading sessions are atomic units that should never be split across train/test folds. For intraday data (e.g., CME futures with Sunday 5pm - Friday 4pm sessions), this prevents subtle lookahead bias from mid-session splits.
Integration with qdata library:
The session column should be added using the qdata library's session
assignment functionality::
from qdata import DataManager
manager = DataManager()
df = manager.load(symbol="BTC", exchange="CME", calendar="CME_Globex_Crypto")
# df now has 'session_date' column automatically assigned
Or manually using SessionAssigner::
from ml4t.data.sessions import SessionAssigner
assigner = SessionAssigner.from_exchange('CME')
df_with_sessions = assigner.assign_sessions(df)
Then use with ml4t-diagnostic splitters::
from ml4t.diagnostic.splitters import WalkForwardCV
cv = WalkForwardCV(
n_splits=5,
align_to_sessions=True, # Align folds to session boundaries
session_col='session_date'
)
for train_idx, test_idx in cv.split(df_with_sessions):
# Fold boundaries respect session boundaries
pass
split
abstractmethod
¶
Generate indices to split data into training and test sets.
Parameters¶
X : polars.DataFrame, pandas.DataFrame, or numpy.ndarray
    Training data with shape (n_samples, n_features).
y : polars.Series, pandas.Series, numpy.ndarray, or None, default=None
    Target variable with shape (n_samples,). Always ignored but kept for scikit-learn compatibility.
groups : polars.Series, pandas.Series, numpy.ndarray, or None, default=None
    Group labels for samples, used for multi-asset splitting. Shape (n_samples,).
Yields:¶
train : numpy.ndarray
    The training set indices for that split.
test : numpy.ndarray
    The testing set indices for that split.
Notes:¶
The indices returned are integer positions, not labels or timestamps. This ensures compatibility with numpy array indexing and scikit-learn.
Source code in src/ml4t/diagnostic/splitters/base.py
get_n_splits
¶
Return the number of splitting iterations in the cross-validator.
Parameters¶
X : polars.DataFrame, pandas.DataFrame, numpy.ndarray, or None, default=None
    Training data. Some splitters may use properties of X to determine the number of splits.
y : polars.Series, pandas.Series, numpy.ndarray, or None, default=None
    Always ignored, exists for compatibility.
groups : polars.Series, pandas.Series, numpy.ndarray, or None, default=None
    Group labels. Some splitters may use this to determine splits.
Returns:¶
n_splits : int The number of splitting iterations.
Notes:¶
Most splitters can determine the number of splits from their parameters alone, but some (like GroupKFold variants) may need to inspect the data.
Source code in src/ml4t/diagnostic/splitters/base.py
CombinatorialCV
¶
CombinatorialCV(
config=None,
*,
n_groups=8,
n_test_groups=2,
label_horizon=0,
embargo_size=None,
embargo_pct=None,
max_combinations=None,
random_state=None,
align_to_sessions=False,
session_col="session_date",
timestamp_col=None,
isolate_groups=True,
)
Bases: BaseSplitter
Combinatorial Cross-Validation for backtest overfitting detection.
CPCV partitions the time series into N contiguous groups and forms all combinations C(N,k) of choosing k groups for testing. This generates multiple backtest paths instead of a single chronological split, providing a robust assessment of strategy performance and enabling detection of backtest overfitting.
How It Works¶
- Partitioning: Divide time-series data into N contiguous groups of equal size
- Combination Generation: Generate all C(N,k) combinations of choosing k groups for testing
- Label Overlap Removal: For each combination, remove training samples whose labels overlap test data
- Embargo Buffer: Optionally add buffer periods after test groups to exclude autocorrelated samples
- Multi-Asset Handling: When groups are provided, handle each asset independently
Label Horizon (label_horizon)¶
Why needed? When labels are forward-looking (e.g., 5-day returns), training samples near the test set have labels that "see into" the test period. Without removing these, the model trains on information about test outcomes, leading to inflated performance estimates.
How it works: For each test group with range [t_start, t_end]:
1. Remove train samples where: ``t_train >= t_start - label_horizon`` (their forward-looking labels reach into the test window)
2. This ensures no training sample's label period overlaps with test samples
Example::
Test group: samples 100-119 (20 samples)
label_horizon: 5 samples
Removes: training samples 95-99
Reason: Sample 95's label (computed from samples 95-100) overlaps test data
Embargo Buffer (embargo_size)¶
Why needed? Unlike walk-forward CV where training always precedes test, CPCV can have training groups that follow test groups chronologically. Samples immediately after test data may be autocorrelated with it.
How it works: Remove training samples in a buffer zone after each test group:
- **embargo_size**: Absolute number of samples (e.g., 10 samples)
- **embargo_pct**: Percentage of total samples (e.g., 0.01 = 1%)
Example::
Test group: samples 100-119
embargo_size: 5 samples
Additional removal: training samples 120-124
When this matters: If predicting volatility and the test period has a volatility
spike, samples 120-124 likely share similar volatility due to clustering.
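Following the two worked examples above (test 100-119, horizon 5 drops 95-99; embargo 5 drops 120-124), the combined purge-plus-embargo exclusion can be sketched index-wise. The purged_train_indices helper is a hypothetical illustration under those conventions, not the library's implementation (which also supports Timedelta horizons and per-asset handling):

```python
def purged_train_indices(n_samples, test_start, test_end,
                         label_horizon=0, embargo_size=0):
    # Drop the label_horizon samples immediately before the test block
    # (their labels overlap test outcomes) and the embargo_size samples
    # immediately after it (autocorrelation buffer).
    test = set(range(test_start, test_end + 1))
    purge = set(range(max(0, test_start - label_horizon), test_start))
    embargo = set(range(test_end + 1, min(n_samples, test_end + 1 + embargo_size)))
    excluded = test | purge | embargo
    return [i for i in range(n_samples) if i not in excluded]
```

With 200 samples, test block 100-119, horizon 5, and embargo 5, this keeps 170 training samples: everything except indices 95-124.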
Multi-Asset Handling¶
When groups parameter is provided (e.g., asset symbols), CPCV handles
each asset independently. This prevents cross-asset leakage:
Process:

1. For each asset, find its training and test samples
2. Apply label_horizon/embargo only to that asset's data
3. Combine results across all assets
Why Important? Without per-asset handling, information could leak between assets that trade at different times (e.g., European markets vs US markets).
Based on Bailey et al. (2014) "The Probability of Backtest Overfitting" and López de Prado (2018) "Advances in Financial Machine Learning".
Parameters¶
n_groups : int, default=8
    Number of contiguous groups to partition the time series into.
n_test_groups : int, default=2
    Number of groups to use for testing in each combination.
label_horizon : int or pd.Timedelta, default=0
    How far ahead labels look into the future. Removes training samples whose prediction targets overlap with test data.
max_combinations : int, optional
    Maximum number of combinations to generate. If None, generates all C(N,k). Use this to limit computational cost for large N.
random_state : int, optional
    Random seed for combination sampling when max_combinations is set.
align_to_sessions : bool, default=False
    If True, align group boundaries to trading session boundaries. Requires X to have a session column (specified by the session_col parameter).
    Trading sessions should be assigned using the qdata library before cross-validation: use DataManager with exchange/calendar parameters, or SessionAssigner.from_exchange('CME') directly.
session_col : str, default='session_date'
    Name of the column containing session identifiers. Only used if align_to_sessions=True. This column should be added by qdata.sessions.SessionAssigner.
isolate_groups : bool, default=True
    If True, prevent the same group (asset/symbol) from appearing in both train and test sets. Enabled by default for CPCV, as it is designed for multi-asset validation. Requires passing the groups parameter to split() with asset IDs.
    Note: CPCV already applies per-asset purging when groups are provided; this parameter provides an additional group-isolation guarantee.
Attributes:¶
n_groups_ : int
    The number of groups.
n_test_groups_ : int
    The number of test groups.
Examples:¶
>>> import numpy as np
>>> from ml4t.diagnostic.splitters import CombinatorialCV
>>> X = np.arange(200).reshape(200, 1)
>>> cv = CombinatorialCV(n_groups=6, n_test_groups=2, label_horizon=5)
>>> combinations = list(cv.split(X))
>>> print(f"Generated {len(combinations)} combinations")
Generated 15 combinations
Each combination provides train/test indices:
>>> for i, (train, test) in enumerate(combinations[:3]):
...     print(f"Combination {i+1}: Train={len(train)}, Test={len(test)}")
Combination 1: Train=125, Test=50
Combination 2: Train=125, Test=50
Combination 3: Train=125, Test=50
Notes:¶
The total number of combinations is C(n_groups, n_test_groups). For large values, this can become computationally expensive:

- C(8,2) = 28 combinations
- C(10,3) = 120 combinations
- C(12,4) = 495 combinations
Use max_combinations to limit computational cost for large datasets.
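These counts follow directly from the binomial coefficient and can be checked with Python's standard library:

```python
import math

# Number of CPCV backtest combinations: C(n_groups, n_test_groups)
for n_groups, n_test_groups in [(6, 2), (8, 2), (10, 3), (12, 4)]:
    print(f"C({n_groups},{n_test_groups}) = {math.comb(n_groups, n_test_groups)}")
```

The (6, 2) case matches the 15 combinations generated in the example above.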
Initialize CombinatorialCV.
This splitter uses a config-first architecture. You can either:

1. Pass a config object: CombinatorialCV(config=my_config)
2. Pass individual parameters: CombinatorialCV(n_groups=8, n_test_groups=2)
Parameters are automatically converted to a config object internally, ensuring a single source of truth for all validation and logic.
Examples¶
Approach 1, direct parameters (convenient):
>>> cv = CombinatorialCV(n_groups=10, n_test_groups=3)
Approach 2, config object (for serialization/reproducibility):
>>> from ml4t.diagnostic.splitters.config import CombinatorialConfig
>>> config = CombinatorialConfig(n_groups=10, n_test_groups=3)
>>> cv = CombinatorialCV(config=config)
The config can be serialized:
>>> config.to_json("cpcv_config.json")
>>> loaded = CombinatorialConfig.from_json("cpcv_config.json")
>>> cv = CombinatorialCV(config=loaded)
Source code in src/ml4t/diagnostic/splitters/combinatorial.py
get_n_splits
¶
Get number of splits (combinations).
Parameters¶
X : array-like, optional
    Always ignored, exists for compatibility.
y : array-like, optional
    Always ignored, exists for compatibility.
groups : array-like, optional
    Always ignored, exists for compatibility.
Returns¶
n_splits : int
    Number of combinations that will be generated.
Source code in src/ml4t/diagnostic/splitters/combinatorial.py
split
¶
Generate train/test indices for combinatorial splits with purging and embargo.
This method generates all combinations C(N,k) of train/test splits, applying purging and embargo to prevent information leakage. Each yielded split represents an independent backtest path.
Parameters¶
X : DataFrame or ndarray of shape (n_samples, n_features)
    Training data. Must have a datetime index if using Timedelta-based
    label_horizon or embargo_size.
y : Series or ndarray of shape (n_samples,), optional
    Target variable. Not used in splitting logic, but accepted for API
    compatibility with scikit-learn.
groups : Series or ndarray of shape (n_samples,), optional
    Group labels for samples (e.g., asset symbols for multi-asset strategies).
    When provided:
    - Purging is applied independently per group (asset)
    - Prevents information leakage across groups
    - Essential for multi-asset portfolio validation

    Example: groups = df["symbol"]  # ["AAPL", "MSFT", "GOOGL", ...]
Yields¶
train : ndarray of shape (n_train_samples,)
    Indices of training samples for this combination. Purging and embargo
    have been applied to remove:
    - Samples overlapping with test labels (purging)
    - Samples in embargo buffer after test groups (embargo)
test : ndarray of shape (n_test_samples,)
    Indices of test samples for this combination. Consists of samples from
    the k selected test groups.
Raises¶
ValueError
    If X has incompatible shape or missing required columns (e.g.,
    session_col when align_to_sessions=True).
TypeError
    If X index is not datetime when using Timedelta parameters.
Notes¶
Number of Combinations: Generates C(n_groups, n_test_groups) combinations. For example:

- C(8,2) = 28 combinations
- C(10,3) = 120 combinations
- C(12,4) = 495 combinations
Use the `max_combinations` parameter to limit the number of splits generated.
Purging Logic: For each test group:

1. Identify test sample range [t_start, t_end]
2. Remove training samples where t_train > t_start - label_horizon
3. This prevents training on samples whose labels overlap with the test period

Embargo Logic: After purging, additionally remove training samples:

- In range [t_end + 1, t_end + embargo_size]
- This accounts for serial correlation in financial time series
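The purging and embargo steps above can be sketched in plain NumPy (a simplified illustration with integer positions and a single test block, not the library's implementation):

```python
import numpy as np

def purge_and_embargo(train_idx, test_idx, label_horizon, embargo_size):
    """Drop training positions whose labels overlap the test block (purging)
    and positions inside the post-test embargo buffer."""
    t_start, t_end = test_idx.min(), test_idx.max()
    # Purging: remove pre-test training samples with t_train > t_start - label_horizon
    purged = train_idx[
        ~((train_idx > t_start - label_horizon) & (train_idx < t_start))
    ]
    # Embargo: remove samples in [t_end + 1, t_end + embargo_size]
    return purged[~((purged > t_end) & (purged <= t_end + embargo_size))]

all_idx = np.arange(20)
test = np.arange(8, 12)                     # test block: positions 8-11
train = np.setdiff1d(all_idx, test)
kept = purge_and_embargo(train, test, label_horizon=3, embargo_size=2)
print(kept)  # positions 6, 7 (purged) and 12, 13 (embargoed) are removed
```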
Multi-Asset Handling:
When groups is provided:
1. For each asset, find its training and test indices
2. Apply purging/embargo independently to that asset's data
3. Combine purged results across all assets
4. This prevents cross-asset information leakage
Session Alignment:
When align_to_sessions=True:
- Group boundaries align to trading session boundaries
- Ensures each group contains complete trading days/sessions
- Requires X to have column specified by session_col parameter
Examples¶
Basic usage with purging:
>>> import polars as pl
>>> from ml4t.diagnostic.splitters import CombinatorialCV
>>>
>>> # Create sample data
>>> n = 1000
>>> X = pl.DataFrame({"feature1": range(n), "feature2": range(n, 2*n)})
>>> y = pl.Series(range(n))
>>>
>>> # Configure CPCV
>>> cv = CombinatorialCV(
... n_groups=8,
... n_test_groups=2,
... label_horizon=5,
... embargo_size=2
... )
>>>
>>> # Generate splits
>>> for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
... print(f"Fold {fold}: Train={len(train_idx)}, Test={len(test_idx)}")
Fold 0: Train=739, Test=250
Fold 1: Train=739, Test=250
...
Multi-asset usage:
>>> # Multi-asset data with symbol column
>>> symbols = pl.Series(["AAPL"] * 250 + ["MSFT"] * 250 +
... ["GOOGL"] * 250 + ["AMZN"] * 250)
>>>
>>> cv = CombinatorialCV(
... n_groups=6,
... n_test_groups=2,
... label_horizon=5,
... embargo_size=2,
... isolate_groups=True
... )
>>>
>>> for train_idx, test_idx in cv.split(X, groups=symbols):
... # Purging applied independently per asset
... train_symbols = symbols[train_idx].unique()
... test_symbols = symbols[test_idx].unique()
Session-aligned usage:
>>> import pandas as pd
>>>
>>> # Intraday data with session dates
>>> df = pd.DataFrame({
... "timestamp": pd.date_range("2024-01-01", periods=1000, freq="1min"),
... "session_date": pd.date_range("2024-01-01", periods=1000, freq="1min").date,
... "feature1": range(1000)
... })
>>>
>>> cv = CombinatorialCV(
... n_groups=10,
... n_test_groups=2,
... label_horizon=pd.Timedelta(minutes=30),
... embargo_size=pd.Timedelta(minutes=15),
... align_to_sessions=True,
... session_col="session_date"
... )
>>>
>>> for train_idx, test_idx in cv.split(df):
... # Group boundaries aligned to session boundaries
... pass
See Also¶
CombinatorialConfig : Configuration object for CPCV parameters apply_purging_and_embargo : Low-level purging/embargo function BaseSplitter : Base class for all splitters
Source code in src/ml4t/diagnostic/splitters/combinatorial.py
CombinatorialConfig
¶
Bases: SplitterConfig
Configuration for Combinatorial Cross-Validation (CPCV).
Combinatorial CV is designed for multi-asset strategies and combating overfitting by creating multiple test sets from combinatorial group selections.
Reference: Bailey & Lopez de Prado (2014) "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality"
Attributes¶
n_groups : int
    Number of groups to partition the timeline into (typically 8-12).
n_test_groups : int
    Number of groups used for each test set (typically 2-3).
    Total folds = C(n_groups, n_test_groups).
max_combinations : int | None
    Maximum number of folds to generate. If C(n_groups, n_test_groups)
    > max_combinations, randomly sample.
contiguous_test_blocks : bool
    If True, only use contiguous test groups (reduces overfitting).
    If False, allow any combination (more folds).
validate_n_test_groups
classmethod
¶
Validate that n_test_groups < n_groups (must leave groups for training).
Source code in src/ml4t/diagnostic/splitters/config.py
validate_embargo_mutual_exclusivity
¶
Validate that embargo_td and embargo_pct are mutually exclusive.
Source code in src/ml4t/diagnostic/splitters/config.py
WalkForwardCV
¶
WalkForwardCV(
config=None,
*,
n_splits=5,
test_size=None,
train_size=None,
gap=0,
label_horizon=0,
embargo_size=None,
embargo_pct=None,
expanding=True,
consecutive=False,
calendar=None,
align_to_sessions=False,
session_col="session_date",
timestamp_col=None,
isolate_groups=False,
test_period=None,
test_start=None,
test_end=None,
fold_direction="forward",
)
Bases: BaseSplitter
Walk-forward cross-validator for time-series data.
Walk-forward CV creates sequential train/test splits where training data always precedes test data. Includes optional safeguards against data leakage from overlapping labels and autocorrelation.
Parameters¶
n_splits : int, default=5
    Number of splits to generate.
test_size : int, float, str, or None, optional
    Size of each test set:
    - If int: number of samples (e.g., 1000)
    - If float: proportion of dataset (e.g., 0.1)
    - If str: time period using pandas offset aliases (e.g., "4W", "30D", "3M")
    - If None: uses 1 / (n_splits + 1)
    Time-based specifications require X to have a DatetimeIndex.
train_size : int, float, str, or None, optional
    Size of each training set:
    - If int: number of samples (e.g., 10000)
    - If float: proportion of dataset (e.g., 0.5)
    - If str: time period using pandas offset aliases (e.g., "78W", "6M", "2Y")
    - If None: uses all available data before the test set
    Time-based specifications require X to have a DatetimeIndex.
gap : int, default=0
    Gap between training and test set (in addition to label_horizon).
label_horizon : int or pd.Timedelta, default=0
    How far ahead labels look into the future. Removes training samples whose
    prediction targets overlap with validation/test data.

    Example: If predicting 5-day forward returns, a training sample at day 95
    has a label computed from prices on days 95-100. If validation starts at
    day 98, this training sample's label "sees" validation data, creating
    leakage. Setting label_horizon=5 removes training samples from days 93-97.
consecutive : bool, default=False
    If True, uses consecutive (back-to-back) test periods with no gaps. This
    is appropriate for walk-forward validation where you want to simulate
    realistic trading with sequential validation periods. If False, spreads
    test periods across the dataset to sample different time periods (useful
    for testing robustness across market regimes).
calendar : str, CalendarConfig, or TradingCalendar, optional
    Trading calendar for calendar-aware time period calculations.
    - If str: name of a pandas_market_calendars calendar (e.g., 'CME_Equity',
      'NYSE'); creates a default CalendarConfig with UTC timezone
    - If CalendarConfig: full configuration with exchange, timezone, and options
    - If TradingCalendar: pre-configured calendar instance
    - If None: uses naive time-based calculation (backward compatible)

    For intraday data with time-based test_size/train_size (e.g., '4W'), using
    a calendar ensures proper session-aware splitting:
    - Trading sessions are atomic units (won't split Sunday 5pm - Friday 4pm)
    - Handles varying data density in activity-based data (dollar bars, trade bars)
    - Proper timezone handling for tz-naive and tz-aware data
    - '1D' selections: complete trading sessions
    - '4W' selections: complete trading weeks (e.g., 4 weeks of 5 sessions each)

    Examples:

    >>> from ml4t.diagnostic.splitters.calendar_config import CME_CONFIG
    >>> cv = WalkForwardCV(test_size='4W', calendar=CME_CONFIG)  # CME futures
    >>> cv = WalkForwardCV(test_size='1W', calendar='NYSE')  # US equities (simple)
align_to_sessions : bool, default=False
    If True, align fold boundaries to trading session boundaries. Requires X
    to have a session column (specified by the session_col parameter).

    Trading sessions should be assigned using the qdata library before
    cross-validation:
    - Use DataManager with exchange/calendar parameters, or
    - Use SessionAssigner.from_exchange('CME') directly

    When enabled, fold boundaries will never split a trading session,
    preventing subtle lookahead bias in intraday strategies.
session_col : str, default='session_date'
    Name of the column containing session identifiers. Only used if
    align_to_sessions=True. This column should be added by
    qdata.sessions.SessionAssigner.
isolate_groups : bool, default=False
    If True, prevent the same group (asset/symbol) from appearing in both
    train and test sets. This is critical for multi-asset validation to avoid
    data leakage. Requires passing the groups parameter to split() with
    asset IDs.

    Example:

    >>> cv = WalkForwardCV(n_splits=5, isolate_groups=True)
    >>> for train, test in cv.split(df, groups=df['symbol']):
    ...     # train and test will have completely different symbols
    ...     pass
Attributes¶
n_splits_ : int
    The number of splits.
Examples¶
>>> import numpy as np
>>> from ml4t.diagnostic.splitters import WalkForwardCV
>>> X = np.arange(100).reshape(100, 1)
>>> cv = WalkForwardCV(n_splits=3, label_horizon=5, embargo_size=2)
>>> for train, test in cv.split(X):
...     print(f"Train: {len(train)}, Test: {len(test)}")
Train: 17, Test: 25
Train: 40, Test: 25
Train: 63, Test: 25
Initialize WalkForwardCV.
This splitter uses a config-first architecture. You can either:

1. Pass a config object: `WalkForwardCV(config=my_config)`
2. Pass individual parameters: `WalkForwardCV(n_splits=5, test_size=100)`
Parameters are automatically converted to a config object internally, ensuring a single source of truth for all validation and logic.
Examples¶
>>> # Approach 1: Direct parameters (convenient)
>>> cv = WalkForwardCV(n_splits=5, test_size=100)
>>> # Approach 2: Config object (for serialization/reproducibility)
>>> from ml4t.diagnostic.splitters.config import WalkForwardConfig
>>> config = WalkForwardConfig(n_splits=5, test_size=100)
>>> cv = WalkForwardCV(config=config)
>>> # Approach 3: With held-out test period
>>> cv = WalkForwardCV(
...     n_splits=5,
...     test_period="52D",  # Reserve most recent 52 days for final evaluation
...     test_size=20,  # 20-day validation folds
...     train_size=252,  # 1-year training windows
...     label_horizon=5,  # 5 trading days gap
...     calendar="NYSE",  # NYSE trading calendar
...     fold_direction="backward",  # Folds step backward from test
... )
>>> # Validation folds (step backward from held-out test)
>>> for train_idx, val_idx in cv.split(X):
...     model.fit(X.iloc[train_idx], y.iloc[train_idx])
>>> # Final evaluation on held-out test
>>> test_score = model.score(X.iloc[cv.test_indices_], y.iloc[cv.test_indices_])
Source code in src/ml4t/diagnostic/splitters/walk_forward.py
test_indices_
property
¶
Held-out test indices (populated after split() is called).
Returns¶
ndarray
    Indices reserved for the held-out test period.
Raises¶
ValueError
    If no held-out test is configured or split() hasn't been called.
Examples¶
>>> cv = WalkForwardCV(n_splits=5, test_period="52D")
>>> for train_idx, val_idx in cv.split(X):
...     pass  # Training loop
>>> # Now test_indices_ is available
>>> final_score = model.score(X.iloc[cv.test_indices_], y.iloc[cv.test_indices_])
get_n_splits
¶
Get number of splits.
Parameters¶
X : array-like, optional
    Always ignored, exists for compatibility.
y : array-like, optional
    Always ignored, exists for compatibility.
groups : array-like, optional
    Always ignored, exists for compatibility.
Returns¶
n_splits : int
    Number of splits.
Source code in src/ml4t/diagnostic/splitters/walk_forward.py
split
¶
Generate train/validation indices for walk-forward splits.
When a held-out test period is configured (test_period or test_start), this method yields train/validation splits for cross-validation, and the held-out test indices are accessible via test_indices_ property.
Parameters¶
X : array-like of shape (n_samples, n_features)
    Training data.
y : array-like of shape (n_samples,), optional
    Target variable.
groups : array-like of shape (n_samples,), optional
    Group labels for samples.
Yields¶
train : ndarray
    Training set indices for this split.
val : ndarray
    Validation set indices for this split (or test if no held-out test).
Notes¶
When using held-out test mode with fold_direction="backward":
Validation folds step backward from the test boundary, ensuring that all validation is done on data chronologically before the held-out test.
Source code in src/ml4t/diagnostic/splitters/walk_forward.py
WalkForwardConfig
¶
Bases: SplitterConfig
Configuration for Walk-Forward Cross-Validation.
Walk-forward validation is the standard approach for time-series backtesting, where the model is trained on historical data and tested on future periods.
Attributes¶
test_size : int | float | str | None
Size of validation folds. Alias: val_size.
- int: Number of samples (or sessions if align_to_sessions=True)
- float: Proportion of dataset (0.0 to 1.0)
- str: Time-based ('4W', '3M') - NOT supported with align_to_sessions=True
- None: Auto-calculated to maintain equal test set sizes
train_size : int | float | str | None
Training set size specification (same format as test_size).
If None, uses expanding window (all data before test set).
step_size : int | None
Step size between consecutive splits:
- int: Number of samples (or sessions if align_to_sessions=True)
- None: Defaults to test_size (non-overlapping test sets)
test_period : int | str | None
Held-out test period specification (reserves most recent data for final evaluation):
- int: Number of trading days (requires calendar_id)
- str: Time-based ('52D', '4W')
- None: No held-out test period (default, legacy behavior)
test_start : date | str | None
Explicit start date for held-out test period. Mutually exclusive with test_period.
Accepts date object or ISO format string ('2024-01-01').
Alias: holdout_start.
test_end : date | str | None
Explicit end date for held-out test period. Default: end of data.
Accepts date object or ISO format string ('2024-12-31').
Alias: holdout_end.
fold_direction : Literal["forward", "backward"]
Direction of validation folds:
- "forward": Traditional walk-forward (folds step forward in time)
- "backward": Folds step backward from held-out test boundary
calendar_id : str | None
Trading calendar for trading-day-aware gap calculations.
Examples: "NYSE", "CME_Equity", "LSE"
Required when label_horizon is int and you want trading-day interpretation.
validate_size_with_sessions
classmethod
¶
Validate that time-based sizes are not used with session alignment.
Source code in src/ml4t/diagnostic/splitters/config.py
validate_test_dates
classmethod
¶
Convert string dates to date objects.
Source code in src/ml4t/diagnostic/splitters/config.py
validate_test_period
classmethod
¶
Validate test_period specification.
Source code in src/ml4t/diagnostic/splitters/config.py
validate_calendar_and_sessions
¶
Warn when align_to_sessions is used alongside calendar_id.
Source code in src/ml4t/diagnostic/splitters/config.py
validate_held_out_test_config
¶
Validate held-out test configuration consistency.
Source code in src/ml4t/diagnostic/splitters/config.py
SplitterConfig
¶
Bases: BaseConfig
Base configuration for all cross-validation splitters.
All splitter configs inherit from this class to ensure consistent serialization, validation, and reproducibility.
Attributes¶
n_splits : int
    Number of cross-validation folds.
label_horizon : int or pd.Timedelta
    Gap between train_end and val_start sized to the label horizon. Removes
    training samples whose prediction targets overlap with validation/test
    data ("label buffer").

    Example: If predicting 5-day forward returns, a training sample at day 95
    has a label computed from days 95-100. If the test set starts at day 98,
    this training sample's label "sees" test data, creating leakage. Setting
    label_horizon=5 removes training samples from days 93-97.

    Aliases: label_buffer is accepted as an equivalent input name.
align_to_sessions : bool
    If True, fold boundaries are aligned to trading session boundaries.
    Requires a 'session_date' column in the data (from
    ml4t.data.sessions.SessionAssigner).
session_col : str
    Column name containing session identifiers. Default: 'session_date'
    (standard qdata column name).
isolate_groups : bool
    If True, ensures no overlap between train/test group identifiers. Useful
    for multi-asset validation to prevent data leakage.
validate_label_horizon
classmethod
¶
Validate label_horizon is either int >= 0 or a timedelta-like object.
Source code in src/ml4t/diagnostic/splitters/config.py
validate_embargo_td
classmethod
¶
Validate embargo_td is either None, int >= 0, or a timedelta-like object.
Source code in src/ml4t/diagnostic/splitters/config.py
save_config
¶
Save splitter configuration to disk.
This is a convenience wrapper around config.to_json() for consistency with the persistence API.
Parameters¶
config : SplitterConfig
    Configuration object to save.
filepath : str or Path
    Path to save configuration (JSON format).
Examples¶
>>> from ml4t.diagnostic.splitters.config import WalkForwardConfig
>>> config = WalkForwardConfig(n_splits=5, test_size=100)
>>> save_config(config, "cv_config.json")
Source code in src/ml4t/diagnostic/splitters/persistence.py
load_config
¶
Load splitter configuration from disk.
This is a convenience wrapper around config_class.from_json() for consistency with the persistence API.
Parameters¶
filepath : str or Path
    Path to saved configuration (JSON format).
config_class : type
    Configuration class to instantiate (e.g., WalkForwardConfig).
Returns¶
config : SplitterConfig
    Loaded configuration object.
Examples¶
>>> from ml4t.diagnostic.splitters.config import WalkForwardConfig
>>> config = load_config("cv_config.json", WalkForwardConfig)
>>> print(config.n_splits)
Source code in src/ml4t/diagnostic/splitters/persistence.py
save_folds
¶
Save cross-validation folds to disk.
Parameters¶
folds : list[tuple[NDArray, NDArray]]
    List of (train_indices, test_indices) tuples from CV splitter.
X : array-like or DataFrame
    Original data used for splitting (for timestamp extraction if DataFrame).
filepath : str or Path
    Path to save fold configuration (JSON format).
metadata : dict, optional
    Additional metadata to store (e.g., splitter config, data info).
include_timestamps : bool, default=True
    If True and X is a DataFrame with DatetimeIndex, save timestamps alongside
    indices for better human readability.
Examples¶
>>> from ml4t.diagnostic.splitters import WalkForwardCV
>>> cv = WalkForwardCV(n_splits=5, test_size=100)
>>> folds = list(cv.split(X))
>>> save_folds(folds, X, "cv_folds.json", metadata={"n_splits": 5})
Source code in src/ml4t/diagnostic/splitters/persistence.py
load_folds
¶
Load cross-validation folds from disk.
Parameters¶
filepath : str or Path
    Path to saved fold configuration (JSON format).
Returns¶
folds : list[tuple[NDArray, NDArray]]
    List of (train_indices, test_indices) tuples.
metadata : dict
    Metadata dictionary stored with folds.
Examples¶
>>> folds, metadata = load_folds("cv_folds.json")
>>> print(f"Loaded {len(folds)} folds")
>>> print(f"Metadata: {metadata}")
Source code in src/ml4t/diagnostic/splitters/persistence.py
verify_folds
¶
Verify fold integrity and compute statistics.
Parameters¶
folds : list[tuple[NDArray, NDArray]]
    List of (train_indices, test_indices) tuples.
n_samples : int
    Total number of samples in dataset.
Returns¶
stats : dict
    Dictionary containing fold statistics and validation results.
Examples¶
>>> folds, _ = load_folds("cv_folds.json")
>>> stats = verify_folds(folds, n_samples=1000)
>>> print(f"Valid: {stats['valid']}")
>>> print(f"Coverage: {stats['coverage']:.1%}")
Source code in src/ml4t/diagnostic/splitters/persistence.py
Evaluation Workflows¶
These workflows live under ml4t.diagnostic.evaluation:
| Area | Objects |
|---|---|
| Generic orchestration | Evaluator, EvaluationResult, ValidatedCrossValidation |
| Feature and signal diagnostics | FeatureDiagnostics, MultiSignalAnalysis, analyze_ml_importance, compute_ic_hac_stats |
| Portfolio and backtest evaluation | PortfolioAnalysis, factor attribution helpers |
| Trade diagnostics | TradeAnalysis, TradeShapAnalyzer, TradeShapResult |
| Event and barrier workflows | EventStudyAnalysis, BarrierAnalysis |
Statistical Tests¶
stats
¶
Statistical tests for financial ML evaluation.
This package implements advanced statistical tests used in ml4t-diagnostic's Three-Tier Framework:
Multiple Testing Corrections:

- Deflated Sharpe Ratio (DSR) for selection bias correction
- Rademacher Anti-Serum (RAS) for correlation-aware multiple testing
- False Discovery Rate (FDR) and Family-Wise Error Rate (FWER) corrections

Time Series Inference:

- HAC-adjusted Information Coefficient for autocorrelated data
- Stationary bootstrap for temporal dependence preservation

Strategy Comparison:

- White's Reality Check for multiple strategy comparison
- Probability of Backtest Overfitting (PBO)

All tests are implemented with:

- Mathematical correctness validated against academic references
- Proper handling of autocorrelation and heteroskedasticity
- Numerical stability for edge cases
- Support for both single and multiple hypothesis testing
Module Decomposition (v1.4+)¶
The stats package is organized into focused modules:
Sharpe Ratio Analysis:

- moments.py: Return statistics (Sharpe, skewness, kurtosis, autocorr)
- sharpe_inference.py: Variance estimation, expected max calculation
- minimum_track_record.py: Minimum Track Record Length
- backtest_overfitting.py: Probability of Backtest Overfitting
- deflated_sharpe_ratio.py: DSR/PSR orchestration layer (main entry points)

Other Statistical Tests:

- rademacher_adjustment.py: Rademacher complexity and RAS adjustments
- bootstrap.py: Stationary bootstrap methods
- hac_standard_errors.py: HAC-adjusted IC estimation
- false_discovery_rate.py: FDR and FWER corrections
- reality_check.py: White's Reality Check
All original imports are preserved for backward compatibility.
deflated_sharpe_ratio_from_statistics
¶
deflated_sharpe_ratio_from_statistics(
observed_sharpe,
n_samples,
n_trials=1,
variance_trials=0.0,
benchmark_sharpe=0.0,
skewness=0.0,
excess_kurtosis=0.0,
autocorrelation=0.0,
confidence_level=0.95,
frequency="daily",
periods_per_year=None,
)
Compute DSR/PSR from pre-computed statistics.
Use this when you have already computed the required statistics.
For most users, deflated_sharpe_ratio() with raw returns is recommended.
Parameters¶
observed_sharpe : float
    Observed Sharpe ratio at native frequency.
n_samples : int
    Number of return observations (T).
n_trials : int, default 1
    Number of strategies tested (K).
variance_trials : float, default 0.0
    Cross-sectional variance of Sharpe ratios.
benchmark_sharpe : float, default 0.0
    Null hypothesis threshold.
skewness : float, default 0.0
    Return skewness.
excess_kurtosis : float, default 0.0
    Return excess kurtosis (Fisher, normal=0).
autocorrelation : float, default 0.0
    First-order autocorrelation.
confidence_level : float, default 0.95
    Confidence level for testing.
frequency : {"daily", "weekly", "monthly"}, default "daily"
    Return frequency.
periods_per_year : int, optional
    Periods per year.
Returns¶
DSRResult
    Same as deflated_sharpe_ratio().
Source code in src/ml4t/diagnostic/evaluation/stats/deflated_sharpe_ratio.py
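For intuition, the probabilistic Sharpe ratio (PSR) that DSR builds on can be sketched from the same pre-computed statistics. This is an illustrative implementation of the standard Bailey & López de Prado PSR formula, not this library's code; it ignores the multiple-testing deflation and autocorrelation adjustments:

```python
from math import sqrt
from statistics import NormalDist

def psr(sr_hat, sr_benchmark, n, skew=0.0, excess_kurt=0.0):
    """Probabilistic Sharpe Ratio: P(true SR > benchmark) under the
    asymptotic distribution of the estimated Sharpe ratio."""
    kurt = excess_kurt + 3.0  # convert Fisher (excess) convention to raw kurtosis
    denom = sqrt(1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat**2)
    z = (sr_hat - sr_benchmark) * sqrt(n - 1) / denom
    return NormalDist().cdf(z)

# A daily SR of 0.1 over 252 observations vs. a zero benchmark
print(f"{psr(0.1, 0.0, 252):.3f}")
```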
compute_min_trl
¶
compute_min_trl(
returns=None,
observed_sharpe=None,
target_sharpe=0.0,
confidence_level=0.95,
frequency="daily",
periods_per_year=None,
*,
skewness=None,
excess_kurtosis=None,
autocorrelation=None,
)
Compute Minimum Track Record Length (MinTRL).
MinTRL is the minimum number of observations required to reject the null hypothesis (SR <= target) at the specified confidence level.
Parameters¶
returns : array-like, optional
    Return series. If provided, statistics are computed from it.
observed_sharpe : float, optional
    Observed Sharpe ratio. Required if returns not provided.
target_sharpe : float, default 0.0
    Null hypothesis threshold (SR₀).
confidence_level : float, default 0.95
    Required confidence level (1 - α).
frequency : {"daily", "weekly", "monthly"}, default "daily"
    Return frequency.
periods_per_year : int, optional
    Periods per year (for converting to calendar time).
skewness : float, optional
    Override computed skewness.
excess_kurtosis : float, optional
    Override computed excess kurtosis (Fisher convention, normal=0).
autocorrelation : float, optional
    Override computed autocorrelation.
Returns¶
MinTRLResult
    Results including min_trl, min_trl_years, and adequacy assessment.
    min_trl can be math.inf if observed SR <= target SR.
Examples¶
From returns:
>>> result = compute_min_trl(daily_returns, frequency="daily")
>>> print(f"Need {result.min_trl_years:.1f} years of data")
From statistics:
>>> result = compute_min_trl(
...     observed_sharpe=0.5,
...     target_sharpe=0.0,
...     confidence_level=0.95,
...     skewness=-1.0,
...     excess_kurtosis=2.0,
...     autocorrelation=0.1,
... )
Source code in src/ml4t/diagnostic/evaluation/stats/minimum_track_record.py
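For intuition, MinTRL can be sketched from the same statistics, assuming the standard Bailey & López de Prado formula; the library implementation may differ in adjustment details (e.g., autocorrelation handling):

```python
from statistics import NormalDist

def min_trl(sr_hat, sr_target=0.0, confidence=0.95, skew=0.0, excess_kurt=0.0):
    """Minimum number of observations needed to reject SR <= sr_target
    at the given confidence level."""
    if sr_hat <= sr_target:
        return float("inf")  # the null can never be rejected
    z = NormalDist().inv_cdf(confidence)
    kurt = excess_kurt + 3.0  # Fisher (excess) -> raw kurtosis
    variance_term = 1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat**2
    return 1 + variance_term * (z / (sr_hat - sr_target)) ** 2

# Daily SR of 0.1 vs. a zero benchmark at 95% confidence
print(round(min_trl(0.1)))  # roughly a year-plus of daily observations
```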
min_trl_fwer
¶
min_trl_fwer(
observed_sharpe,
n_trials,
variance_trials,
target_sharpe=0.0,
confidence_level=0.95,
frequency="daily",
periods_per_year=None,
*,
skewness=0.0,
excess_kurtosis=0.0,
autocorrelation=0.0,
)
Compute MinTRL under FWER multiple testing adjustment.
When selecting the best strategy from K trials, the MinTRL must be adjusted to account for the selection bias.
Parameters¶
observed_sharpe : float
    Observed Sharpe ratio of the best strategy.
n_trials : int
    Number of strategies tested (K).
variance_trials : float
    Cross-sectional variance of Sharpe ratios.
target_sharpe : float, default 0.0
    Original null hypothesis threshold.
confidence_level : float, default 0.95
    Required confidence level.
frequency : {"daily", "weekly", "monthly"}, default "daily"
    Return frequency.
periods_per_year : int, optional
    Periods per year.
skewness : float, default 0.0
    Return skewness.
excess_kurtosis : float, default 0.0
    Return excess kurtosis (Fisher, normal=0).
autocorrelation : float, default 0.0
    Return autocorrelation.
Returns¶
MinTRLResult
    Results with min_trl adjusted for multiple testing.
Source code in src/ml4t/diagnostic/evaluation/stats/minimum_track_record.py
compute_pbo
¶
Compute Probability of Backtest Overfitting (PBO).
PBO measures the probability that a strategy selected as best in-sample performs below median out-of-sample. A high PBO indicates overfitting.
Definition¶
From Bailey & López de Prado (2014):
    PBO = P( rank_OOS(argmax_IS) > N/2 )
In plain English: what's the probability that the best in-sample strategy ranks in the bottom half out-of-sample?
Interpretation¶
- PBO = 0%: No overfitting (best IS is also best OOS)
- PBO = 50%: Random selection (IS performance uncorrelated with OOS)
- PBO > 50%: Severe overfitting (IS selection is counterproductive)
Parameters¶
is_performance : np.ndarray, shape (n_folds, n_strategies) or (n_combinations,)
In-sample performance metrics (Sharpe, IC, returns) for each strategy.
oos_performance : np.ndarray, shape (n_folds, n_strategies) or (n_combinations,)
Out-of-sample performance metrics (same structure as is_performance).
Returns¶
PBOResult
Result object with PBO and diagnostic metrics. Call .interpret() for a human-readable assessment.
Raises¶
ValueError
If arrays have different shapes or fewer than 2 strategies.
Examples¶
import numpy as np
# 10 CV folds, 5 strategies
is_perf = np.random.randn(10, 5)
oos_perf = np.random.randn(10, 5)
result = compute_pbo(is_perf, oos_perf)
print(result.interpret())
References¶
Bailey, D. H., & López de Prado, M. (2014). "The Probability of Backtest Overfitting." Journal of Computational Finance, 20(4), 39-69.
Source code in src/ml4t/diagnostic/evaluation/stats/backtest_overfitting.py
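The definition above can be computed directly from paired in-sample/out-of-sample performance matrices. The sketch below (a hypothetical `pbo_sketch`, not the library's `compute_pbo`, which may use the full CSCV machinery) simply counts the folds where the in-sample winner lands in the bottom half out-of-sample:

```python
import numpy as np

def pbo_sketch(is_perf, oos_perf):
    """Fraction of folds where the IS-best strategy ranks in the bottom
    half OOS -- a direct reading of the Bailey & Lopez de Prado definition."""
    is_perf = np.atleast_2d(np.asarray(is_perf, dtype=float))
    oos_perf = np.atleast_2d(np.asarray(oos_perf, dtype=float))
    n_folds, n_strategies = is_perf.shape
    below_median = 0
    for k in range(n_folds):
        best = np.argmax(is_perf[k])                     # best strategy in-sample
        rank = (oos_perf[k] < oos_perf[k, best]).sum()   # OOS rank (0 = worst)
        if rank < n_strategies / 2:                      # bottom half OOS
            below_median += 1
    return below_median / n_folds
```

Perfectly aligned IS/OOS rankings give PBO = 0; perfectly inverted rankings give PBO = 1, matching the interpretation table above.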
ras_ic_adjustment
¶
ras_ic_adjustment(
observed_ic,
complexity,
n_samples,
delta=0.05,
kappa=0.02,
return_result=False,
)
Apply RAS adjustment for Information Coefficients (bounded metrics).
Computes conservative lower bounds on true IC values accounting for data snooping and estimation error.
Formula (Hoeffding concentration for |IC| ≤ κ):
θₙ ≥ θ̂ₙ - 2R̂ - 2κ√(log(2/δ)/T)
          ───   ───────────────
          (a)         (b)

where

(a) = data snooping penalty from testing N strategies
(b) = estimation error for a bounded r.v. (Hoeffding's inequality)
Parameters¶
observed_ic : ndarray of shape (N,)
Observed Information Coefficients for N strategies.
complexity : float
Rademacher complexity R̂ from rademacher_complexity().
n_samples : int
Number of time periods T used to compute ICs.
delta : float, default=0.05
Significance level (1 - confidence). Lower = more conservative.
kappa : float, default=0.02
Bound on |IC|. Critical parameter.
Practical guidance (Paleologo 2024, p.273):
- κ=0.02: Typical alpha signals
- κ=0.05: High-conviction signals
- κ=1.0: Theoretical maximum (usually too conservative)
return_result : bool, default=False
If True, return RASResult dataclass with full diagnostics.
Returns¶
ndarray or RASResult
If return_result=False: Adjusted IC lower bounds (N,).
If return_result=True: RASResult with full diagnostics.
Raises¶
ValueError
If inputs are invalid or observed ICs exceed the kappa bound.
Warns¶
UserWarning
If any |observed_ic| > κ (theoretical guarantee violated).
Notes¶
Derivation:
1. Data snooping: Standard Rademacher generalization bound gives 2R̂.
2. Estimation: For a bounded r.v. |X| ≤ κ, Hoeffding gives P(|X̂ - X| > t) ≤ 2exp(-Tt²/2κ²). Setting the RHS equal to δ yields t = κ√(2 log(2/δ)/T), with a conservative factor 2 for the two-sided bound.

Advantages over DSR:
- Accounts for strategy correlation (R̂ decreases as correlation increases)
- Non-asymptotic (valid for any T)
- Zero false positives in Paleologo's simulations
Examples¶
import numpy as np
X = np.random.randn(2500, 500) * 0.02
observed_ic = X.mean(axis=0)
R_hat = rademacher_complexity(X)
result = ras_ic_adjustment(observed_ic, R_hat, 2500, return_result=True)
print(f"Significant: {result.n_significant}/{len(observed_ic)}")
References¶
.. [1] Paleologo (2024), Section 8.3.2, Procedure 8.1. .. [2] Hoeffding (1963), "Probability inequalities for sums of bounded random variables", JASA 58:13-30.
Source code in src/ml4t/diagnostic/evaluation/stats/rademacher_adjustment.py
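Once R̂ is known, the lower bound is a one-line computation. The illustrative helper below (a hypothetical `ras_ic_sketch`) applies the formula above verbatim, skipping the library's input validation and warnings:

```python
import numpy as np

def ras_ic_sketch(observed_ic, complexity, n_samples, delta=0.05, kappa=0.02):
    # Lower bound: theta >= theta_hat - 2*R_hat - 2*kappa*sqrt(log(2/delta)/T)
    observed_ic = np.asarray(observed_ic, dtype=float)
    penalty = 2.0 * complexity + 2.0 * kappa * np.sqrt(np.log(2.0 / delta) / n_samples)
    return observed_ic - penalty
```

The penalty shrinks as T grows and as R̂ falls, so correlated strategy sets (low R̂) and long samples retain more of the observed IC.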
ras_sharpe_adjustment
¶
ras_sharpe_adjustment(
observed_sharpe,
complexity,
n_samples,
n_strategies,
delta=0.05,
return_result=False,
)
Apply RAS adjustment for Sharpe ratios (sub-Gaussian metrics).
Computes conservative lower bounds on true Sharpe ratios accounting for data snooping, estimation error, and multiple testing.
Formula (sub-Gaussian concentration + union bound):
θₙ ≥ θ̂ₙ - 2R̂ - 3√(2 log(2/δ)/T) - √(2 log(2N/δ)/T)
          ───   ────────────────   ─────────────────
          (a)         (b)                 (c)

where

(a) = data snooping penalty
(b) = sub-Gaussian estimation error (factor 3 for conservatism)
(c) = union bound over N strategies
Parameters¶
observed_sharpe : ndarray of shape (N,)
Observed (annualized) Sharpe ratios for N strategies.
complexity : float
Rademacher complexity R̂ from rademacher_complexity().
n_samples : int
Number of time periods T used to compute Sharpe ratios.
n_strategies : int
Total number of strategies N tested.
delta : float, default=0.05
Significance level (1 - confidence). Lower = more conservative.
return_result : bool, default=False
If True, return RASResult dataclass with full diagnostics.
Returns¶
ndarray or RASResult
If return_result=False: Adjusted Sharpe lower bounds (N,).
If return_result=True: RASResult with full diagnostics.
Notes¶
Derivation:
1. Data snooping: 2R̂ (standard Rademacher bound).
2. Sub-Gaussian error: For σ²-sub-Gaussian X, P(X > t) ≤ exp(-t²/2σ²). Daily returns typically have σ ≈ 1 when standardized; the factor 3 provides conservatism for heavier tails.
3. Union bound: P(∃n: |X̂ₙ - Xₙ| > t) ≤ N × single-strategy bound, contributing the √(2 log(2N/δ)/T) term.

Comparison to DSR:
- DSR assumes independent strategies (overpenalizes correlated ones)
- RAS captures correlation via R̂ (correlated → lower R̂ → less penalty)
- RAS is non-asymptotic; DSR requires large T
Examples¶
import numpy as np
returns = np.random.randn(252, 100) * 0.01  # 100 strategies, 1 year
observed_sr = returns.mean(axis=0) / returns.std(axis=0) * np.sqrt(252)
R_hat = rademacher_complexity(returns)
result = ras_sharpe_adjustment(observed_sr, R_hat, 252, 100, return_result=True)
print(f"Significant: {result.n_significant}/100")
References¶
.. [1] Paleologo (2024), Section 8.3.2, Procedure 8.2.
Source code in src/ml4t/diagnostic/evaluation/stats/rademacher_adjustment.py
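As with the IC variant, the Sharpe bound is a direct application of the formula above. The hypothetical `ras_sharpe_sketch` below shows the three penalty terms separately (validation and diagnostics omitted):

```python
import numpy as np

def ras_sharpe_sketch(observed_sharpe, complexity, n_samples, n_strategies, delta=0.05):
    T, N = n_samples, n_strategies
    snoop = 2.0 * complexity                              # (a) data snooping
    est_err = 3.0 * np.sqrt(2.0 * np.log(2.0 / delta) / T)  # (b) sub-Gaussian error
    union = np.sqrt(2.0 * np.log(2.0 * N / delta) / T)      # (c) union bound over N
    return np.asarray(observed_sharpe, dtype=float) - snoop - est_err - union
```

Only term (c) grows with N, and only logarithmically, so the penalty for testing more strategies is mild compared with the estimation-error term.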
benjamini_hochberg_fdr
¶
Apply Benjamini-Hochberg False Discovery Rate correction.
Controls the False Discovery Rate (FDR) - the expected proportion of false discoveries among the rejected hypotheses. More powerful than Bonferroni correction for multiple hypothesis testing.
Based on Benjamini & Hochberg (1995): "Controlling the False Discovery Rate"
Parameters¶
p_values : Sequence[float]
P-values from multiple hypothesis tests.
alpha : float, default 0.05
Target FDR level (e.g., 0.05 for 5% FDR).
return_details : bool, default False
Whether to return detailed information.
Returns¶
Union[NDArray, dict]
If return_details=False: Boolean array of rejected hypotheses.
If return_details=True: dict with 'rejected', 'adjusted_p_values', 'critical_values', 'n_rejected'.
Examples¶
p_values = [0.001, 0.01, 0.03, 0.08, 0.12]
rejected = benjamini_hochberg_fdr(p_values, alpha=0.05)
print(f"Rejected: {rejected}")
Rejected: [ True True True False False]
Source code in src/ml4t/diagnostic/evaluation/stats/false_discovery_rate.py
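The BH step-up rule is short enough to sketch directly. This is an illustrative re-implementation (hypothetical `bh_sketch`), not the library's `benjamini_hochberg_fdr`:

```python
import numpy as np

def bh_sketch(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k
    is the largest i with p_(i) <= (i/m) * alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    crit = alpha * np.arange(1, m + 1) / m   # critical values for sorted p's
    below = p[order] <= crit
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])     # largest index satisfying the rule
        rejected[order[: k + 1]] = True      # reject everything up to it
    return rejected
```

On the docstring's example p-values this reproduces the documented [True, True, True, False, False] pattern: 0.03 survives because its critical value at rank 3 is (3/5)·0.05 = 0.03.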
holm_bonferroni
¶
Holm-Bonferroni step-down procedure for FWER control.
Controls the Family-Wise Error Rate (FWER) - the probability of making at least one false discovery. More powerful than Bonferroni correction while maintaining strong FWER control.
Based on Holm (1979): "A Simple Sequentially Rejective Multiple Test Procedure"
Parameters¶
p_values : Sequence[float]
P-values from multiple hypothesis tests.
alpha : float, default 0.05
Target FWER significance level.
Returns¶
dict
Dictionary with:
- rejected: list[bool] - Whether each hypothesis is rejected
- adjusted_p_values: list[float] - Holm-adjusted p-values
- n_rejected: int - Number of rejections
- critical_values: list[float] - Holm critical thresholds
Notes¶
The Holm procedure is a step-down method:
- Sort p-values ascending: p_(1) <= p_(2) <= ... <= p_(m)
- For p_(i), compare to alpha / (m - i + 1)
- Reject all hypotheses up to (and including) the last rejection
- Stop at first non-rejection; accept remaining hypotheses
This is uniformly more powerful than Bonferroni while controlling FWER.
Examples¶
p_values = [0.001, 0.01, 0.03, 0.08, 0.12]
result = holm_bonferroni(p_values, alpha=0.05)
print(f"Rejected: {result['rejected']}")
Rejected: [True, True, False, False, False]
Source code in src/ml4t/diagnostic/evaluation/stats/false_discovery_rate.py
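The step-down procedure in the Notes can be sketched in a few lines. This illustrative `holm_sketch` returns only the rejection flags, not the full dictionary the library produces:

```python
def holm_sketch(p_values, alpha=0.05):
    """Holm step-down: walk p-values in ascending order, comparing the i-th
    smallest to alpha / (m - i + 1); stop at the first non-rejection."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):   # thresholds 0.01, 0.0125, ...
            rejected[idx] = True
        else:
            break                                  # accept all remaining hypotheses
    return rejected
```

Note the contrast with BH above: on the same example p-values, Holm rejects only two hypotheses (0.03 fails its threshold of 0.05/3 ≈ 0.0167), reflecting the stricter FWER guarantee.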
multiple_testing_summary
¶
Summarize results from multiple statistical tests with corrections.
Provides a comprehensive summary of multiple hypothesis testing results with appropriate corrections for multiple comparisons.
Parameters¶
test_results : Sequence[dict]
List of test result dictionaries (each should have a 'p_value' key).
method : str, default "benjamini_hochberg"
Multiple testing correction method.
alpha : float, default 0.05
Significance level.
Returns¶
dict
Summary with original and corrected results.
Examples¶
results = [{'name': 'Strategy A', 'p_value': 0.01},
           {'name': 'Strategy B', 'p_value': 0.08}]
summary = multiple_testing_summary(results)
print(f"Significant after correction: {summary['n_significant_corrected']}")
Source code in src/ml4t/diagnostic/evaluation/stats/false_discovery_rate.py
robust_ic
¶
Calculate Information Coefficient with robust standard errors.
Uses stationary bootstrap [1]_ to compute standard errors that properly account for temporal dependence in time series data.
The stationary bootstrap is the correct method because:
1. Preserves temporal dependence structure
2. No asymptotic approximations required
3. Theoretically valid for rank correlation (Spearman IC)
Parameters¶
predictions : Union[pl.Series, pd.Series, NDArray]
Model predictions or scores.
returns : Union[pl.Series, pd.Series, NDArray]
Forward returns corresponding to predictions.
n_samples : int, default 1000
Number of bootstrap samples.
return_details : bool, default False
Whether to return detailed statistics.
Returns¶
Union[dict, float]
If return_details=False: t-statistic (IC / bootstrap_std).
If return_details=True: dict with 'ic', 'bootstrap_std', 't_stat', 'p_value', 'ci_lower', 'ci_upper'.
Examples¶
predictions = np.random.randn(252)
returns = 0.1 * predictions + np.random.randn(252) * 0.5
result = robust_ic(predictions, returns, return_details=True)
print(f"IC: {result['ic']:.3f}, t-stat: {result['t_stat']:.3f}")
References¶
.. [1] Politis, D.N. & Romano, J.P. (1994). "The Stationary Bootstrap." Journal of the American Statistical Association 89:1303-1313.
Source code in src/ml4t/diagnostic/evaluation/stats/hac_standard_errors.py
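To make the mechanics concrete, the sketch below implements a minimal stationary bootstrap of the Spearman IC: blocks of geometrically distributed length are sampled with circular wrap-around, and the IC is recomputed on each resample. The helper names, the `avg_block` default, and the tie-free rank computation are illustrative choices, not the library's implementation:

```python
import numpy as np

def _spearman(x, y):
    # Rank correlation via ranks + Pearson; assumes no ties (continuous data)
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def robust_ic_sketch(predictions, returns, n_samples=1000, avg_block=10, seed=0):
    """Bootstrap std error of the Spearman IC via the stationary bootstrap
    (Politis & Romano 1994). Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    x = np.asarray(predictions, dtype=float)
    y = np.asarray(returns, dtype=float)
    T = len(x)
    ic = _spearman(x, y)
    p = 1.0 / avg_block                      # geometric block-length parameter
    boot = np.empty(n_samples)
    for b in range(n_samples):
        idx = np.empty(T, dtype=int)
        t = rng.integers(T)
        for i in range(T):
            idx[i] = t
            # continue the block with prob 1-p, else restart at a random point
            t = (t + 1) % T if rng.random() > p else rng.integers(T)
        boot[b] = _spearman(x[idx], y[idx])  # resampled pairs keep alignment
    std = boot.std(ddof=1)
    return {"ic": ic, "bootstrap_std": std, "t_stat": ic / std}
```

Resampling index blocks (rather than individual observations) is what preserves the serial dependence that ordinary i.i.d. bootstrapping would destroy.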
whites_reality_check
¶
whites_reality_check(
returns_benchmark,
returns_strategies,
bootstrap_samples=1000,
block_size=None,
random_state=None,
)
Perform White's Reality Check for multiple strategy comparison.
Tests whether any strategy significantly outperforms a benchmark after adjusting for multiple comparisons and data mining bias. Uses stationary bootstrap to preserve temporal dependencies.
Parameters¶
returns_benchmark : Union[pl.Series, pd.Series, NDArray]
Benchmark strategy returns.
returns_strategies : Union[pd.DataFrame, pl.DataFrame, NDArray]
Returns for multiple strategies being tested.
bootstrap_samples : int, default 1000
Number of bootstrap samples for null distribution.
block_size : Optional[int], default None
Block size for stationary bootstrap. If None, uses optimal size.
random_state : Optional[int], default None
Random seed for reproducible results.
Returns¶
dict
Dictionary with 'test_statistic', 'p_value', 'critical_values', 'best_strategy_performance', 'null_distribution'.
Notes¶
Test Hypothesis:
- H0: No strategy beats the benchmark (max E[r_i - r_benchmark] <= 0)
- H1: At least one strategy beats the benchmark

Interpretation:
- p_value < 0.05: Reject H0, at least one strategy beats benchmark
- p_value >= 0.05: Cannot reject H0, no evidence of outperformance
Examples¶
benchmark_returns = np.random.normal(0.001, 0.02, 252)
strategy_returns = np.random.normal(0.002, 0.02, (252, 10))
result = whites_reality_check(benchmark_returns, strategy_returns)
print(f"Reality Check p-value: {result['p_value']:.3f}")
References¶
White, H. (2000). "A Reality Check for Data Snooping." Econometrica, 68(5), 1097-1126.
Source code in src/ml4t/diagnostic/evaluation/stats/reality_check.py
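The core of the test is a max-statistic over excess returns, compared against a recentered stationary-bootstrap null. The sketch below (hypothetical `reality_check_sketch`; block length, seed handling, and the omission of critical values are simplifications) shows the shape of the computation:

```python
import numpy as np

def reality_check_sketch(bench, strategies, n_boot=500, avg_block=10, seed=0):
    """Sketch of White's (2000) Reality Check: max sqrt(T)*mean excess return,
    with a stationary-bootstrap null distribution. Illustrative only."""
    rng = np.random.default_rng(seed)
    bench = np.asarray(bench, dtype=float)
    strat = np.atleast_2d(np.asarray(strategies, dtype=float))
    if strat.shape[0] != len(bench):
        strat = strat.T                        # accept (K, T) or (T, K) input
    d = strat - bench[:, None]                 # excess returns, shape (T, K)
    T = len(bench)
    stat = np.sqrt(T) * d.mean(axis=0).max()   # observed test statistic
    p = 1.0 / avg_block
    null = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.empty(T, dtype=int)
        t = rng.integers(T)
        for i in range(T):
            idx[i] = t
            t = (t + 1) % T if rng.random() > p else rng.integers(T)
        # recenter so the bootstrap draws come from the H0 world (zero mean)
        null[b] = np.sqrt(T) * (d[idx] - d.mean(axis=0)).mean(axis=0).max()
    return {"test_statistic": stat, "p_value": (null >= stat).mean()}
```

Taking the maximum across strategies inside each bootstrap draw is what delivers the data-snooping adjustment: the null distribution reflects how large the best-looking strategy would appear by chance alone.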
Integration¶
The integration surface focuses on contracts and the ml4t-backtest bridge:
| Category | Objects |
|---|---|
| Contracts | TradeRecord, DataQualityReport, DataQualityMetrics, DataAnomaly, BacktestReportMetadata |
| Backtest bridge | compute_metrics_from_result, analyze_backtest_result, portfolio_analysis_from_result |
| Tearsheet generation | generate_tearsheet_from_result, profile_from_run_artifacts, generate_tearsheet_from_run_artifacts |
Visualization¶
The visualization namespace is Plotly-first and grouped by workflow:
| Area | Representative functions |
|---|---|
| Cross-validation | plot_cv_folds |
| Signal analysis | plot_ic_ts, plot_quantile_returns_bar, SignalDashboard, MultiSignalDashboard |
| Portfolio analysis | create_portfolio_dashboard, plot_portfolio_cumulative_returns, plot_monthly_returns_heatmap, plot_drawdown_underwater, plot_rolling_sharpe |
| Factor analysis | plot_factor_betas_bar, plot_rolling_betas, plot_return_attribution_waterfall |
| Reporting | combine_figures_to_html, generate_combined_report, export_figures_to_pdf |
For a package-layout overview, see the Architecture page.