Feature Selection¶
Reduce a large set of candidate features to a focused, production-ready set using systematic filtering.
Use this page after you have already computed feature-outcome diagnostics such as IC, ML importance, and drift. The goal is to turn those diagnostics into a repeatable selection pipeline that you can defend and rerun.
Quick Start¶
I have feature-outcome analysis results (IC, importance, drift). I want to select the best features for my ML model.
```python
from ml4t.diagnostic.selection import (
    FeatureSelector,
    FeatureOutcomeResult,
    FeatureICResults,
    FeatureImportanceResults,
)

# Build outcome results from your analysis
outcome = FeatureOutcomeResult(
    features=feature_names,
    ic_results=ic_results,            # dict[str, FeatureICResults]
    importance_results=imp_results,   # dict[str, FeatureImportanceResults]
    drift_results=drift_results,      # from analyze_drift() — optional
)

# Create selector with correlation matrix
selector = FeatureSelector(outcome, correlation_matrix=corr_df)

# Run pipeline
selector.run_pipeline([
    ("drift", {"threshold": 0.2, "method": "psi"}),
    ("ic", {"threshold": 0.02}),
    ("correlation", {"threshold": 0.8}),
    ("importance", {"threshold": 0.01, "method": "mdi", "top_k": 20}),
])

selected = selector.get_selected_features()
print(selector.get_selection_report().summary())
```
Pipeline Stages¶
The `FeatureSelector` provides four filtering methods. Apply them individually
or chain them with `run_pipeline()`.
IC Filtering¶
Keep features with absolute Information Coefficient above a threshold. IC measures the Spearman rank correlation between a feature and forward returns.
```python
selector.filter_by_ic(
    threshold=0.02,   # minimum |IC|
    min_periods=20,   # minimum observations
    lag=5,            # specific forward lag (None = mean across lags)
)
```
| Parameter | Default | Description |
|---|---|---|
| `threshold` | — | Minimum absolute IC to keep |
| `min_periods` | `1` | Minimum observation count |
| `lag` | `None` | Specific lag to filter on (`None` = mean IC) |
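Here, IC is the per-period Spearman rank correlation between feature values and forward returns. As a concrete illustration, a minimal sketch of computing a per-date IC series (the frame layout and column names are assumptions for illustration, not the library's API):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def daily_ic(df: pd.DataFrame, feature_col: str, return_col: str) -> np.ndarray:
    """Per-date Spearman rank IC between a feature and forward returns.

    Assumes a long-format frame with a 'date' column; names are illustrative.
    """
    ics = []
    for _, day in df.groupby("date"):
        if len(day) >= 3:  # need a few cross-sectional observations
            ic, _ = spearmanr(day[feature_col], day[return_col])
            ics.append(ic)
    return np.array(ics)
```

`filter_by_ic` then keeps features whose mean absolute IC across such a series clears the threshold.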
Importance Filtering¶
Keep features by ML importance score. Supports MDI, permutation, and SHAP.
```python
# Threshold-based
selector.filter_by_importance(threshold=0.01, method="mdi")

# Top-K (keeps the K most important, ignoring threshold)
selector.filter_by_importance(threshold=0, method="shap", top_k=20)
```
| Parameter | Default | Description |
|---|---|---|
| `threshold` | — | Minimum importance value |
| `method` | `"mdi"` | `"mdi"`, `"permutation"`, or `"shap"` |
| `top_k` | `None` | Keep only top K features |
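The importance scores these filters consume typically come from a fitted tree ensemble. A hedged sketch of producing MDI and permutation scores with scikit-learn (the model choice, data sizes, and helper name are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def compute_importances(X, y, feature_names, seed=0):
    """Fit a small forest and return per-feature MDI and permutation scores."""
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X, y)
    perm = permutation_importance(model, X, y, n_repeats=5, random_state=seed)
    return {
        name: {
            "mdi": float(model.feature_importances_[i]),
            "permutation": float(perm.importances_mean[i]),
            "permutation_std": float(perm.importances_std[i]),
        }
        for i, name in enumerate(feature_names)
    }
```

The resulting per-feature scores map onto the `mdi_importance`, `permutation_importance`, and `permutation_std` fields of `FeatureImportanceResults`.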
Correlation Filtering¶
Remove redundant features. When two features exceed the correlation threshold, keep the one with higher IC (or importance, or alphabetical order).
```python
selector.filter_by_correlation(
    threshold=0.8,
    keep_strategy="higher_ic",  # "higher_ic", "higher_importance", or "first"
)
```
The correlation matrix can be:
- A Polars DataFrame with a `"feature"` index column
- A Polars DataFrame where column names are feature names (e.g. from `df.corr()`)
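Conceptually, correlation filtering is a greedy pass over pairs: whenever two surviving features correlate above the threshold, the one with the lower score (e.g. |IC|) is dropped. A small illustration of that idea, not the library's implementation:

```python
import pandas as pd

def greedy_correlation_filter(corr: pd.DataFrame, scores: dict, threshold=0.8):
    """Drop the lower-scoring feature of each pair whose |corr| > threshold."""
    dropped = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(corr.loc[a, b]) > threshold:
                # keep the feature with the higher score (e.g. |IC|)
                dropped.add(a if scores[a] < scores[b] else b)
    return [f for f in corr.columns if f not in dropped]
```

With `keep_strategy="higher_ic"`, the score for each feature would be its mean absolute IC.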
Drift Filtering¶
Remove features with unstable distributions. Requires `drift_results` in the outcome.
```python
# PSI-based: removes features with a red alert (PSI >= 0.2)
selector.filter_by_drift(method="psi")

# Consensus-based: removes features where a majority of methods detect drift
selector.filter_by_drift(threshold=0.5, method="consensus")
```
Drift results come from `analyze_drift()`:

```python
from ml4t.diagnostic.evaluation.drift import analyze_drift

drift = analyze_drift(train_df.to_pandas(), test_df.to_pandas(), methods=["psi"])
outcome = FeatureOutcomeResult(features=names, drift_results=drift, ...)
```
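For reference, PSI compares binned frequencies between a reference and a test sample; values of roughly 0.2 or more are conventionally read as significant drift. A minimal sketch of the statistic (the bin count and smoothing epsilon are illustrative choices, not the library's internals):

```python
import numpy as np

def psi(reference, test, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range test values
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    test_frac = np.histogram(test, bins=edges)[0] / len(test) + eps
    return float(np.sum((test_frac - ref_frac) * np.log(test_frac / ref_frac)))
```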
Method Chaining¶
All filter methods return `self`, so you can chain them:

```python
selected = (
    FeatureSelector(outcome, corr_matrix)
    .filter_by_drift(method="psi")
    .filter_by_ic(threshold=0.02)
    .filter_by_correlation(threshold=0.8)
    .filter_by_importance(threshold=0.05, method="mdi")
    .get_selected_features()
)
```
Selection Report¶
After filtering, `selector.get_selection_report().summary()` produces a report showing each step:

```
======================================================================
Feature Selection Report
======================================================================
Initial Features: 76
Final Features: 12
Removed: 64 (84.2%)

Selection Pipeline:
----------------------------------------------------------------------
Step 1: Drift Filtering: 76 → 74 (2 removed, 2.6%)
  Parameters: {'threshold': 0.2, 'method': 'psi'}
  Reasoning: Removed features with psi drift >= 0.2
Step 2: IC Filtering: 74 → 45 (29 removed, 39.2%)
  Parameters: {'threshold': 0.02, 'min_periods': 1, 'lag': None}
  Reasoning: Removed features with |IC| < 0.02
Step 3: Correlation Filtering: 45 → 32 (13 removed, 28.9%)
  Parameters: {'threshold': 0.8, 'keep_strategy': 'higher_ic'}
  Reasoning: Removed features with correlation > 0.8
Step 4: Importance Filtering (MDI): 32 → 12 (20 removed, 62.5%)
  Parameters: {'threshold': 0, 'method': 'mdi', 'top_k': 12}
  Reasoning: Kept top 12 features by mdi importance
======================================================================
```
Building FeatureOutcomeResult¶
The `FeatureOutcomeResult` aggregates IC, importance, and drift analysis
into the interface that `FeatureSelector` consumes.
From manual analysis¶
```python
from ml4t.diagnostic.selection.types import (
    FeatureICResults,
    FeatureImportanceResults,
    FeatureOutcomeResult,
)

# After computing IC cross-sectionally
ic_results = {}
for feature in features:
    ic_results[feature] = FeatureICResults(
        feature=feature,
        ic_mean=mean_ic,
        ic_std=std_ic,
        ic_ir=mean_ic / std_ic,
        t_stat=t_stat,
        p_value=p_val,
        ic_by_lag={1: ic_1d, 5: ic_5d, 21: ic_21d},
        n_observations=n_obs,
    )

# After fitting a tree model
importance_results = {}
for i, feature in enumerate(features):
    importance_results[feature] = FeatureImportanceResults(
        feature=feature,
        mdi_importance=model.feature_importances_[i],
        permutation_importance=perm_imp[i],
        permutation_std=perm_std[i],
    )

outcome = FeatureOutcomeResult(
    features=features,
    ic_results=ic_results,
    importance_results=importance_results,
)
```
With drift detection¶
```python
from ml4t.diagnostic.evaluation.drift import analyze_drift

drift = analyze_drift(
    reference=train_features.to_pandas(),
    test=test_features.to_pandas(),
    methods=["psi", "wasserstein"],
)

outcome = FeatureOutcomeResult(
    features=features,
    ic_results=ic_results,
    importance_results=importance_results,
    drift_results=drift,
)
```
Reset and Re-run¶
```python
selector.reset()  # restores initial feature set, clears history

# Try a different pipeline
selector.run_pipeline([
    ("ic", {"threshold": 0.03}),
    ("importance", {"threshold": 0, "method": "shap", "top_k": 10}),
])
```
API Reference¶
FeatureSelector¶
Systematic feature selection with multiple filtering criteria.
Combines IC analysis, importance scoring, correlation filtering, and drift detection to select the most promising features for ML models.
Parameters¶
- `outcome_results` (`FeatureOutcomeResult`): Results from feature-outcome analysis (IC, importance, drift).
- `correlation_matrix` (`pl.DataFrame`, optional): Feature correlation matrix.
- `initial_features` (`list[str]`, optional): Initial set of features to select from. If `None`, uses all features from `outcome_results`.
Attributes¶
- `selected_features` (`set[str]`): Current set of selected features (updated by filters).
- `removed_features` (`set[str]`): Features removed by filters.
- `selection_steps` (`list[SelectionStep]`): History of selection steps applied.
Source code in src/ml4t/diagnostic/selection/systematic.py
filter_by_ic¶
Filter features by Information Coefficient.
Keeps features with |IC| > threshold.
Parameters¶
- `threshold` (`float`): Minimum absolute IC value to keep a feature.
- `min_periods` (`int`, default `1`): Minimum number of observations required.
- `lag` (`int | None`, default `None`): Specific forward lag to use. If `None`, uses mean IC.
Returns¶
- `self` (`FeatureSelector`): Returns `self` for method chaining.
filter_by_importance¶
Filter features by ML importance scores.
Parameters¶
- `threshold` (`float`): Minimum importance value to keep a feature.
- `method` (`{"mdi", "permutation", "shap"}`, default `"mdi"`): Importance method to use.
- `top_k` (`int | None`, default `None`): If provided, keeps only the top K most important features.
Returns¶
- `self` (`FeatureSelector`): Returns `self` for method chaining.
filter_by_correlation¶
Remove highly correlated features to reduce redundancy.
When two features have correlation > threshold, keeps one based on the keep_strategy.
Parameters¶
- `threshold` (`float`): Maximum absolute correlation allowed between features.
- `keep_strategy` (`{"higher_ic", "higher_importance", "first"}`, default `"higher_ic"`): Strategy for choosing which feature to keep.
Returns¶
- `self` (`FeatureSelector`): Returns `self` for method chaining.
Raises¶
- `ValueError`: Raised if `correlation_matrix` was not provided.
filter_by_drift¶
Remove features with unstable distributions (drift).
Parameters¶
- `threshold` (`float`, default `0.2`): Drift threshold. For PSI, `>= 0.2` indicates significant drift; for consensus, features with `drift_probability >= threshold` are removed.
- `method` (`{"psi", "consensus"}`, default `"psi"`): Drift detection method.
Returns¶
- `self` (`FeatureSelector`): Returns `self` for method chaining.
Raises¶
- `ValueError`: Raised if `drift_results` are not available in `outcome_results`.
run_pipeline¶
Execute multiple selection filters in sequence.
Parameters¶
- `steps` (`list[tuple[str, dict]]`): List of `(filter_name, parameters)` tuples. Valid filter names: `"ic"`, `"importance"`, `"correlation"`, `"drift"`.
Returns¶
- `self` (`FeatureSelector`): Returns `self` for method chaining.
get_selected_features¶
Return the current set of selected features.
get_removed_features¶
Return the features removed by filters.
get_selection_report¶
Generate comprehensive selection report.
reset¶
Reset selector to initial feature set.
FeatureOutcomeResult dataclass¶
Aggregated feature-outcome analysis results.
Combines IC analysis, importance scoring, and drift detection for a set of features. This is the primary input to FeatureSelector.
Attributes:
| Name | Type | Description |
|---|---|---|
| `features` | `list[str]` | List of feature names analyzed |
| `ic_results` | `dict[str, FeatureICResults]` | IC analysis per feature (keyed by feature name) |
| `importance_results` | `dict[str, FeatureImportanceResults]` | Importance results per feature (keyed by feature name) |
| `drift_results` | `DriftSummaryResult \| None` | Optional drift detection results |
FeatureICResults dataclass¶
IC analysis results for a single feature.
Attributes:
| Name | Type | Description |
|---|---|---|
| `feature` | `str` | Feature name |
| `ic_mean` | `float` | Mean information coefficient across periods |
| `ic_std` | `float` | Standard deviation of IC |
| `ic_ir` | `float` | Information ratio (`ic_mean / ic_std`) |
| `t_stat` | `float` | T-statistic for IC significance |
| `p_value` | `float` | P-value for IC significance |
| `ic_by_lag` | `dict[int, float]` | IC values at specific forward lags |
| `n_observations` | `int` | Number of observations used |
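The summary fields above follow from simple statistics on a per-period IC series. A hedged sketch of how they could be derived (a one-sample t-test against zero is a common choice; the library's exact computation may differ):

```python
import numpy as np
from scipy import stats

def summarize_ic(ic_series):
    """Mean, std, IR, and a one-sample t-test against zero for a 1-D IC series."""
    ics = np.asarray(ic_series, dtype=float)
    ic_mean = ics.mean()
    ic_std = ics.std(ddof=1)
    t_stat, p_value = stats.ttest_1samp(ics, 0.0)
    return {
        "ic_mean": float(ic_mean),
        "ic_std": float(ic_std),
        "ic_ir": float(ic_mean / ic_std),
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "n_observations": len(ics),
    }
```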
FeatureImportanceResults dataclass¶

```python
FeatureImportanceResults(
    feature,
    mdi_importance,
    permutation_importance,
    permutation_std,
    shap_mean=None,
    shap_std=None,
    rank_mdi=0,
    rank_permutation=0,
)
```
Feature importance results from ML models.
Attributes:
| Name | Type | Description |
|---|---|---|
| `feature` | `str` | Feature name |
| `mdi_importance` | `float` | Mean Decrease in Impurity importance |
| `permutation_importance` | `float` | Permutation importance |
| `permutation_std` | `float` | Standard deviation of permutation importance |
| `shap_mean` | `float \| None` | Mean absolute SHAP value (`None` if not computed) |
| `shap_std` | `float \| None` | Standard deviation of SHAP values (`None` if not computed) |
| `rank_mdi` | `int` | Rank by MDI importance (1 = most important) |
| `rank_permutation` | `int` | Rank by permutation importance |
See It In The Book¶
The `FeatureSelector` pipeline is demonstrated in the book at multiple levels:
- **Teaching demo**: `code/08_feature_engineering/05_feature_selection.py` builds `FeatureOutcomeResult` from scratch and runs the full IC → correlation → importance pipeline.
- **Production usage**: Each case study evaluation notebook includes a "Library Convenience Functions" section comparing `FeatureSelector` output to the manual triage logic:
| Case Study | Notebook | IC Threshold | Entity |
|---|---|---|---|
| CME Futures | `cme_futures/code/05_evaluation.py` | 0.008 | `symbol` |
| ETFs | `etfs/code/05_evaluation.py` | 0.01 | `symbol` |
| US Equities | `us_equities_panel/code/05_evaluation.py` | 0.003 | `symbol` |
| US Firm Chars | `us_firm_characteristics/code/05_evaluation.py` | 0.01 | `stock_id` |
| Crypto Perps | `crypto_perps_funding/code/05_evaluation.py` | 0.005 | `symbol` |
| FX Pairs | `fx_pairs/code/05_evaluation.py` | 0.005 | `symbol` |
| Nasdaq100 | `nasdaq100_microstructure/code/05_evaluation.py` | 0.003 | `symbol` |
For the broader chapter and case-study map, see the Book Guide.
Next Steps¶
- Feature Diagnostics - Generate the IC, distribution, and robustness inputs used here
- Statistical Tests - Check significance and multiple-testing corrections before promoting features
- Workflows - Place feature triage inside a full research pipeline
- Book Guide - Find the matching notebook and case-study implementations