ML4T Diagnostic Documentation
Feature validation, strategy diagnostics, and Deflated Sharpe Ratio

Feature Selection

Reduce a large set of candidate features to a focused, production-ready set using systematic filtering.

Use this page after you have already computed feature-outcome diagnostics such as IC, ML importance, and drift. The goal is to turn those diagnostics into a repeatable selection pipeline that you can defend and rerun.


Quick Start

I have feature-outcome analysis results (IC, importance, drift). I want to select the best features for my ML model.

from ml4t.diagnostic.selection import (
    FeatureSelector,
    FeatureOutcomeResult,
    FeatureICResults,
    FeatureImportanceResults,
)

# Build outcome results from your analysis
outcome = FeatureOutcomeResult(
    features=feature_names,
    ic_results=ic_results,           # dict[str, FeatureICResults]
    importance_results=imp_results,  # dict[str, FeatureImportanceResults]
    drift_results=drift_results,     # from analyze_drift() — optional
)

# Create selector with correlation matrix
selector = FeatureSelector(outcome, correlation_matrix=corr_df)

# Run pipeline
selector.run_pipeline([
    ("drift", {"threshold": 0.2, "method": "psi"}),
    ("ic", {"threshold": 0.02}),
    ("correlation", {"threshold": 0.8}),
    ("importance", {"threshold": 0.01, "method": "mdi", "top_k": 20}),
])

selected = selector.get_selected_features()
print(selector.get_selection_report().summary())

Pipeline Stages

The FeatureSelector provides four filtering methods. Apply them individually or chain them with run_pipeline().

IC Filtering

Keep features with absolute Information Coefficient above a threshold. IC measures the Spearman rank correlation between a feature and forward returns.

selector.filter_by_ic(
    threshold=0.02,   # minimum |IC|
    min_periods=20,   # minimum observations
    lag=5,            # specific forward lag (None = mean across lags)
)
| Parameter | Default | Description |
|---|---|---|
| `threshold` | required | Minimum absolute IC to keep |
| `min_periods` | `1` | Minimum observation count |
| `lag` | `None` | Specific lag to filter on (`None` = mean IC) |
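For a single period, the IC is simply the Spearman rank correlation between feature values and forward returns. A minimal standalone sketch (synthetic data, illustrative names — not the library's own IC computation):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
feature = rng.normal(size=500)
# Forward returns with a deliberately strong linear link, for illustration
fwd_returns = 0.2 * feature + rng.normal(size=500)

# One cross-sectional IC observation: rank correlation of feature vs. outcome
ic, p_value = spearmanr(feature, fwd_returns)
```

In practice an |IC| of 0.02-0.05 is already a meaningful signal in noisy financial data, which is why the default threshold above is small.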

Importance Filtering

Keep features by ML importance score. Supports MDI, permutation, and SHAP.

# Threshold-based
selector.filter_by_importance(threshold=0.01, method="mdi")

# Top-K (keeps the K most important, ignoring threshold)
selector.filter_by_importance(threshold=0, method="shap", top_k=20)
| Parameter | Default | Description |
|---|---|---|
| `threshold` | required | Minimum importance value |
| `method` | `"mdi"` | `"mdi"`, `"permutation"`, or `"shap"` |
| `top_k` | `None` | Keep only top K features |
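For orientation on what the threshold means: with scikit-learn tree ensembles, MDI importances (`feature_importances_`) are normalized to sum to 1, so `threshold=0.01` keeps features contributing at least 1% of the total impurity reduction. A sketch on synthetic data (sklearn assumed; not the library's own importance computation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 0.1 * rng.normal(size=300)  # only the first feature matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
imp = model.feature_importances_
# imp sums to 1.0, and feature 0 dominates
```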

Correlation Filtering

Remove redundant features. When two features exceed the correlation threshold, keep the one with higher IC (or importance, or alphabetical order).

selector.filter_by_correlation(
    threshold=0.8,
    keep_strategy="higher_ic",  # "higher_ic", "higher_importance", or "first"
)

The correlation matrix can be:

  • A Polars DataFrame with a "feature" index column
  • A Polars DataFrame where column names are feature names (e.g. from df.corr())

Drift Filtering

Remove features with unstable distributions. Requires drift_results in the outcome.

# PSI-based: removes features whose PSI alert level is "red" (PSI >= 0.2)
selector.filter_by_drift(method="psi")

# Consensus-based: removes features where majority of methods detect drift
selector.filter_by_drift(threshold=0.5, method="consensus")

Drift results come from analyze_drift():

from ml4t.diagnostic.evaluation.drift import analyze_drift

drift = analyze_drift(train_df.to_pandas(), test_df.to_pandas(), methods=["psi"])
outcome = FeatureOutcomeResult(features=names, drift_results=drift, ...)
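For context, PSI compares the binned distribution of a feature between a reference and a test sample: PSI = Σ (pᵢ − qᵢ) · ln(pᵢ / qᵢ). Values below 0.1 are conventionally read as stable, 0.1-0.2 as moderate drift, and ≥ 0.2 as significant drift — hence the red-alert cutoff above. A standalone sketch of the formula (not the library's implementation):

```python
import numpy as np

def psi(reference: np.ndarray, test: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index, binned on reference-sample quantiles."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full range
    p = np.histogram(reference, bins=edges)[0] / len(reference)
    q = np.histogram(test, bins=edges)[0] / len(test)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)  # avoid log(0)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
stable = psi(rng.normal(size=2000), rng.normal(size=2000))
shifted = psi(rng.normal(size=2000), rng.normal(loc=1.0, size=2000))
```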

Method Chaining

All filter methods return self, so you can chain them:

selected = (
    FeatureSelector(outcome, corr_matrix)
    .filter_by_drift(method="psi")
    .filter_by_ic(threshold=0.02)
    .filter_by_correlation(threshold=0.8)
    .filter_by_importance(threshold=0.05, method="mdi")
    .get_selected_features()
)

Selection Report

After filtering, generate a report showing each step:

report = selector.get_selection_report()
print(report.summary())

Output:

======================================================================
Feature Selection Report
======================================================================
Initial Features: 76
Final Features: 12
Removed: 64 (84.2%)

Selection Pipeline:
----------------------------------------------------------------------

Step 1: Drift Filtering: 76 → 74 (2 removed, 2.6%)
  Parameters: {'threshold': 0.2, 'method': 'psi'}
  Reasoning: Removed features with psi drift >= 0.2

Step 2: IC Filtering: 74 → 45 (29 removed, 39.2%)
  Parameters: {'threshold': 0.02, 'min_periods': 1, 'lag': None}
  Reasoning: Removed features with |IC| < 0.02

Step 3: Correlation Filtering: 45 → 32 (13 removed, 28.9%)
  Parameters: {'threshold': 0.8, 'keep_strategy': 'higher_ic'}
  Reasoning: Removed features with correlation > 0.8

Step 4: Importance Filtering (MDI): 32 → 12 (20 removed, 62.5%)
  Parameters: {'threshold': 0, 'method': 'mdi', 'top_k': 12}
  Reasoning: Kept top 12 features by mdi importance
======================================================================

Building FeatureOutcomeResult

The FeatureOutcomeResult aggregates IC, importance, and drift analysis into the interface that FeatureSelector consumes.

From manual analysis

from ml4t.diagnostic.selection.types import (
    FeatureICResults,
    FeatureImportanceResults,
    FeatureOutcomeResult,
)

# After computing IC cross-sectionally
ic_results = {}
for feature in features:
    ic_results[feature] = FeatureICResults(
        feature=feature,
        ic_mean=mean_ic,
        ic_std=std_ic,
        ic_ir=mean_ic / std_ic,
        t_stat=t_stat,
        p_value=p_val,
        ic_by_lag={1: ic_1d, 5: ic_5d, 21: ic_21d},
        n_observations=n_obs,
    )

# After fitting a tree model
importance_results = {}
for i, feature in enumerate(features):
    importance_results[feature] = FeatureImportanceResults(
        feature=feature,
        mdi_importance=model.feature_importances_[i],
        permutation_importance=perm_imp[i],
        permutation_std=perm_std[i],
    )

outcome = FeatureOutcomeResult(
    features=features,
    ic_results=ic_results,
    importance_results=importance_results,
)
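The placeholder quantities above (`mean_ic`, `std_ic`, `t_stat`, `p_val`, `perm_imp`) come from your own analysis. One common recipe computes a per-period Spearman IC series and t-tests its mean against zero — a sketch under that assumption, on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One IC per rebalance period: Spearman corr of feature vs. forward return
ics = []
for _ in range(60):  # 60 synthetic cross-sections of 200 assets
    f = rng.normal(size=200)
    r = 0.1 * f + rng.normal(size=200)
    ics.append(stats.spearmanr(f, r)[0])

ic_series = np.array(ics)
mean_ic = ic_series.mean()
std_ic = ic_series.std(ddof=1)
t_stat, p_val = stats.ttest_1samp(ic_series, 0.0)
```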

With drift detection

from ml4t.diagnostic.evaluation.drift import analyze_drift

drift = analyze_drift(
    reference=train_features.to_pandas(),
    test=test_features.to_pandas(),
    methods=["psi", "wasserstein"],
)

outcome = FeatureOutcomeResult(
    features=features,
    ic_results=ic_results,
    importance_results=importance_results,
    drift_results=drift,
)

Reset and Re-run

selector.reset()  # restores initial feature set, clears history

# Try a different pipeline
selector.run_pipeline([
    ("ic", {"threshold": 0.03}),
    ("importance", {"threshold": 0, "method": "shap", "top_k": 10}),
])

API Reference

FeatureSelector

FeatureSelector(
    outcome_results,
    correlation_matrix=None,
    initial_features=None,
)

Systematic feature selection with multiple filtering criteria.

Combines IC analysis, importance scoring, correlation filtering, and drift detection to select the most promising features for ML models.

Parameters

outcome_results : FeatureOutcomeResult
    Results from feature-outcome analysis (IC, importance, drift).
correlation_matrix : pl.DataFrame, optional
    Feature correlation matrix.
initial_features : list[str], optional
    Initial set of features to select from. If None, uses all features from outcome_results.

Attributes

selected_features : set[str]
    Current set of selected features (updated by filters).
removed_features : set[str]
    Features removed by filters.
selection_steps : list[SelectionStep]
    History of selection steps applied.

Source code in src/ml4t/diagnostic/selection/systematic.py
def __init__(
    self,
    outcome_results: FeatureOutcomeResult,
    correlation_matrix: pl.DataFrame | None = None,
    initial_features: list[str] | None = None,
):
    self.outcome_results = outcome_results
    self.correlation_matrix = correlation_matrix

    if initial_features is not None:
        self.initial_features = set(initial_features)
    else:
        self.initial_features = set(outcome_results.features)

    self.selected_features = self.initial_features.copy()
    self.removed_features: set[str] = set()
    self.selection_steps: list[SelectionStep] = []

filter_by_ic

filter_by_ic(threshold, min_periods=1, lag=None)

Filter features by Information Coefficient.

Keeps features with |IC| > threshold.

Parameters

threshold : float
    Minimum absolute IC value to keep a feature.
min_periods : int, default 1
    Minimum number of observations required.
lag : int | None, default None
    Specific forward lag to use. If None, uses mean IC.

Returns

self : FeatureSelector
    Returns self for method chaining.

Source code in src/ml4t/diagnostic/selection/systematic.py
def filter_by_ic(
    self,
    threshold: float,
    min_periods: int = 1,
    lag: int | None = None,
) -> FeatureSelector:
    """Filter features by Information Coefficient.

    Keeps features with |IC| > threshold.

    Parameters
    ----------
    threshold : float
        Minimum absolute IC value to keep a feature.
    min_periods : int, default 1
        Minimum number of observations required.
    lag : int | None, default None
        Specific forward lag to use. If None, uses mean IC.

    Returns
    -------
    self : FeatureSelector
        Returns self for method chaining.
    """
    features_before = len(self.selected_features)
    features_to_remove = []

    for feature in self.selected_features:
        if feature not in self.outcome_results.ic_results:
            continue

        ic_result = self.outcome_results.ic_results[feature]

        if ic_result.n_observations < min_periods:
            features_to_remove.append(feature)
            continue

        if lag is not None:
            if lag not in ic_result.ic_by_lag:
                features_to_remove.append(feature)
                continue
            ic_value = abs(ic_result.ic_by_lag[lag])
        else:
            ic_value = abs(ic_result.ic_mean)

        if ic_value < threshold:
            features_to_remove.append(feature)

    self.selected_features -= set(features_to_remove)
    self.removed_features |= set(features_to_remove)

    step = SelectionStep(
        step_name="IC Filtering",
        parameters={"threshold": threshold, "min_periods": min_periods, "lag": lag},
        features_before=features_before,
        features_after=len(self.selected_features),
        features_removed=features_to_remove,
        features_kept=list(self.selected_features),
        reasoning=f"Removed features with |IC| < {threshold}",
    )
    self.selection_steps.append(step)

    return self

filter_by_importance

filter_by_importance(threshold, method='mdi', top_k=None)

Filter features by ML importance scores.

Parameters

threshold : float
    Minimum importance value to keep a feature.
method : {"mdi", "permutation", "shap"}, default "mdi"
    Importance method to use.
top_k : int | None, default None
    If provided, keeps only the top K most important features.

Returns

self : FeatureSelector
    Returns self for method chaining.

Source code in src/ml4t/diagnostic/selection/systematic.py
def filter_by_importance(
    self,
    threshold: float,
    method: Literal["mdi", "permutation", "shap"] = "mdi",
    top_k: int | None = None,
) -> FeatureSelector:
    """Filter features by ML importance scores.

    Parameters
    ----------
    threshold : float
        Minimum importance value to keep a feature.
    method : {"mdi", "permutation", "shap"}, default "mdi"
        Importance method to use.
    top_k : int | None, default None
        If provided, keeps only the top K most important features.

    Returns
    -------
    self : FeatureSelector
        Returns self for method chaining.
    """
    features_before = len(self.selected_features)

    feature_importance = []
    for feature in self.selected_features:
        if feature not in self.outcome_results.importance_results:
            continue

        imp_result = self.outcome_results.importance_results[feature]

        if method == "mdi":
            importance = imp_result.mdi_importance
        elif method == "permutation":
            importance = imp_result.permutation_importance
        elif method == "shap":
            if imp_result.shap_mean is None:
                continue
            importance = imp_result.shap_mean
        else:
            raise ValueError(
                f"Unknown importance method: {method}. Choose from 'mdi', 'permutation', 'shap'"
            )

        feature_importance.append((feature, importance))

    feature_importance.sort(key=lambda x: x[1], reverse=True)

    if top_k is not None:
        features_to_keep = [f for f, _ in feature_importance[:top_k]]
        reasoning = f"Kept top {top_k} features by {method} importance"
    else:
        features_to_keep = [f for f, imp in feature_importance if imp >= threshold]
        reasoning = f"Removed features with {method} importance < {threshold}"

    features_to_remove = [f for f in self.selected_features if f not in features_to_keep]
    self.selected_features = set(features_to_keep)
    self.removed_features |= set(features_to_remove)

    step = SelectionStep(
        step_name=f"Importance Filtering ({method.upper()})",
        parameters={"threshold": threshold, "method": method, "top_k": top_k},
        features_before=features_before,
        features_after=len(self.selected_features),
        features_removed=features_to_remove,
        features_kept=list(self.selected_features),
        reasoning=reasoning,
    )
    self.selection_steps.append(step)

    return self

filter_by_correlation

filter_by_correlation(threshold, keep_strategy='higher_ic')

Remove highly correlated features to reduce redundancy.

When two features have correlation > threshold, keeps one based on the keep_strategy.

Parameters

threshold : float
    Maximum absolute correlation allowed between features.
keep_strategy : {"higher_ic", "higher_importance", "first"}, default "higher_ic"
    Strategy for choosing which feature to keep.

Returns

self : FeatureSelector
    Returns self for method chaining.

Raises

ValueError
    If correlation_matrix was not provided.

Source code in src/ml4t/diagnostic/selection/systematic.py
def filter_by_correlation(
    self,
    threshold: float,
    keep_strategy: Literal["higher_ic", "higher_importance", "first"] = "higher_ic",
) -> FeatureSelector:
    """Remove highly correlated features to reduce redundancy.

    When two features have correlation > threshold, keeps one based on
    the keep_strategy.

    Parameters
    ----------
    threshold : float
        Maximum absolute correlation allowed between features.
    keep_strategy : {"higher_ic", "higher_importance", "first"}, default "higher_ic"
        Strategy for choosing which feature to keep.

    Returns
    -------
    self : FeatureSelector
        Returns self for method chaining.

    Raises
    ------
    ValueError
        If correlation_matrix was not provided.
    """
    if self.correlation_matrix is None:
        raise ValueError(
            "Correlation matrix required for correlation filtering. "
            "Provide correlation_matrix during FeatureSelector initialization."
        )

    features_before = len(self.selected_features)
    features_to_remove: set[str] = set()

    # Build correlation lookup dict from Polars DataFrame
    corr_matrix = self.correlation_matrix
    if "feature" in corr_matrix.columns:
        feature_names = corr_matrix["feature"].to_list()
        value_columns = [c for c in corr_matrix.columns if c != "feature"]
    else:
        feature_names = value_columns = list(corr_matrix.columns)

    # Build {(feat1, feat2): correlation} lookup for O(1) access
    corr_lookup: dict[tuple[str, str], float] = {}
    feature_set = set(feature_names)
    for row_idx, row_name in enumerate(feature_names):
        row_data = corr_matrix.row(row_idx)
        col_offset = 1 if "feature" in corr_matrix.columns else 0
        for col_idx, col_name in enumerate(value_columns):
            corr_lookup[(row_name, col_name)] = row_data[col_idx + col_offset]

    selected_list = sorted(self.selected_features)
    selected_list = [f for f in selected_list if f in feature_set]

    if len(selected_list) < 2:
        step = SelectionStep(
            step_name="Correlation Filtering",
            parameters={"threshold": threshold, "keep_strategy": keep_strategy},
            features_before=features_before,
            features_after=features_before,
            features_removed=[],
            features_kept=list(self.selected_features),
            reasoning="Insufficient features for correlation filtering",
        )
        self.selection_steps.append(step)
        return self

    for i, feat1 in enumerate(selected_list):
        if feat1 in features_to_remove:
            continue

        for feat2 in selected_list[i + 1 :]:
            if feat2 in features_to_remove:
                continue

            corr_value = abs(corr_lookup[(feat1, feat2)])

            if corr_value > threshold:
                if keep_strategy == "higher_ic":
                    ic_results = self.outcome_results.ic_results
                    ic1 = abs(ic_results[feat1].ic_mean) if feat1 in ic_results else 0.0
                    ic2 = abs(ic_results[feat2].ic_mean) if feat2 in ic_results else 0.0
                    to_remove = feat2 if ic1 > ic2 else feat1

                elif keep_strategy == "higher_importance":
                    imp_results = self.outcome_results.importance_results
                    imp1 = imp_results[feat1].mdi_importance if feat1 in imp_results else 0.0
                    imp2 = imp_results[feat2].mdi_importance if feat2 in imp_results else 0.0
                    to_remove = feat2 if imp1 > imp2 else feat1

                else:  # "first"
                    to_remove = feat2

                features_to_remove.add(to_remove)

    self.selected_features -= features_to_remove
    self.removed_features |= features_to_remove

    step = SelectionStep(
        step_name="Correlation Filtering",
        parameters={"threshold": threshold, "keep_strategy": keep_strategy},
        features_before=features_before,
        features_after=len(self.selected_features),
        features_removed=list(features_to_remove),
        features_kept=list(self.selected_features),
        reasoning=(
            f"Removed features with correlation > {threshold} using {keep_strategy} strategy"
        ),
    )
    self.selection_steps.append(step)

    return self

filter_by_drift

filter_by_drift(threshold=0.2, method='psi')

Remove features with unstable distributions (drift).

Parameters

threshold : float, default 0.2
    Drift threshold. For PSI: >= 0.2 indicates significant drift. For consensus: drift_probability >= threshold.
method : {"psi", "consensus"}, default "psi"
    Drift detection method.

Returns

self : FeatureSelector
    Returns self for method chaining.

Raises

ValueError
    If drift_results not available in outcome_results.

Source code in src/ml4t/diagnostic/selection/systematic.py
def filter_by_drift(
    self,
    threshold: float = 0.2,
    method: Literal["psi", "consensus"] = "psi",
) -> FeatureSelector:
    """Remove features with unstable distributions (drift).

    Parameters
    ----------
    threshold : float, default 0.2
        Drift threshold. For PSI: >= 0.2 indicates significant drift.
        For consensus: drift_probability >= threshold.
    method : {"psi", "consensus"}, default "psi"
        Drift detection method.

    Returns
    -------
    self : FeatureSelector
        Returns self for method chaining.

    Raises
    ------
    ValueError
        If drift_results not available in outcome_results.
    """
    if self.outcome_results.drift_results is None:
        raise ValueError(
            "Drift results not available. Run outcome analysis with drift_detection=True."
        )

    features_before = len(self.selected_features)
    features_to_remove = []

    drift_results = self.outcome_results.drift_results

    for feature_result in drift_results.feature_results:
        feature = feature_result.feature

        if feature not in self.selected_features:
            continue

        if method == "psi":
            if (
                feature_result.psi_result is not None
                and feature_result.psi_result.alert_level == "red"
            ):
                features_to_remove.append(feature)

        elif method == "consensus":
            if feature_result.drift_probability >= threshold:
                features_to_remove.append(feature)

        else:
            raise ValueError(f"Unknown drift method: {method}. Choose from 'psi', 'consensus'")

    self.selected_features -= set(features_to_remove)
    self.removed_features |= set(features_to_remove)

    step = SelectionStep(
        step_name="Drift Filtering",
        parameters={"threshold": threshold, "method": method},
        features_before=features_before,
        features_after=len(self.selected_features),
        features_removed=features_to_remove,
        features_kept=list(self.selected_features),
        reasoning=f"Removed features with {method} drift >= {threshold}",
    )
    self.selection_steps.append(step)

    return self

run_pipeline

run_pipeline(steps)

Execute multiple selection filters in sequence.

Parameters

steps : list[tuple[str, dict]]
    List of (filter_name, parameters) tuples. Valid filter names: "ic", "importance", "correlation", "drift".

Returns

self : FeatureSelector
    Returns self for method chaining.

Source code in src/ml4t/diagnostic/selection/systematic.py
def run_pipeline(
    self,
    steps: list[tuple[str, dict[str, Any]]],
) -> FeatureSelector:
    """Execute multiple selection filters in sequence.

    Parameters
    ----------
    steps : list[tuple[str, dict]]
        List of (filter_name, parameters) tuples.
        Valid filter names: "ic", "importance", "correlation", "drift".

    Returns
    -------
    self : FeatureSelector
        Returns self for method chaining.
    """
    for filter_name, params in steps:
        if filter_name == "ic":
            self.filter_by_ic(**params)
        elif filter_name == "importance":
            self.filter_by_importance(**params)
        elif filter_name == "correlation":
            self.filter_by_correlation(**params)
        elif filter_name == "drift":
            self.filter_by_drift(**params)
        else:
            raise ValueError(
                f"Unknown filter: {filter_name}. "
                "Valid filters: ic, importance, correlation, drift"
            )

    return self

get_selected_features

get_selected_features()

Get current list of selected features (sorted).

Source code in src/ml4t/diagnostic/selection/systematic.py
def get_selected_features(self) -> list[str]:
    """Get current list of selected features (sorted)."""
    return sorted(self.selected_features)

get_removed_features

get_removed_features()

Get list of features that were removed (sorted).

Source code in src/ml4t/diagnostic/selection/systematic.py
def get_removed_features(self) -> list[str]:
    """Get list of features that were removed (sorted)."""
    return sorted(self.removed_features)

get_selection_report

get_selection_report()

Generate comprehensive selection report.

Source code in src/ml4t/diagnostic/selection/systematic.py
def get_selection_report(self) -> SelectionReport:
    """Generate comprehensive selection report."""
    return SelectionReport(
        initial_features=sorted(self.initial_features),
        final_features=self.get_selected_features(),
        steps=self.selection_steps,
    )

reset

reset()

Reset selector to initial feature set.

Source code in src/ml4t/diagnostic/selection/systematic.py
def reset(self) -> FeatureSelector:
    """Reset selector to initial feature set."""
    self.selected_features = self.initial_features.copy()
    self.removed_features = set()
    self.selection_steps = []
    return self

FeatureOutcomeResult dataclass

FeatureOutcomeResult(
    features,
    ic_results=dict(),
    importance_results=dict(),
    drift_results=None,
)

Aggregated feature-outcome analysis results.

Combines IC analysis, importance scoring, and drift detection for a set of features. This is the primary input to FeatureSelector.

Attributes:

features : list[str]
    List of feature names analyzed.
ic_results : dict[str, FeatureICResults]
    IC analysis per feature (keyed by feature name).
importance_results : dict[str, FeatureImportanceResults]
    Importance results per feature (keyed by feature name).
drift_results : DriftSummaryResult | None
    Optional drift detection results.

FeatureICResults dataclass

FeatureICResults(
    feature,
    ic_mean,
    ic_std,
    ic_ir,
    t_stat,
    p_value,
    ic_by_lag,
    n_observations,
)

IC analysis results for a single feature.

Attributes:

feature : str
    Feature name.
ic_mean : float
    Mean information coefficient across periods.
ic_std : float
    Standard deviation of IC.
ic_ir : float
    Information Ratio (ic_mean / ic_std).
t_stat : float
    T-statistic for IC significance.
p_value : float
    P-value for IC significance.
ic_by_lag : dict[int, float]
    IC values at specific forward lags.
n_observations : int
    Number of observations used.

FeatureImportanceResults dataclass

FeatureImportanceResults(
    feature,
    mdi_importance,
    permutation_importance,
    permutation_std,
    shap_mean=None,
    shap_std=None,
    rank_mdi=0,
    rank_permutation=0,
)

Feature importance results from ML models.

Attributes:

feature : str
    Feature name.
mdi_importance : float
    Mean Decrease in Impurity importance.
permutation_importance : float
    Permutation importance.
permutation_std : float
    Standard deviation of permutation importance.
shap_mean : float | None
    Mean absolute SHAP value (None if not computed).
shap_std : float | None
    Standard deviation of SHAP values (None if not computed).
rank_mdi : int
    Rank by MDI importance (1 = most important).
rank_permutation : int
    Rank by permutation importance.


See It In The Book

The FeatureSelector pipeline is demonstrated in the book at multiple levels:

  • Teaching demo: code/08_feature_engineering/05_feature_selection.py — builds FeatureOutcomeResult from scratch and runs the full IC → correlation → importance pipeline.

  • Production usage: Each case study evaluation notebook includes a "Library Convenience Functions" section comparing FeatureSelector output to the manual triage logic:

| Case Study | Notebook | IC Threshold | Entity |
|---|---|---|---|
| CME Futures | cme_futures/code/05_evaluation.py | 0.008 | symbol |
| ETFs | etfs/code/05_evaluation.py | 0.01 | symbol |
| US Equities | us_equities_panel/code/05_evaluation.py | 0.003 | symbol |
| US Firm Chars | us_firm_characteristics/code/05_evaluation.py | 0.01 | stock_id |
| Crypto Perps | crypto_perps_funding/code/05_evaluation.py | 0.005 | symbol |
| FX Pairs | fx_pairs/code/05_evaluation.py | 0.005 | symbol |
| Nasdaq100 | nasdaq100_microstructure/code/05_evaluation.py | 0.003 | symbol |

For the broader chapter and case-study map, see the Book Guide.


Next Steps

  • Feature Diagnostics - Generate the IC, distribution, and robustness inputs used here
  • Statistical Tests - Check significance and multiple-testing corrections before promoting features
  • Workflows - Place feature triage inside a full research pipeline
  • Book Guide - Find the matching notebook and case-study implementations