Preprocessing¶

ML4T Engineer provides sklearn-compatible scalers built on Polars for leakage-safe feature preprocessing.

Use this page when your engineered features are not yet on model-friendly scales or when you need train-only transforms without leakage.

Scalers¶

All scalers follow the sklearn pattern: fit() on training data, transform() on any data. This prevents information leakage from test data into training.

StandardScaler¶

Z-score normalization: output has mean=0, std=1.

from ml4t.engineer.preprocessing import StandardScaler

scaler = StandardScaler(
    columns=None,          # None = all numeric columns
    with_mean=True,        # Center to zero mean
    with_std=True,         # Scale to unit variance
    ddof=1,                # Delta degrees of freedom
)

# Fit on training data
train_scaled = scaler.fit_transform(train_df)

# Transform test data using training statistics
test_scaled = scaler.transform(test_df)

Best for: Approximately Gaussian features. Default choice for most ML models.

MinMaxScaler¶

Scale features to a bounded range (default [0, 1]).

from ml4t.engineer.preprocessing import MinMaxScaler

scaler = MinMaxScaler(
    columns=None,
    feature_range=(0.0, 1.0),   # Target range
)

train_scaled = scaler.fit_transform(train_df)
test_scaled = scaler.transform(test_df)

Best for: Neural networks expecting bounded input, or when preserving zero values matters.

RobustScaler¶

IQR-based scaling that's resistant to outliers.

from ml4t.engineer.preprocessing import RobustScaler

scaler = RobustScaler(
    columns=None,
    with_centering=True,            # Subtract median
    with_scaling=True,              # Scale by IQR
    quantile_range=(25.0, 75.0),    # IQR range
)

train_scaled = scaler.fit_transform(train_df)
test_scaled = scaler.transform(test_df)

Best for: Financial data with fat tails, outliers, or extreme values.

When to Use Each Scaler¶

Scaler	Use When	Sensitive To
`StandardScaler`	Data is approximately Gaussian	Outliers
`MinMaxScaler`	Need bounded 0-1 range	Outliers
`RobustScaler`	Data has outliers or fat tails	Nothing (robust)

For financial data, RobustScaler is generally the safest default due to fat-tailed return distributions.

Leakage Prevention¶

The critical rule: fit on training data only, transform everything.

# CORRECT: fit on train, transform both
scaler = StandardScaler()
X_train = scaler.fit_transform(train_df)
X_test = scaler.transform(test_df)     # Uses train statistics

# WRONG: fitting on all data leaks test information
scaler = StandardScaler()
X_all = scaler.fit_transform(all_data)  # Leaks test statistics!

Scaler State¶

After fitting, inspect the learned statistics:

scaler.is_fitted           # True after fit/fit_transform
scaler.fitted_columns      # ["rsi", "macd", "atr", ...]
scaler.statistics          # {"rsi": {"mean": 52.3, "std": 15.1}, ...}

Serialization¶

Save and reload fitted scalers:

# Save
state = scaler.to_dict()

# Reload
scaler = StandardScaler.from_dict(state)

Cloning¶

Create an unfitted copy with the same parameters:

new_scaler = scaler.clone()  # Same params, unfitted

PreprocessingPipeline¶

For multi-step preprocessing, chain transforms:

from ml4t.engineer.preprocessing import PreprocessingPipeline

pipeline = PreprocessingPipeline.from_recommendations({
    "rsi_14": {"transform": "standardize", "confidence": 0.9},
    "volume": {"transform": "log", "confidence": 0.8},
    "returns": {"transform": "winsorize", "confidence": 0.85},
})

train_transformed = pipeline.fit_transform(train_df)
test_transformed = pipeline.transform(test_df)

Available Transform Types¶

Transform	Description
`NONE`	No transformation
`LOG`	Log transform (for skewed data)
`SQRT`	Square root transform
`STANDARDIZE`	Z-score normalization
`NORMALIZE`	Min-max scaling
`WINSORIZE`	Clip extreme values
`DIFF`	First difference

Integration with MLDatasetBuilder¶

The preprocessing module integrates with MLDatasetBuilder for a leakage-safe end-to-end workflow. See the Dataset Builder guide for details.

from ml4t.engineer import create_dataset_builder

builder = create_dataset_builder(
    features=features_df,
    labels=labels_series,
    scaler="robust",   # "standard", "minmax", "robust", or None
)

# Scaling happens automatically during train/test split
X_train, X_test, y_train, y_test = builder.train_test_split(train_size=0.8)

See It In The Book¶

Ch7 02_preprocessing_pipeline.py for split-aware preprocessing
ML Readiness for deciding which features need scaling first
Book Guide for the full Chapter 7 workflow map

Next Steps¶

Read Dataset Builder for the end-to-end training-data workflow.
Read ML Readiness to separate normalized and non-normalized features.
Use the API Reference for exact scaler and pipeline objects.