API Reference¶
Use this page when you already know the workflow and need exact objects, method signatures, and module locations. If you are still deciding which workflow to use, start with the User Guide or the Book Guide.
Core Functions¶
api
¶
Config-driven feature computation API for ml4t.engineer.
This module provides the main public API for computing features from configurations.
Exports
compute_features(data, features, column_map=None) -> DataFrame
    Main API for computing technical indicators on OHLCV data.

Constants:
- COLUMN_ARG_MAP: dict - Maps function params to DataFrame columns
- INPUT_TYPE_COLUMNS: dict - Maps input_type metadata to required columns
Internal
- _parse_feature_input() - Parse feature specifications
- _resolve_dependencies() - Topological sort of features
- _execute_feature() - Execute single feature computation
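As an illustration of the dependency-ordering step, a topological sort can be sketched with the standard library's `graphlib`; this is a minimal sketch, not the actual `_resolve_dependencies` implementation:

```python
from graphlib import TopologicalSorter

def resolve_order(deps: dict[str, set[str]]) -> list[str]:
    # deps maps each feature to the set of features it depends on.
    # graphlib raises CycleError on circular dependencies, which the
    # public API surfaces as ValueError (per the docs above).
    return list(TopologicalSorter(deps).static_order())

# "macd_signal" depends on "macd"; "macd" has no dependencies
order = resolve_order({"macd_signal": {"macd"}, "macd": set()})
# "macd" is scheduled before "macd_signal"
```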
compute_features
¶
Compute features from a configuration.
This is the main public API for QFeatures. It accepts feature specifications in multiple formats and computes them in dependency order.
Parameters¶
data : pl.DataFrame | pl.LazyFrame
    Input data (typically OHLCV)
features : list[str] | list[dict] | Path | str
    Feature specification in one of three formats:
1. List of feature names (use default parameters):
```python
["rsi", "macd", "bollinger_bands"]
```
2. List of dicts with parameters:
```python
[
{"name": "rsi", "params": {"period": 14}},
{"name": "macd", "params": {"fast": 12, "slow": 26}},
]
```
3. Path to YAML config file:
```python
Path("features.yaml")
# or string path
"config/features.yaml"
```
Returns¶
pl.DataFrame | pl.LazyFrame
    Input data with computed feature columns added
Raises¶
ValueError
    If feature not found in registry or circular dependency detected
ImportError
    If YAML config provided but PyYAML not installed
FileNotFoundError
    If config file path doesn't exist
Examples¶
```python
import polars as pl
from ml4t.engineer.api import compute_features

# Load OHLCV data
df = pl.DataFrame({
    "open": [100.0, 101.0, 102.0],
    "high": [102.0, 103.0, 104.0],
    "low": [99.0, 100.0, 101.0],
    "close": [101.0, 102.0, 103.0],
    "volume": [1000, 1100, 1200],
})

# Compute features with default parameters
result = compute_features(df, ["rsi", "sma"])

# Compute features with custom parameters
result = compute_features(df, [
    {"name": "rsi", "params": {"period": 20}},
    {"name": "sma", "params": {"period": 50}},
])

# Compute from YAML config
result = compute_features(df, "features.yaml")
```
Notes¶
- Features are computed in dependency order using topological sort
- Circular dependencies are detected and raise ValueError
- Parameters in config override default parameters from registry
Source code in src/ml4t/engineer/api.py
Feature Discovery¶
catalog
¶
Feature catalog for enhanced discoverability.
Exports
FeatureCatalog(registry) - Feature discovery interface
    .list(category=None, normalized=None, ...) -> list[FeatureMetadata]
    .search(query, ...) -> list[FeatureMetadata]
    .describe(name) -> str - Rich feature description
    .categories() -> list[str] - Available categories
    .tags() -> list[str] - Available tags
Module-level API (via proxy):

```python
from ml4t.engineer import features

features.list(category="momentum")
features.search("volatility")
features.describe("rsi")
```
Provides filtering, search, and description capabilities for the feature registry.
Examples¶
```python
from ml4t.engineer import features

# List all momentum features
features.list(category="momentum")

# Find normalized features for ML
features.list(normalized=True, limit=10)

# Search for volatility-related features
features.search("volatility")

# Get detailed description
features.describe("rsi")
```
FeatureCatalog
¶
Enhanced feature discovery interface.
Wraps the FeatureRegistry to provide filtering, search, and rich description capabilities for feature discovery.
Parameters¶
registry : FeatureRegistry | None
    Registry to wrap. If None, uses the global registry.
Examples¶
```python
from ml4t.engineer.discovery import FeatureCatalog

catalog = FeatureCatalog()

# Multi-criteria filtering
catalog.list(category="momentum", normalized=True, ta_lib_compatible=True)

# Full-text search
results = catalog.search("moving average")
for name, score in results:
    print(f"{name}: {score:.2f}")

# Rich description
info = catalog.describe("rsi")
print(info["formula"])
```
Initialize catalog with registry.
Parameters¶
registry : FeatureRegistry | None
    Registry to wrap. If None, uses the global registry.
Source code in src/ml4t/engineer/discovery/catalog.py
list
¶
list(
category=None,
normalized=None,
ta_lib_compatible=None,
tags=None,
input_type=None,
output_type=None,
has_dependencies=None,
limit=None,
)
List features matching specified criteria.
All criteria are combined with AND logic. If no criteria specified, returns all registered features.
Parameters¶
category : str | None
    Filter by category (e.g., "momentum", "volatility", "ml")
normalized : bool | None
    Filter by ML-ready status (True = scale-invariant)
ta_lib_compatible : bool | None
    Filter by TA-Lib validation status
tags : list[str] | None
    Filter by tags (AND matching - must have ALL specified tags)
input_type : str | None
    Filter by input data requirements (e.g., "OHLCV", "close")
output_type : str | None
    Filter by output type (e.g., "indicator", "signal", "label")
has_dependencies : bool | None
    Filter by whether feature has dependencies
limit : int | None
    Maximum number of results to return
Returns¶
list[str]
    Sorted list of feature names matching all criteria
Examples¶
```python
# All momentum indicators
features.list(category="momentum")

# ML-ready volatility features
features.list(category="volatility", normalized=True)

# Features that only need close price
features.list(input_type="close")
```
Source code in src/ml4t/engineer/discovery/catalog.py
describe
¶
Get rich metadata for a single feature.
Parameters¶
name : str
    Feature name to describe
Returns¶
dict[str, Any]
    Full metadata as dictionary with computed properties:
    - name, category, description, formula
    - normalized, ta_lib_compatible
    - input_type, output_type
    - parameters (default values)
    - dependencies, references, tags
    - value_range (if defined)
    - lookback_period (computed from default params)
Raises¶
KeyError If feature not found in registry
Examples¶
```python
info = features.describe("rsi")
print(info["description"])  # 'Relative Strength Index'
print(info["parameters"])
```
Source code in src/ml4t/engineer/discovery/catalog.py
search
¶
Full-text search across feature metadata.
Searches name, description, formula, and tags by default. Returns results sorted by relevance score (higher = better match).
Parameters¶
query : str
    Search query (case-insensitive substring matching)
search_fields : list[str] | None
    Fields to search. Default: ["name", "description", "formula", "tags"]
    Available: name, description, formula, category, tags, references
max_results : int
    Maximum number of results to return (default 10)
Returns¶
list[tuple[str, float]]
    List of (feature_name, relevance_score) tuples, sorted by score.
    Score is 0.0-1.0, with 1.0 being exact name match.
Examples¶
```python
# Search for volatility features
results = features.search("volatility")
for name, score in results[:5]:
    print(f"{name}: {score:.2f}")

# Search only in names and tags
results = features.search("momentum", search_fields=["name", "tags"])
```
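The relevance ranking can be illustrated with a toy substring scorer. The weights below (1.0 for an exact name match, 0.8 for a name substring, 0.5 for a description substring) are hypothetical and not the library's actual scoring:

```python
def toy_score(query: str, name: str, description: str) -> float:
    # Hypothetical weights -- the real search() weighting is internal.
    q = query.lower()
    if q == name.lower():
        return 1.0
    if q in name.lower():
        return 0.8
    if q in description.lower():
        return 0.5
    return 0.0

catalog = {
    "rsi": "Relative Strength Index",
    "stoch_rsi": "Stochastic RSI oscillator",
    "atr": "Average True Range",
}
results = sorted(
    ((name, toy_score("rsi", name, desc)) for name, desc in catalog.items()),
    key=lambda t: t[1],
    reverse=True,
)
# exact match "rsi" (1.0) ranks above substring match "stoch_rsi" (0.8)
```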
Source code in src/ml4t/engineer/discovery/catalog.py
by_input_type
¶
Get features that accept a specific input type.
Parameters¶
input_type : str
    Input type to filter by (e.g., "OHLCV", "close", "returns")
Returns¶
list[str]
    Sorted list of feature names requiring this input type
Examples¶
```python
# Features that only need close prices (simpler data requirements)
features.by_input_type("close")

# Features that need full OHLCV data
features.by_input_type("OHLCV")
```
Source code in src/ml4t/engineer/discovery/catalog.py
by_lookback
¶
Get features with lookback period at or below threshold.
Useful for real-time applications with limited history.
Parameters¶
max_lookback : int
    Maximum acceptable lookback period (in bars)
Returns¶
list[str]
    Sorted list of feature names with lookback <= max_lookback
Examples¶
```python
# Features usable with only 20 bars of history
features.by_lookback(20)
```
Source code in src/ml4t/engineer/discovery/catalog.py
categories
¶
Get all unique feature categories.
Returns¶
list[str]
    Sorted list of unique category names
Examples¶
```python
features.categories()
# ['math', 'microstructure', 'ml', 'momentum', 'price_transform', ...]
```
Source code in src/ml4t/engineer/discovery/catalog.py
input_types
¶
Get all unique input types across features.
Returns¶
list[str]
    Sorted list of unique input types
Examples¶
```python
features.input_types()
# ['OHLCV', 'close', 'returns']
```
Source code in src/ml4t/engineer/discovery/catalog.py
stats
¶
Get summary statistics about registered features.
Returns¶
dict[str, Any]
    Statistics including:
    - total: Total number of features
    - by_category: Count per category
    - normalized: Count of ML-ready features
    - ta_lib_compatible: Count of TA-Lib validated features
    - by_input_type: Count per input type
Examples¶
```python
stats = features.stats()
print(f"Total features: {stats['total']}")
print(f"Momentum: {stats['by_category'].get('momentum', 0)}")
```
Source code in src/ml4t/engineer/discovery/catalog.py
__len__
¶
Labeling¶
labeling
¶
Labeling module for ml4t.engineer.
Provides generalized labeling functionality including triple-barrier method.
PandasMarketCalendar
¶
Adapter for pandas_market_calendars library.
Supports 200+ calendars including CME, NYSE, LSE, etc.
Parameters¶
calendar_name : str
    Calendar name (e.g., "CME_Equity", "NYSE", "LSE").
    See pandas_market_calendars.get_calendar_names()
Source code in src/ml4t/engineer/labeling/calendar.py
is_trading_time
¶
Check if timestamp is during trading session.
Source code in src/ml4t/engineer/labeling/calendar.py
next_session_break
¶
Get next session close after timestamp.
Source code in src/ml4t/engineer/labeling/calendar.py
SimpleTradingCalendar
¶
Simple calendar based on time gaps in data.
Identifies session breaks by detecting gaps larger than threshold. Useful when explicit calendar is unavailable.
Parameters¶
gap_threshold_minutes : int
    Gap duration in minutes to consider as session break
Source code in src/ml4t/engineer/labeling/calendar.py
fit
¶
Learn session breaks from data gaps.
Parameters¶
data : pl.DataFrame
    Data with timestamp column
timestamp_col : str
    Name of timestamp column
Returns¶
self : SimpleTradingCalendar
    Fitted calendar
Source code in src/ml4t/engineer/labeling/calendar.py
is_trading_time
¶
Always returns True for simple calendar (data defines trading times).
next_session_break
¶
Get next session break after timestamp.
Source code in src/ml4t/engineer/labeling/calendar.py
TradingCalendar
¶
atr_triple_barrier_labels
¶
atr_triple_barrier_labels(
data,
atr_tp_multiple=None,
atr_sl_multiple=None,
atr_period=None,
max_holding_bars=None,
side=None,
price_col=None,
timestamp_col=None,
group_col=None,
trailing_stop=False,
*,
config=None,
contract=None,
)
Triple barrier labeling with ATR-adjusted dynamic barriers.
Instead of fixed percentage barriers, this function uses Average True Range (ATR) multiples to create volatility-adaptive profit targets and stop losses.
Why ATR-Adjusted Barriers?
Traditional fixed-percentage barriers (e.g., ±2%) work poorly across:
- Different volatility regimes (calm vs volatile markets)
- Different assets (low-vol bonds vs high-vol crypto)
- Different timeframes (intraday vs daily)

ATR-adjusted barriers solve this by adapting to realized volatility:
- High volatility: Wider barriers (2×ATR might be 4% in volatile markets)
- Low volatility: Tighter barriers (2×ATR might be 0.5% in calm markets)

Backtest Results (SPY 2010-2024):
- Fixed 2%/1% barriers: 52.3% accuracy, Sharpe 0.85
- ATR 2×/1× barriers: 57.8% accuracy, Sharpe 1.45 (+40% improvement)
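To make the adaptation concrete, here is a minimal sketch of how ATR multiples translate into long-side barrier prices (it mirrors the direction logic documented in the Notes; it is not the library's internals):

```python
# Sketch: barrier prices for a long position from an entry price,
# an ATR value, and the TP/SL multiples.
def long_barriers(entry: float, atr: float,
                  tp_multiple: float = 2.0, sl_multiple: float = 1.0):
    take_profit = entry + tp_multiple * atr
    stop_loss = entry - sl_multiple * atr
    return take_profit, stop_loss

# Volatile regime: ATR = 2.0 on a 100.0 entry -> barriers at 104.0 / 98.0
print(long_barriers(100.0, 2.0))   # (104.0, 98.0)
# Calm regime: ATR = 0.25 -> much tighter barriers at 100.5 / 99.75
print(long_barriers(100.0, 0.25))  # (100.5, 99.75)
```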
Parameters¶
data : pl.DataFrame | pl.LazyFrame
    OHLCV data with timestamp. Must contain 'high', 'low', 'close'
    columns for ATR calculation.
atr_tp_multiple : float, default 2.0
    Take profit distance as multiple of ATR (e.g., 2.0 = profit at
    entry ± 2×ATR). Typical range: 1.5-3.0.
atr_sl_multiple : float, default 1.0
    Stop loss distance as multiple of ATR (e.g., 1.0 = stop at
    entry ± 1×ATR). Typical range: 0.5-2.0.
atr_period : int, default 14
    ATR calculation period (Wilder's original: 14). Shorter periods
    (7-10) react faster, longer (20-28) are smoother.
max_holding_bars : int | str | None, default None
    Maximum holding period:
    - int: Fixed number of bars
    - str: Column name with dynamic holding period per row
    - None: No time-based exit (barriers or end of data only)
side : Literal[1, -1, 0] | str | None, default 1
    Position direction:
    - 1: Long (profit when price rises)
    - -1: Short (profit when price falls)
    - 0: Meta-labeling (only directional barriers, no side)
    - str: Column name for dynamic side per row
    - None: Same as 0
price_col : str, default "close"
    Price column for barrier calculation (typically 'close').
timestamp_col : str, default "timestamp"
    Timestamp column for duration calculations.
trailing_stop : bool | float | str, default False
    Trailing stop configuration:
    - bool: Enable/disable with default distance behavior
    - float: Explicit trailing stop distance
    - str: Column name with per-row trailing stop distances
config : LabelingConfig, optional
    Pydantic configuration object (alternative to individual parameters).
    If provided, extracts atr_tp_multiple, atr_sl_multiple, atr_period,
    max_holding_bars, side, and trailing_stop from config. Individual
    parameters override config values if both are provided.
contract : DataContractConfig, optional
    Shared dataframe contract for timestamp/symbol/price columns.
    Applied when explicit parameters are omitted.
Returns¶
pl.DataFrame
    Original data with added label columns:
    - atr: ATR values (useful for analysis)
    - upper_barrier_distance: Profit target distance from entry (positive)
    - lower_barrier_distance: Stop loss distance from entry (positive)
    - label: -1 (stop hit), 0 (timeout), 1 (profit hit)
    - label_time: Index where barrier hit
    - label_bars: Number of bars held
    - label_duration: Time held (timedelta)
    - label_price: Price where barrier hit
    - label_return: Return at exit
Raises¶
DataValidationError If required OHLC columns are missing.
Notes¶
Direction Logic:
- Long (side=1): TP = entry + atr_tp_multiple × ATR, SL = entry - atr_sl_multiple × ATR
- Short (side=-1): TP = entry - atr_tp_multiple × ATR, SL = entry + atr_sl_multiple × ATR

ATR Calculation: Uses Wilder's original method (TA-Lib compatible):
- TR = max(high-low, |high-prev_close|, |low-prev_close|)
- ATR = Wilder's smoothing of TR over 'atr_period'

Performance Tips:
- Use longer ATR periods (20-28) for daily/weekly data
- Use shorter periods (7-10) for intraday data
- Typical TP/SL ratios: 2:1 or 3:1 (reward:risk)
- Backtest multiple combinations to find optimal parameters
Examples¶
```python
from datetime import datetime

import polars as pl
from ml4t.engineer.labeling import atr_triple_barrier_labels

# Long positions with 2:1 reward/risk
df = pl.DataFrame({
    "timestamp": pl.datetime_range(
        start=datetime(2024, 1, 1),
        end=datetime(2024, 1, 31),
        interval="1d",
        eager=True,
    ),
    "high": [101, 102, 103, ...],
    "low": [99, 100, 101, ...],
    "close": [100, 101, 102, ...],
})

labeled = atr_triple_barrier_labels(
    df,
    atr_tp_multiple=2.0,
    atr_sl_multiple=1.0,
    max_holding_bars=20,
)

# Analyze label distribution
print(labeled["label"].value_counts().sort("label"))

# Short positions
labeled = atr_triple_barrier_labels(
    df,
    atr_tp_multiple=2.0,
    atr_sl_multiple=1.0,
    side=-1,  # Short
    max_holding_bars=10,
)

# Dynamic side from predictions
df = df.with_columns(
    side_prediction=pl.Series([1, -1, 1, -1, ...])  # From model
)
labeled = atr_triple_barrier_labels(
    df,
    atr_tp_multiple=2.0,
    atr_sl_multiple=1.0,
    side="side_prediction",  # Dynamic side
)
```
Source code in src/ml4t/engineer/labeling/atr_barriers.py
calendar_aware_labels
¶
calendar_aware_labels(
data,
config,
calendar,
price_col=None,
timestamp_col=None,
group_col=None,
contract=None,
)
Apply triple-barrier labeling with session awareness.
Splits data by trading sessions and applies labeling within each session. This prevents labels from spanning session gaps (maintenance, overnight, holidays).
Parameters¶
data : pl.DataFrame
    Input data with OHLCV and timestamp
config : LabelingConfig
    Barrier configuration
calendar : str or TradingCalendar
    Either:
    - Calendar name string (uses pandas_market_calendars)
    - TradingCalendar protocol implementation
    - "auto" to detect gaps automatically
price_col : str | None, default None
    Price column name
timestamp_col : str | None, default None
    Timestamp column name
group_col : str | list[str] | None, default None
    Grouping column(s) for panel-aware session labeling.
contract : DataContractConfig | None, default None
    Optional shared dataframe contract. Used after config and before defaults.
Returns¶
pl.DataFrame Data with barrier labels, respecting session boundaries
Examples¶
```python
# CME futures with pandas_market_calendars
labeled = calendar_aware_labels(
    data,
    config=LabelingConfig.triple_barrier(upper_barrier=0.02, lower_barrier=0.02),
    calendar="CME_Equity",  # Product-specific calendar
)

# NYSE equities
labeled = calendar_aware_labels(
    data,
    config=LabelingConfig.triple_barrier(upper_barrier=0.01, lower_barrier=0.01),
    calendar="NYSE",
)

# Auto-detect gaps
labeled = calendar_aware_labels(
    data,
    config=LabelingConfig.triple_barrier(upper_barrier=0.02, lower_barrier=0.02),
    calendar="auto",
)

# Custom calendar
class MyCalendar:
    def is_trading_time(self, ts):
        return True
    def next_session_break(self, ts):
        return None

labeled = calendar_aware_labels(data, config, calendar=MyCalendar())
```
Notes¶
- Uses pandas_market_calendars for all string calendar names
- Supports 200+ global calendars + product-specific futures calendars
- See pandas_market_calendars.get_calendar_names() for available calendars
- Labels that would span session breaks are truncated at the break
- This may result in more timeout labels near session closes
- For 24/7 markets, use standard triple_barrier_labels instead
Source code in src/ml4t/engineer/labeling/calendar.py
fixed_time_horizon_labels
¶
fixed_time_horizon_labels(
data,
horizon=1,
method="returns",
price_col=None,
group_col=None,
timestamp_col=None,
tolerance=None,
*,
config=None,
contract=None,
)
Generate forward-looking labels based on fixed time horizon.
Creates labels by looking ahead a fixed number of periods (bars) or a fixed time duration and computing the return or direction of price movement. Commonly used for supervised learning in financial forecasting.
Parameters¶
data : pl.DataFrame
Input data with price information
horizon : int | str, default 1
Horizon for forward-looking labels:
- int: Number of bars to look ahead
- str: Duration string (e.g., '1h', '30m', '1d') for time-based horizon
method : str, default "returns"
Labeling method:
- "returns": (price[t+h] - price[t]) / price[t]
- "log_returns": log(price[t+h] / price[t])
- "binary": 1 if price[t+h] > price[t] else -1
price_col : str | None, default None
Name of the price column to use
group_col : str | list[str] | None, default None
Column(s) to group by for per-asset labels. If None, auto-detects from
common column names: 'symbol', 'product' (futures), or uses composite
grouping if 'position' column exists (e.g., for futures contract months).
Pass an empty list explicitly to disable grouping.
timestamp_col : str | None, default None
Column to use for chronological sorting. If None, auto-detects from
column dtype (pl.Datetime, pl.Date). Required for time-based horizons.
tolerance : str | None, default None
Maximum time gap allowed for time-based horizons (e.g., '2m').
Only used when horizon is a duration string. If the nearest future
price is beyond this tolerance, the label will be null.
config : LabelingConfig | None, default None
Optional column contract source. If provided, price_col, timestamp_col,
and group_col default to config values when omitted.
contract : DataContractConfig | None, default None
Optional shared dataframe contract. Used after config and before defaults.
Returns¶
pl.DataFrame
Original data with additional label column.
Last horizon values per group will be null (insufficient future data).
Examples¶
```python
# Bar-based: 5-period forward returns (unchanged API)
labeled = fixed_time_horizon_labels(df, horizon=5, method="returns")

# Time-based: 1-hour forward returns
labeled = fixed_time_horizon_labels(df, horizon="1h", method="returns")

# Time-based with tolerance for irregular data
labeled = fixed_time_horizon_labels(
    df, horizon="15m", tolerance="2m", method="returns"
)

# Binary classification (up/down)
labeled = fixed_time_horizon_labels(df, horizon=1, method="binary")

# Log returns for ML training
labeled = fixed_time_horizon_labels(df, horizon="1d", method="log_returns")
```
Notes¶
This is a simple labeling method that:
- Uses future information (forward-looking)
- Cannot be used for live prediction (requires future data)
- Best for supervised learning model training
- Last horizon rows will have null labels
Time-based horizons: When horizon is a duration string (e.g., '1h'),
the function uses join_asof to find the first available price at or
after that time in the future. This is useful for:
- Irregular data (trade bars) where you want time-based returns
- Multi-frequency workflows where time semantics matter
- Calendar-aware operations across trading breaks
Bar-based horizons: When horizon is an integer, the function uses simple shift operations for maximum performance.
Important: Data is automatically sorted by [group_cols, timestamp] before
computing labels. This is required because Polars .over() preserves row
order and does not sort within groups. The result is returned sorted
chronologically within each group.
References¶
.. [1] De Prado, M.L. (2018). Advances in Financial Machine Learning. Wiley. Chapter 3: Labeling.
See Also¶
triple_barrier_labels : Path-dependent labeling with profit/loss targets trend_scanning_labels : De Prado's trend scanning method
Source code in src/ml4t/engineer/labeling/horizon_labels.py
trend_scanning_labels
¶
trend_scanning_labels(
data,
min_window=5,
max_window=50,
step=1,
price_col=None,
timestamp_col=None,
group_col=None,
*,
config=None,
contract=None,
)
Generate labels using De Prado's trend scanning method.
For each observation, fits linear trends over windows of varying lengths and selects the window with the highest absolute t-statistic. The label is assigned based on the trend direction (sign of the t-statistic).
This method is more robust than fixed-horizon labeling as it adapts to the local trend structure in the data.
Parameters¶
data : pl.DataFrame
Input data with price information
min_window : int, default 5
Minimum window size to scan
max_window : int, default 50
Maximum window size to scan
step : int, default 1
Step size for window scanning
price_col : str | None, default None
Name of the price column to use
timestamp_col : str | None, default None
Column to use for chronological sorting. If None, auto-detects from
column dtype (pl.Datetime, pl.Date). Required for correct scanning.
group_col : str | list[str] | None, default None
Column(s) to group by for per-asset labels. If None, auto-detects from
common column names: 'symbol', 'product', 'ticker'.
Pass an empty list explicitly to disable grouping.
config : LabelingConfig | None, default None
Optional column contract source. If provided, price_col and
timestamp_col default to config values when omitted.
contract : DataContractConfig | None, default None
Optional shared dataframe contract. Used after config and before defaults.
Returns¶
pl.DataFrame
    Original data with additional columns:
    - label: ±1 based on trend direction
    - t_value: t-statistic of the selected trend
    - optimal_window: window size with highest |t-value|
Examples¶
```python
# Scan windows from 5 to 50 bars
labeled = trend_scanning_labels(df, min_window=5, max_window=50)

# Fast scanning with larger steps
labeled = trend_scanning_labels(df, min_window=10, max_window=100, step=5)

# Panel data: per-asset scanning
labeled = trend_scanning_labels(df, group_col="symbol")
```
Notes¶
The trend scanning method:
1. For each observation, scans forward with windows of varying lengths
2. Fits a linear regression to each window
3. Computes t-statistic for the slope coefficient
4. Selects the window with highest absolute t-statistic
5. Assigns label = sign(t-statistic)

This approach:
- Adapts to local trend structure
- More robust than fixed horizons
- Computationally expensive (O(n * m) where m = window range)
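Steps 2-5 can be sketched with NumPy, computing the slope t-statistic as slope / SE(slope) from an OLS fit (illustration only, not the library's implementation):

```python
import numpy as np

def slope_tstat(prices: np.ndarray) -> float:
    # OLS fit: price ~ a + b * t; t-stat = b / SE(b)
    t = np.arange(len(prices), dtype=float)
    b, a = np.polyfit(t, prices, 1)
    resid = prices - (a + b * t)
    s2 = resid @ resid / (len(prices) - 2)            # residual variance
    se_b = np.sqrt(s2 / ((t - t.mean()) @ (t - t.mean())))
    return b / se_b

def scan(prices: np.ndarray, min_window: int = 5,
         max_window: int = 50, step: int = 1):
    # Pick the forward window with the highest |t|; label = sign(t)
    windows = range(min_window, min(max_window, len(prices)) + 1, step)
    best = max(windows, key=lambda w: abs(slope_tstat(prices[:w])))
    return best, float(np.sign(slope_tstat(prices[:best])))
```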
Important: Data is automatically sorted by [group_col, timestamp] before scanning. This is required because the algorithm scans forward in row order.
References¶
.. [1] De Prado, M.L. (2018). Advances in Financial Machine Learning. Wiley. Chapter 18: Entropy Features (Section on Trend Scanning).
See Also¶
fixed_time_horizon_labels : Simple fixed-horizon labeling triple_barrier_labels : Path-dependent labeling with barriers
Source code in src/ml4t/engineer/labeling/horizon_labels.py
apply_meta_model
¶
apply_meta_model(
data,
primary_signal_col,
meta_probability_col,
bet_size_method="sigmoid",
scale=5.0,
threshold=0.5,
output_col="sized_signal",
)
Apply meta-model probability to size primary signal bets.
Combines the primary model's directional signal with the meta-model's confidence estimate to produce a sized position signal.
Parameters¶
data : pl.DataFrame
Input DataFrame with signal and probability columns.
primary_signal_col : str
Column with primary model signal (typically +1, -1, or 0).
meta_probability_col : str
Column with meta-model predicted probability [0, 1].
bet_size_method : {"linear", "sigmoid", "discrete"}, default "sigmoid"
Method to convert probability to bet size. See compute_bet_size.
scale : float, default 5.0
Scaling factor for sigmoid method.
threshold : float, default 0.5
Threshold for discrete method.
output_col : str, default "sized_signal"
Name for the output column.
Returns¶
pl.DataFrame
    Original DataFrame with added sized signal column:
    sized_signal = sign(primary_signal) * bet_size(probability)
Notes¶
The sized signal is computed as:
.. math::
\text{sized\_signal} = \text{sign}(\text{signal}) \cdot f(\text{probability})
where f() is the bet sizing function.
The output can be used directly as position weights in a backtest, where the sign indicates direction and magnitude indicates conviction.
Examples¶
```python
import polars as pl
from ml4t.engineer.labeling import apply_meta_model

df = pl.DataFrame({
    "signal": [1, -1, 1, -1],
    "meta_prob": [0.8, 0.3, 0.5, 0.9],
})
result = apply_meta_model(df, "signal", "meta_prob")
# High prob + long signal  -> strong positive
# Low prob + short signal  -> weak negative (may filter)
# 0.5 prob + any signal    -> near zero (uncertain)
```
See Also¶
meta_labels : Create meta-labels for training meta-model. compute_bet_size : Underlying bet sizing functions.
Source code in src/ml4t/engineer/labeling/meta_labels.py
compute_bet_size
¶
Compute bet size from meta-model probability.
Transforms the meta-model's predicted probability of success into a bet sizing coefficient. Higher probability leads to larger positions.
Parameters¶
probability : pl.Expr | str
    Column containing meta-model probability predictions [0, 1].
method : {"linear", "sigmoid", "discrete"}, default "sigmoid"
    Bet sizing function:
    - "linear": bet_size = 2 * (prob - 0.5), range [-1, 1]
    - "sigmoid": bet_size = 2 / (1 + e^(-scale * (prob - 0.5))) - 1
    - "discrete": bet_size = 1 if prob > threshold else 0
scale : float, default 1.0
    Scaling factor for sigmoid. Higher values create sharper cutoff.
    Ignored for "linear" and "discrete" methods.
threshold : float, default 0.5
    Probability threshold for "discrete" method.
    Ignored for "linear" and "sigmoid" methods.
Returns¶
pl.Expr Bet size coefficient, typically in range [0, 1] or [-1, 1].
Notes¶
The bet size methods are:
Linear: Simple linear scaling centered at 0.5 .. math::
\text{bet\_size} = 2 \cdot (p - 0.5)
Sigmoid: S-curve that concentrates bets near extremes .. math::
\text{bet\_size} = \frac{2}{1 + e^{-s \cdot (p - 0.5)}} - 1
Discrete: Binary sizing based on threshold .. math::
\text{bet\_size} = \mathbb{1}[p > \text{threshold}]
Examples¶
```python
import polars as pl
from ml4t.engineer.labeling import compute_bet_size

df = pl.DataFrame({"prob": [0.3, 0.5, 0.7, 0.9]})
df.with_columns(
    compute_bet_size("prob", method="linear").alias("linear"),
    compute_bet_size("prob", method="sigmoid", scale=5.0).alias("sigmoid"),
    compute_bet_size("prob", method="discrete", threshold=0.6).alias("discrete"),
)
```
Source code in src/ml4t/engineer/labeling/meta_labels.py
compute_label_statistics
¶
Compute statistics for a binary label column.
Useful for validating label quality and understanding class balance.
Parameters¶
data : pl.DataFrame Data with label column label_col : str Name of binary label column
Returns¶
dict Statistics including: - total_bars: Total number of bars - positive_labels: Count of 1s - negative_labels: Count of 0s - null_labels: Count of nulls - positive_rate: Percentage of 1s (among non-null) - null_rate: Percentage of nulls
Examples¶
```python
stats = compute_label_statistics(df, "label_long_p95_h30")
print(f"Positive rate: {stats['positive_rate']:.2f}%")
print(f"Null rate: {stats['null_rate']:.2f}%")
```
Source code in src/ml4t/engineer/labeling/percentile_labels.py
rolling_percentile_binary_labels
¶
rolling_percentile_binary_labels(
data,
horizon,
percentile,
direction="long",
lookback_window=252 * 24 * 12,
price_col=None,
session_col=None,
min_samples=None,
group_col=None,
timestamp_col=None,
tolerance=None,
*,
config=None,
contract=None,
)
Create binary labels using rolling historical percentiles.
Computes forward returns, then creates binary labels by comparing returns to rolling percentile thresholds. Thresholds adapt to volatility regimes.
Algorithm:
1. Compute forward returns over horizon (session-aware if session_col provided)
2. Compute rolling percentile threshold from the lookback window
3. For long: label = 1 if forward_return >= threshold, else 0. For short: label = 1 if forward_return <= threshold, else 0
Parameters¶
data : pl.DataFrame
Input data with OHLCV and optionally session_date
horizon : int | str
Forward-looking horizon:
- int: Number of bars
- str: Duration string (e.g., '1h', '30m', '1d')
percentile : float
Percentile for thresholding (0-100)
- Long: High percentiles (e.g., 95, 98) → top returns
- Short: Low percentiles (e.g., 5, 10) → bottom returns
direction : {"long", "short"}, default "long"
Trading direction:
- "long": Labels profitable long entries (high positive returns)
- "short": Labels profitable short entries (high negative returns)
lookback_window : int | str, default ~1 year
Rolling window size for percentile computation:
- int: Number of bars
- str: Duration string (e.g., '5d', '1w'). Polars rolling supports duration strings.
price_col : str | None, default None
Price column for return computation
session_col : str, optional
Session column for session-aware forward returns (e.g., "session_date")
If provided, forward returns won't cross session boundaries
min_samples : int, optional
Minimum samples for rolling calculation (default: 1008 = ~3.5 days of 5-min bars)
group_col : str | list[str] | None, default None
Column(s) to group by for panel-aware labeling. If None, auto-detects from
common symbol columns when present.
timestamp_col : str | None, default None
Column to use for chronological sorting. If None, auto-detects from
column dtype (pl.Datetime, pl.Date). Required for time-based horizons.
tolerance : str | None, default None
Maximum time gap allowed for time-based horizons (e.g., '2m').
Only used when horizon is a duration string.
config : LabelingConfig | None, default None
Optional column contract source. If provided, price_col, timestamp_col,
and group_col default to config values when omitted.
contract : DataContractConfig | None, default None
Optional shared dataframe contract. Used after config and before defaults.
Returns¶
pl.DataFrame
Original data with added columns:
- forward_return_{horizon}: Forward returns
- threshold_p{percentile}_h{horizon}: Rolling percentile threshold
- label_{direction}_p{percentile}_h{horizon}: Binary label (0 or 1)
Examples¶
```python
# Bar-based: top 5% of 30-bar returns
labels_long = rolling_percentile_binary_labels(
    df,
    horizon=30,
    percentile=95,
    direction="long",
    session_col="session_date",
)
print(labels_long["label_long_p95_h30"].mean())  # Should be ~0.05

# Time-based: 1-hour forward returns with 5-day lookback
labels = rolling_percentile_binary_labels(
    df,
    horizon="1h",
    percentile=95,
    direction="long",
    lookback_window="5d",
)

# Short labels: bottom 5% of returns (5th percentile)
labels_short = rolling_percentile_binary_labels(
    df,
    horizon=30,
    percentile=5,
    direction="short",
    session_col="session_date",
)
```
Notes¶
- First lookback_window bars will have null labels (insufficient history)
- Last horizon bars will have null forward returns (insufficient future data)
- Class balance approximately matches percentile (p95 → ~5% positives)
- Adaptive: Thresholds widen in high volatility, tighten in low volatility
- No lookahead bias: Only uses past data for percentile computation
Time-based horizons: When horizon is a duration string, uses join_asof to get future prices. This is useful for irregular data like trade bars.
Time-based lookback: Polars rolling functions natively support duration strings for the window parameter, allowing time-based rolling windows.
Important: Data is automatically sorted by timestamp before labeling. This is required because Polars .over() and .shift() preserve row order. The result is returned sorted chronologically.
Source code in src/ml4t/engineer/labeling/percentile_labels.py
rolling_percentile_multi_labels
¶
rolling_percentile_multi_labels(
data,
horizons,
percentiles,
direction="long",
lookback_window=252 * 24 * 12,
price_col=None,
session_col=None,
group_col=None,
timestamp_col=None,
tolerance=None,
*,
config=None,
contract=None,
)
Create binary labels for multiple horizons and percentiles.
Convenience function to generate labels for multiple configurations in a single call.
Parameters¶
data : pl.DataFrame
Input data with OHLCV and optionally session_date
horizons : list[int | str]
List of forward-looking horizons (e.g., [15, 30, "1h"])
percentiles : list[float]
List of percentiles (e.g., [95, 98] for long, [5, 10] for short)
direction : {"long", "short"}, default "long"
Trading direction
lookback_window : int | str, default ~1 year
Rolling window size for percentile computation
price_col : str | None, default None
Price column
session_col : str, optional
Session column for session-aware returns
group_col : str | list[str] | None, default None
Column(s) to group by for panel-aware labeling.
timestamp_col : str | None, default None
Timestamp column for time-based horizons/lookbacks.
tolerance : str | None, default None
Maximum time gap for time-based horizons.
config : LabelingConfig | None, default None
Optional column contract source.
contract : DataContractConfig | None, default None
Optional shared dataframe contract. Used after config and before defaults.
Returns¶
pl.DataFrame Original data with label columns for all combinations: - label_{direction}_p{percentile}_h{horizon}
Examples¶
```python
# Generate labels for multiple horizons and percentiles
labels = rolling_percentile_multi_labels(
    df,
    horizons=[15, 30, 60],
    percentiles=[95, 98],
    direction="long",
    session_col="session_date",
)
# Creates 6 label columns: 3 horizons x 2 percentiles
print([c for c in labels.columns if c.startswith("label_")])
```
Source code in src/ml4t/engineer/labeling/percentile_labels.py
triple_barrier_labels
¶
triple_barrier_labels(
data,
config,
price_col=None,
high_col=None,
low_col=None,
timestamp_col=None,
group_col=None,
calculate_uniqueness=False,
uniqueness_weight_scheme="returns_uniqueness",
contract=None,
)
Apply triple-barrier labeling to data.
Labels price movements based on which barrier (upper, lower, or time) is touched first. Optionally calculates label uniqueness and sample weights (De Prado's AFML Chapter 4).
Parameters¶
data : pl.DataFrame Input data with price information config : LabelingConfig Triple-barrier labeling configuration. price_col : str | None, default None Name of the price column high_col : str, optional Name of the high price column for OHLC barrier checking low_col : str, optional Name of the low price column for OHLC barrier checking timestamp_col : str, optional Name of the timestamp column (uses row index if None) group_col : str | list[str] | None, default None Grouping columns for panel labeling. If None, auto-detects common asset identifier columns (e.g., symbol, product, ticker). calculate_uniqueness : bool, default False If True, calculates label uniqueness scores and sample weights uniqueness_weight_scheme : str, default "returns_uniqueness" Weighting scheme: "returns_uniqueness", "uniqueness_only", "returns_only", "equal" contract : DataContractConfig | None, default None Optional shared dataframe contract. Used when explicit columns/config are omitted.
Returns¶
pl.DataFrame Original data with added columns: label, label_time, label_price, label_return, label_bars, label_duration, barrier_hit, and optionally label_uniqueness, sample_weight
Notes¶
Important: Data is automatically sorted by [group_col, timestamp] before labeling. This is required because the algorithm scans forward in row order to find barrier touches. The result is returned sorted chronologically.
Examples¶
```python
from ml4t.engineer.config import LabelingConfig

config = LabelingConfig.triple_barrier(upper_barrier=0.02, lower_barrier=0.01)
labeled = triple_barrier_labels(df, config)
```
Source code in src/ml4t/engineer/labeling/triple_barrier.py
build_concurrency
¶
Calculate per-bar concurrency (how many labels are active at each time).
This function computes c[t] = number of labels active at time t using an efficient O(n) difference-array algorithm.
Parameters¶
event_indices : array Start indices of labels (when positions were entered) label_indices : array End indices of labels (when barriers were hit) n_bars : int, optional Total number of bars. If None, uses max(label_indices) + 1
Returns¶
array Concurrency at each timestamp (length = n_bars)
Notes¶
Concurrency is used to calculate label uniqueness. High concurrency at time t means many labels overlap there, indicating redundancy.
References¶
.. [1] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 4: Sample Weights.
Examples¶
```python
concurrency = build_concurrency(event_indices, label_indices, len(prices))
# concurrency[t] = number of active labels at time t
max_overlap = concurrency.max()  # Maximum label overlap
```
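The O(n) difference-array algorithm can be sketched in NumPy (an illustrative re-implementation, not the library source):

```python
import numpy as np

def concurrency_sketch(starts, ends, n_bars=None):
    starts, ends = np.asarray(starts), np.asarray(ends)
    if n_bars is None:
        n_bars = int(ends.max()) + 1
    # +1 where a label becomes active, -1 just after it ends
    diff = np.zeros(n_bars + 1, dtype=np.int64)
    np.add.at(diff, starts, 1)
    np.add.at(diff, ends + 1, -1)
    return np.cumsum(diff)[:n_bars]   # c[t] = labels active at bar t
```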
Source code in src/ml4t/engineer/labeling/uniqueness.py
calculate_label_uniqueness
¶
Calculate average uniqueness for each label based on overlapping periods.
Uniqueness measures how "independent" a label is from others. Labels that overlap with many others have low uniqueness (redundant information), while labels that are relatively isolated have high uniqueness.
Parameters¶
event_indices : array Start indices of labels (when positions were entered) label_indices : array End indices of labels (when barriers were hit) n_bars : int, optional Total number of bars. If None, uses max(label_indices) + 1
Returns¶
array Average uniqueness score for each label (between 0 and 1)
Notes¶
From López de Prado's AFML: u_i = (1/T_i) * Σ(1/c_t) for t in [start_i, end_i]
Where: - T_i is the length of label i's active period - c_t is the concurrency at time t (number of active labels) - Higher uniqueness means more independent information
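The formula can be sketched directly in NumPy (illustrative only; concurrency is recomputed inline with the same difference-array trick):

```python
import numpy as np

def avg_uniqueness_sketch(starts, ends, n_bars=None):
    starts, ends = np.asarray(starts), np.asarray(ends)
    if n_bars is None:
        n_bars = int(ends.max()) + 1
    diff = np.zeros(n_bars + 1, dtype=np.int64)
    np.add.at(diff, starts, 1)
    np.add.at(diff, ends + 1, -1)
    c = np.cumsum(diff)[:n_bars]       # concurrency c_t per bar
    inv = 1.0 / np.maximum(c, 1)       # 1/c_t (guard bars with no labels)
    # u_i = (1/T_i) * sum of 1/c_t over [start_i, end_i]
    return np.array([inv[s:e + 1].mean() for s, e in zip(starts, ends)])
```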
References¶
.. [1] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 4: Sample Weights.
Source code in src/ml4t/engineer/labeling/uniqueness.py
calculate_sample_weights
¶
Calculate sample weights combining statistical uniqueness and economic significance.
Parameters¶
uniqueness : array Average uniqueness scores from calculate_label_uniqueness returns : array Label returns (from entry to exit) weight_scheme : str Weighting scheme to use: - "returns_uniqueness": u_i * |r_i| (De Prado's recommendation) - "uniqueness_only": u_i only (statistical correction) - "returns_only": |r_i| only (economic significance) - "equal": uniform weights
Returns¶
array Sample weights for training (normalized to sum to len(weights))
Notes¶
De Prado recommends "returns_uniqueness" to balance: - Statistical independence (uniqueness) - Economic importance (return magnitude)
This prevents overweighting "boring" full-horizon labels while preserving the importance of profitable trades.
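A minimal NumPy sketch of the four weighting schemes (illustrative re-implementation; the final normalization makes weights sum to n, as described in the Returns section):

```python
import numpy as np

def sample_weights_sketch(uniqueness, returns, scheme="returns_uniqueness"):
    u = np.asarray(uniqueness, dtype=float)
    r = np.abs(np.asarray(returns, dtype=float))
    w = {
        "returns_uniqueness": u * r,   # De Prado's recommendation
        "uniqueness_only": u,
        "returns_only": r,
        "equal": np.ones_like(u),
    }[scheme]
    return w * len(w) / w.sum()        # normalize so weights sum to n
```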
References¶
.. [1] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 4: Sample Weights.
Source code in src/ml4t/engineer/labeling/uniqueness.py
sequential_bootstrap
¶
sequential_bootstrap(
starts,
ends,
n_bars=None,
n_draws=None,
with_replacement=True,
random_state=None,
)
Sequential bootstrap that favors events with high marginal uniqueness.
This method creates a bootstrapped sample that minimizes redundancy by probabilistically selecting labels based on how unique they would be given the already-selected labels.
Parameters¶
starts : array Start indices of labels (event_indices) ends : array End indices of labels (label_indices) n_bars : int, optional Total number of bars. If None, uses max(ends) + 1 n_draws : int, optional Number of selections to make. Defaults to len(starts) with_replacement : bool, default True If False, each event can be selected at most once random_state : int or Generator, optional RNG seed or Generator for reproducibility
Returns¶
array Indices of selected events in the order drawn (length = n_draws)
Notes¶
From López de Prado's AFML Chapter 4: - At each step, pick the event that maximizes expected average uniqueness - Probability of selection is proportional to marginal uniqueness - Creates less redundant training sets compared to random sampling
References¶
.. [1] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 4: Sample Weights.
Examples¶
```python
# After triple-barrier labeling
order = sequential_bootstrap(event_indices, label_indices, len(prices))

# Use order to select training samples
X_train = X[order]
y_train = y[order]
weights_train = sample_weights[order]
```
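For intuition, the selection loop can be sketched naively (O(n_draws x n x span) per draw; the library implementation is presumably more efficient). Selection probability is proportional to each candidate's marginal average uniqueness given the bars already covered by drawn events:

```python
import numpy as np

def seq_bootstrap_sketch(starts, ends, n_draws=None, seed=0):
    rng = np.random.default_rng(seed)
    starts, ends = np.asarray(starts), np.asarray(ends)
    n = len(starts)
    n_draws = n_draws or n
    counts = np.zeros(int(ends.max()) + 1)   # coverage by already-drawn events
    drawn = []
    for _ in range(n_draws):
        # marginal average uniqueness of each candidate given current draws
        avg_u = np.array([(1.0 / (counts[s:e + 1] + 1.0)).mean()
                          for s, e in zip(starts, ends)])
        i = int(rng.choice(n, p=avg_u / avg_u.sum()))
        drawn.append(i)
        counts[starts[i]:ends[i] + 1] += 1
    return np.array(drawn)
```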
Source code in src/ml4t/engineer/labeling/uniqueness.py
register_labeling_features
¶
Register labeling features.
Parameters¶
registry : FeatureRegistry, optional Registry to register features with. If None, uses global registry.
Returns¶
int Number of features registered
Source code in src/ml4t/engineer/labeling/__init__.py
Dataset Builder¶
MLDatasetBuilder
dataclass
¶
Build train/test datasets with proper leakage prevention.
This class provides a unified interface for: 1. Managing features and labels 2. Applying train-only preprocessing 3. Integrating with cross-validation splitters 4. Converting to sklearn-compatible formats
Parameters¶
features : pl.DataFrame Feature matrix with named columns. labels : pl.Series | pl.DataFrame Target variable(s). If DataFrame, first column is used. dates : pl.Series | None, optional Date/time index for time-series ordering.
Attributes¶
features : pl.DataFrame Feature matrix. labels : pl.Series Target labels. dates : pl.Series | None Date/time index. scaler : BaseScaler | None Scaler to apply to features.
Examples¶
```python
import polars as pl
from ml4t.engineer.dataset import MLDatasetBuilder
from ml4t.engineer.preprocessing import StandardScaler

# Create synthetic data
features = pl.DataFrame({
    "momentum": [0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.35, 0.5],
    "volatility": [0.01, 0.02, 0.015, 0.025, 0.02, 0.03, 0.028, 0.035],
})
labels = pl.Series("target", [0, 1, 0, 1, 0, 1, 1, 1])

# Build dataset with scaling
builder = MLDatasetBuilder(features, labels)
builder.set_scaler(StandardScaler())

# Manual train/test split
X_train, X_test, y_train, y_test = builder.train_test_split(train_size=0.75)
```
Notes¶
The key design principle is that ALL statistics (mean, std, quantiles, etc.) are computed from training data ONLY. This prevents information leakage from future data into predictions.
set_scaler
¶
Set the scaler for preprocessing.
Parameters¶
scaler : BaseScaler | PreprocessingConfig | None Scaler to use. Accepts: - BaseScaler instance (StandardScaler, MinMaxScaler, RobustScaler) - PreprocessingConfig (Pydantic config, calls create_scaler()) - None to disable scaling
Returns¶
self Returns self for method chaining.
Examples¶
```python
builder.set_scaler(StandardScaler())
builder.set_scaler(MinMaxScaler(feature_range=(-1, 1)))
builder.set_scaler(None)  # Disable scaling

# Using PreprocessingConfig for reproducibility
from ml4t.engineer.config import PreprocessingConfig
builder.set_scaler(PreprocessingConfig.robust())
```
Source code in src/ml4t/engineer/dataset.py
split
¶
Generate train/test splits with proper preprocessing.
Parameters¶
cv : SplitterProtocol Cross-validation splitter (from ml4t.diagnostic.splitters or sklearn). groups : pl.Series | None, optional Group labels for group-based splitting.
Yields¶
FoldResult Result object containing preprocessed train/test data.
Examples¶
```python
from ml4t.diagnostic.splitters import PurgedWalkForwardCV

cv = PurgedWalkForwardCV(n_splits=5, embargo_pct=0.01)
for fold in builder.split(cv):
    model.fit(fold.X_train, fold.y_train)
    preds = model.predict(fold.X_test)
    print(f"Fold {fold.fold_number}: {len(fold.train_indices)} train, "
          f"{len(fold.test_indices)} test")
```
Notes¶
For each fold: 1. Training indices are extracted from the splitter 2. Scaler (if any) is fit on training data ONLY 3. Both train and test features are transformed using train statistics 4. Labels are sliced without transformation
Source code in src/ml4t/engineer/dataset.py
train_test_split
¶
Simple train/test split with preprocessing.
Parameters¶
train_size : float, default 0.8 Proportion of data for training (0.0 to 1.0). shuffle : bool, default False Whether to shuffle before splitting. For time-series, keep False. random_state : int | None, optional Random seed for reproducibility when shuffling.
Returns¶
tuple[pl.DataFrame, pl.DataFrame, pl.Series, pl.Series] (X_train, X_test, y_train, y_test) with preprocessing applied.
Examples¶
```python
X_train, X_test, y_train, y_test = builder.train_test_split(train_size=0.7)
```
Notes¶
For time-series data, set shuffle=False to preserve temporal ordering. The split point is based on row position, not dates.
Source code in src/ml4t/engineer/dataset.py
to_numpy
¶
Convert full dataset to numpy arrays.
Returns¶
tuple[NDArray, NDArray] (features, labels) as numpy arrays.
Notes¶
This does NOT apply scaling. Use for raw data access only. For sklearn compatibility with scaling, use split() or train_test_split().
Source code in src/ml4t/engineer/dataset.py
to_pandas
¶
Convert full dataset to pandas.
Returns¶
tuple[pd.DataFrame, pd.Series] (features, labels) as pandas objects.
Notes¶
This does NOT apply scaling. Use for raw data access only.
Source code in src/ml4t/engineer/dataset.py
create_dataset_builder
¶
Convenience function to create MLDatasetBuilder with common defaults.
Parameters¶
features : pl.DataFrame Feature matrix. labels : pl.Series | pl.DataFrame Target variable. dates : pl.Series | None, optional Date/time index. scaler : BaseScaler | PreprocessingConfig | str | None, default "standard" Scaler to use. Options: - "standard": StandardScaler (z-score) - "minmax": MinMaxScaler ([0, 1]) - "robust": RobustScaler (median/IQR) - BaseScaler instance: Use provided scaler - PreprocessingConfig: Use config to create scaler - None: No scaling
Returns¶
MLDatasetBuilder Configured dataset builder.
Examples¶
```python
builder = create_dataset_builder(features, labels, scaler="robust")

# Using PreprocessingConfig for reproducibility
from ml4t.engineer.config import PreprocessingConfig
config = PreprocessingConfig.robust(quantile_range=(10.0, 90.0))
builder = create_dataset_builder(features, labels, scaler=config)
```
Source code in src/ml4t/engineer/dataset.py
FoldResult
dataclass
¶
FoldResult(
X_train,
X_test,
y_train,
y_test,
train_indices,
test_indices,
fold_number,
scaler=None,
)
Result from a single cross-validation fold.
Attributes¶
X_train : pl.DataFrame Preprocessed training features. X_test : pl.DataFrame Preprocessed test features (using train statistics). y_train : pl.Series Training labels. y_test : pl.Series Test labels. train_indices : NDArray[np.intp] Original indices of training samples. test_indices : NDArray[np.intp] Original indices of test samples. fold_number : int Zero-indexed fold number. scaler : BaseScaler | None Fitted scaler used for this fold (None if no scaling).
to_numpy
¶
Convert to numpy arrays for sklearn compatibility.
Returns¶
tuple[NDArray, NDArray, NDArray, NDArray] (X_train, X_test, y_train, y_test) as numpy arrays.
Source code in src/ml4t/engineer/dataset.py
DatasetInfo
dataclass
¶
Information about the dataset.
Attributes¶
n_samples : int Number of samples. n_features : int Number of features. feature_names : list[str] Feature column names. label_name : str Label column name. has_dates : bool Whether dates are provided.
Preprocessing¶
preprocessing
¶
Preprocessing utilities for feature standardization with train-only fitting.
This module provides sklearn-like preprocessing transformers that maintain strict separation between training and test data statistics, preventing lookahead bias in ML pipelines.
Exports
StandardScaler - Z-score normalization (mean=0, std=1) MinMaxScaler - Scale to [0, 1] range RobustScaler - IQR-based scaling (outlier resistant) PreprocessingPipeline - Chain multiple transformers
ScalerMethod - Enum: STANDARD, MINMAX, ROBUST TransformType - Enum: SCALE, CLIP, WINSORIZE
Key Concepts: - Fit on training data only, transform both train and test - Polars-native implementation for performance - Immutable after fit (statistics locked) - Serializable for production deployment
Example
```python
from ml4t.engineer.preprocessing import StandardScaler

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df)
test_scaled = scaler.transform(test_df)  # Uses train statistics
```
StandardScaler
¶
Bases: BaseScaler
Z-score normalization: (x - mean) / std.
Transforms features to have mean=0 and std=1 using training data statistics.
Parameters¶
columns : list[str] | None Columns to scale. If None, all numeric columns are scaled. with_mean : bool, default True Center data by subtracting mean. with_std : bool, default True Scale data by dividing by std. ddof : int, default 1 Delta degrees of freedom for std calculation.
Examples¶
```python
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df)
test_scaled = scaler.transform(test_df)  # Uses train mean/std
```
Source code in src/ml4t/engineer/preprocessing.py
MinMaxScaler
¶
Bases: BaseScaler
Scale features to [0, 1] range using min/max from training data.
Parameters¶
columns : list[str] | None Columns to scale. If None, all numeric columns are scaled. feature_range : tuple[float, float], default (0.0, 1.0) Desired range of transformed data.
Examples¶
```python
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_df)  # [0, 1] range
test_scaled = scaler.transform(test_df)  # May exceed [0, 1]
```
Source code in src/ml4t/engineer/preprocessing.py
RobustScaler
¶
Bases: BaseScaler
Scale using median and IQR (robust to outliers).
Uses median instead of mean, and interquartile range (IQR) instead of std.
Parameters¶
columns : list[str] | None Columns to scale. If None, all numeric columns are scaled. with_centering : bool, default True Center data by subtracting median. with_scaling : bool, default True Scale data by dividing by IQR. quantile_range : tuple[float, float], default (25.0, 75.0) Quantile range for IQR calculation.
Examples¶
```python
scaler = RobustScaler()
train_scaled = scaler.fit_transform(train_df)
test_scaled = scaler.transform(test_df)
```
Source code in src/ml4t/engineer/preprocessing.py
PreprocessingPipeline
¶
Apply preprocessing recommendations from ML4T Diagnostic.
This class enables bidirectional integration between ML4T Diagnostic and ML4T Engineer. After diagnostic evaluates features, it can recommend transforms which this pipeline applies with proper train/test separation.
The pipeline follows sklearn conventions: - fit(X): Learn statistics from training data only - transform(X): Apply transforms using fitted statistics - fit_transform(X): Combined fit and transform
Parameters¶
recommendations : dict | None Feature recommendations from FeatureEvaluatorConfig (ml4t-diagnostic). Format: {"feature_name": {"transform": "standardize", "confidence": 0.9}} min_confidence : float, default 0.0 Minimum confidence threshold for applying recommendations. Recommendations below this threshold default to NONE. winsorize_limits : tuple[float, float], default (0.01, 0.99) Percentile limits for winsorization.
Examples¶
```python
# From ML4T Diagnostic recommendations
recommendations = {
    "rsi_14": {"transform": "standardize", "confidence": 0.9},
    "returns": {"transform": "winsorize", "confidence": 0.85},
    "volume": {"transform": "log", "confidence": 0.8},
}
pipeline = PreprocessingPipeline.from_recommendations(recommendations)
train_transformed = pipeline.fit_transform(train_df)
test_transformed = pipeline.transform(test_df)

# Serialize for production
pipeline_dict = pipeline.to_dict()
# ... save to disk ...
loaded_pipeline = PreprocessingPipeline.from_dict(pipeline_dict)
```
Initialize pipeline with recommendations.
Source code in src/ml4t/engineer/preprocessing.py
from_recommendations
classmethod
¶
Create pipeline from diagnostic recommendations.
Parameters¶
recommendations : dict Output from FeatureEvaluatorConfig (ml4t-diagnostic) or similar format. Expected structure: {"feature": {"transform": "...", "confidence": ...}} min_confidence : float, default 0.0 Minimum confidence threshold. winsorize_limits : tuple, default (0.01, 0.99) Percentile limits for winsorization.
Returns¶
PreprocessingPipeline Configured pipeline ready for fitting.
Source code in src/ml4t/engineer/preprocessing.py
fit
¶
Fit pipeline on training data.
Computes statistics needed for each transform from training data only.
Parameters¶
X : pl.DataFrame Training data with feature columns.
Returns¶
self Fitted pipeline.
Source code in src/ml4t/engineer/preprocessing.py
transform
¶
Transform data using fitted statistics.
Parameters¶
X : pl.DataFrame Data to transform.
Returns¶
pl.DataFrame Transformed data.
Raises¶
NotFittedError If pipeline has not been fitted.
Source code in src/ml4t/engineer/preprocessing.py
fit_transform
¶
to_dict
¶
Serialize pipeline state for persistence.
Returns¶
dict Serializable representation of fitted pipeline.
Source code in src/ml4t/engineer/preprocessing.py
from_dict
classmethod
¶
Load fitted pipeline from serialized state.
Parameters¶
data : dict Output from to_dict().
Returns¶
PreprocessingPipeline Reconstructed fitted pipeline.
Source code in src/ml4t/engineer/preprocessing.py
get_transform_summary
¶
Get summary of transforms to be applied.
Returns¶
dict Mapping of feature names to transform types.
Source code in src/ml4t/engineer/preprocessing.py
__repr__
¶
Return string representation.
TransformType
¶
Bases: str, Enum
Transform types supported by PreprocessingPipeline.
These align with ml4t.diagnostic.integration.engineer_contract.TransformType.
NotFittedError
¶
Bases: Exception
Raised when transform is called before fit.
Configuration¶
config
¶
ML4T Engineer Configuration System.
This module provides Pydantic v2 configuration schemas for feature engineering:
- Labeling: Triple barrier, ATR barrier, fixed horizon, trend scanning
- Preprocessing: Standard, MinMax, Robust scalers with create_scaler()
- Data Contract: Schema validation for input data
- Experiment: Experiment configuration and serialization
Note
Feature evaluation configs (StationarityConfig, ACFConfig, etc.) have moved
to ml4t-diagnostic. Install with: pip install ml4t-diagnostic
LabelingConfig
¶
Bases: BaseConfig
Unified configuration for all labeling methods.
Extends BaseConfig for full JSON/YAML serialization support.
Supports multiple labeling methods via the method discriminator.
All barrier distances are specified as POSITIVE values representing the distance from the entry price. The position side determines the direction of the barriers.
Attributes¶
method : str Labeling method: "triple_barrier", "atr_barrier", "fixed_horizon", "trend_scanning", "percentile" price_col : str Price column for barrier calculations (typically 'close') timestamp_col : str Timestamp column for duration calculations
Triple Barrier Parameters¶
upper_barrier : float | str | None
    Upper barrier distance, or column name for dynamic barriers
lower_barrier : float | str | None
    Lower barrier distance, or column name for dynamic barriers
max_holding_period : int | str
    Maximum holding period in bars, or column name
side : int | str | None
    Position side: 1 (long), -1 (short), 0/None (symmetric)
trailing_stop : bool | float | str
    Enable trailing stop, or specify a percentage/column
ATR Barrier Parameters¶
atr_tp_multiple : float
    ATR multiplier for take profit (e.g., 2.0 = 2x ATR)
atr_sl_multiple : float
    ATR multiplier for stop loss (e.g., 1.0 = 1x ATR)
atr_period : int
    ATR calculation period (Wilder's default: 14)
Fixed Horizon Parameters¶
horizon : int
    Forward-looking period in bars
return_method : str
    Return calculation: "returns", "log_returns", or "binary"
threshold : float | None
    Binary classification threshold
Trend Scanning Parameters¶
min_horizon : int
    Minimum lookforward period
max_horizon : int
    Maximum lookforward period
t_value_threshold : float
    T-statistic threshold for trend significance
Examples¶
Triple barrier with fixed barriers¶
config = LabelingConfig(
    method="triple_barrier",
    upper_barrier=0.02,
    lower_barrier=0.01,
    max_holding_period=20,
    side=1,
)
config.to_yaml("config.yaml")
ATR-adjusted barriers¶
config = LabelingConfig.atr_barrier(
    atr_tp_multiple=2.0,
    atr_sl_multiple=1.0,
    max_holding_period=20,
)
Load from file¶
config = LabelingConfig.from_yaml("config.yaml")
validate_side
classmethod
¶
Validate side is valid.
Source code in src/ml4t/engineer/config/labeling.py
validate_max_horizon
classmethod
¶
Ensure max_horizon >= min_horizon.
Source code in src/ml4t/engineer/config/labeling.py
triple_barrier
classmethod
¶
triple_barrier(
upper_barrier=0.02,
lower_barrier=0.01,
max_holding_period=20,
side=1,
trailing_stop=False,
**kwargs,
)
Create triple barrier labeling config.
Parameters¶
upper_barrier : float | str | None
    Take profit barrier (2% = 0.02) or column name
lower_barrier : float | str | None
    Stop loss barrier (1% = 0.01) or column name
max_holding_period : int | str | timedelta
    Maximum holding period:
    - int: number of bars
    - str: duration string ('4h', '1d') or column name
    - timedelta: Python timedelta object
side : int | str | None
    Position direction: 1 (long), -1 (short)
trailing_stop : bool | float | str
    Enable trailing stop
Returns¶
LabelingConfig Configured for triple barrier method
Examples¶
config = LabelingConfig.triple_barrier(0.02, 0.01, 20)
config.to_yaml("triple_barrier.yaml")
Time-based max holding period¶
config = LabelingConfig.triple_barrier(0.02, 0.01, "4h")
Using timedelta¶
from datetime import timedelta config = LabelingConfig.triple_barrier(0.02, 0.01, timedelta(hours=4))
Source code in src/ml4t/engineer/config/labeling.py
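The barrier rule described above can be sketched in plain Python. This is an illustrative, simplified version of the triple-barrier logic for a long position (the function name and signature are hypothetical, not the library's API): label +1 if price touches the take-profit level first, -1 if it touches the stop-loss first, and 0 if the maximum holding period expires untouched.

```python
# Hedged sketch of the triple-barrier rule for a long position.
# Barriers are positive distances from the entry price, as described above.

def triple_barrier_label(prices, entry_idx, upper=0.02, lower=0.01, max_hold=20):
    entry = prices[entry_idx]
    upper_px = entry * (1 + upper)   # take-profit level
    lower_px = entry * (1 - lower)   # stop-loss level
    horizon = min(entry_idx + max_hold, len(prices) - 1)
    for i in range(entry_idx + 1, horizon + 1):
        if prices[i] >= upper_px:    # upper barrier touched first
            return 1
        if prices[i] <= lower_px:    # lower barrier touched first
            return -1
    return 0                         # vertical barrier (time-out)
```

The real implementation additionally handles short/symmetric sides, dynamic per-row barriers, and trailing stops.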
atr_barrier
classmethod
¶
atr_barrier(
atr_tp_multiple=2.0,
atr_sl_multiple=1.0,
atr_period=14,
max_holding_period=20,
side=1,
trailing_stop=False,
**kwargs,
)
Create ATR-adjusted barrier labeling config.
Volatility-adaptive barriers that adjust to market conditions.
Parameters¶
atr_tp_multiple : float
    ATR multiplier for take profit (e.g., 2.0 = 2x ATR)
atr_sl_multiple : float
    ATR multiplier for stop loss (e.g., 1.0 = 1x ATR)
atr_period : int
    ATR calculation period (default: 14)
max_holding_period : int | str | timedelta
    Maximum holding period:
    - int: number of bars
    - str: duration string ('4h', '1d') or column name
    - timedelta: Python timedelta object
side : int | str | None
    Position direction: 1 (long), -1 (short)
trailing_stop : bool
    Enable trailing stop
Returns¶
LabelingConfig Configured for ATR barrier method
Examples¶
config = LabelingConfig.atr_barrier(2.0, 1.0, 14)
config.to_yaml("atr_barrier.yaml")
Time-based max holding period¶
config = LabelingConfig.atr_barrier(2.0, 1.0, 14, max_holding_period="4h")
Source code in src/ml4t/engineer/config/labeling.py
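The "volatility-adaptive" part of this method comes from Wilder's ATR, which the `atr_tp_multiple`/`atr_sl_multiple` settings scale to place barriers. A minimal stdlib sketch of the ATR computation (the function name is illustrative; the library's own ATR lives elsewhere):

```python
# Sketch of Wilder's Average True Range: seed with a simple average of the
# first `period` true ranges, then apply Wilder smoothing. Barrier distances
# would then be atr_tp_multiple * ATR and atr_sl_multiple * ATR.

def wilder_atr(highs, lows, closes, period=14):
    trs = []
    for i in range(1, len(closes)):
        tr = max(
            highs[i] - lows[i],                 # intraday range
            abs(highs[i] - closes[i - 1]),      # gap up from prior close
            abs(lows[i] - closes[i - 1]),       # gap down from prior close
        )
        trs.append(tr)
    atr = sum(trs[:period]) / period            # seed value
    for tr in trs[period:]:
        atr = (atr * (period - 1) + tr) / period  # Wilder smoothing
    return atr
```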
fixed_horizon
classmethod
¶
Create fixed horizon labeling config.
Simple forward-looking returns over a fixed period.
Parameters¶
horizon : int
    Forward-looking period in bars
return_method : str
    "returns", "log_returns", or "binary"
threshold : float | None
    Threshold for binary classification
Returns¶
LabelingConfig Configured for fixed horizon method
Examples¶
config = LabelingConfig.fixed_horizon(10, "binary", threshold=0.0)
Source code in src/ml4t/engineer/config/labeling.py
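The three return methods reduce to a small amount of arithmetic. A stdlib sketch of what fixed-horizon labeling computes (illustrative names, not the library's API):

```python
# Sketch of fixed-horizon labeling: the forward return over `horizon` bars,
# optionally log-transformed or thresholded into a binary label.
import math

def fixed_horizon_labels(closes, horizon=10, method="returns", threshold=None):
    labels = []
    for i in range(len(closes) - horizon):
        fwd = closes[i + horizon] / closes[i] - 1.0   # simple forward return
        if method == "log_returns":
            value = math.log(1.0 + fwd)
        elif method == "binary":
            value = 1 if fwd > (threshold or 0.0) else 0
        else:
            value = fwd
        labels.append(value)
    return labels
```

Note that the last `horizon` rows have no label, which is why forward-looking labels must be dropped or masked before training.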
trend_scanning
classmethod
¶
Create trend scanning labeling config.
De Prado's trend scanning method using t-statistics.
Parameters¶
min_horizon : int
    Minimum lookforward period
max_horizon : int
    Maximum lookforward period
t_value_threshold : float
    T-statistic threshold for trend significance
Returns¶
LabelingConfig Configured for trend scanning method
Examples¶
config = LabelingConfig.trend_scanning(5, 20, 2.0)
Source code in src/ml4t/engineer/config/labeling.py
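The core idea is: for each candidate horizon, regress price on time, keep the horizon whose slope has the largest |t-statistic|, and label by the slope's sign if it clears the threshold. A stdlib sketch (function names are illustrative, and the real implementation differs in details such as tie-breaking and windowing):

```python
# Sketch of trend scanning: scan horizons in [min_h, max_h], pick the one
# with the largest |t-stat| of the OLS slope, and label by its sign.
import math

def slope_tstat(y):
    n = len(y)
    x = list(range(n))
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    resid = [yi - (my + beta * (xi - mx)) for xi, yi in zip(x, y)]
    se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
    if se == 0:                      # perfectly linear window
        return 0.0 if beta == 0 else math.copysign(float("inf"), beta)
    return beta / se

def trend_scanning_label(prices, start, min_h=5, max_h=20, t_threshold=2.0):
    best_t = 0.0
    for h in range(min_h, max_h + 1):
        window = prices[start : start + h + 1]
        if len(window) < h + 1:      # not enough data left for this horizon
            break
        t = slope_tstat(window)
        if abs(t) > abs(best_t):
            best_t = t
    if abs(best_t) >= t_threshold:
        return 1 if best_t > 0 else -1
    return 0
```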
DataContractConfig
¶
Bases: BaseConfig
Canonical dataframe column mapping shared across ML4T libraries.
from_mapping
classmethod
¶
from_ml4t_data
classmethod
¶
Create a contract from ml4t-data's canonical multi-asset schema.
Source code in src/ml4t/engineer/config/data_contract.py
PreprocessingConfig
¶
Bases: BaseConfig
Configuration for preprocessing (feature scaling).
Extends BaseConfig for full JSON/YAML serialization support.
Use create_scaler() to instantiate the configured scaler.
Attributes¶
scaler : str | None
    Scaler type: "standard", "minmax", "robust", or None (no scaling)
columns : list[str] | None
    Specific columns to scale (None = all numeric columns)
Standard Scaler Parameters¶
with_mean : bool
    Center features by removing the mean
with_std : bool
    Scale features to unit variance
MinMax Scaler Parameters¶
feature_range : tuple[float, float] Target range for scaling (default: (0.0, 1.0))
Robust Scaler Parameters¶
with_centering : bool
    Center features using the median
with_scaling : bool
    Scale features using the IQR
quantile_range : tuple[float, float]
    Quantile range for the IQR (default: (25.0, 75.0))
Examples¶
Standard scaling (z-score normalization)¶
config = PreprocessingConfig(scaler="standard")
scaler = config.create_scaler()
train_scaled = scaler.fit_transform(train_features)
test_scaled = scaler.transform(test_features)  # Uses train statistics
Robust scaling (outlier-resistant)¶
config = PreprocessingConfig.robust(quantile_range=(10.0, 90.0))
scaler = config.create_scaler()
Serialize for reproducibility¶
config.to_yaml("preprocessing.yaml")
standard
classmethod
¶
Create StandardScaler config.
Z-score normalization: (x - mean) / std
Parameters¶
with_mean : bool
    Center features by removing the mean
with_std : bool
    Scale features to unit variance
columns : list[str] | None
    Columns to scale (None = all)
Returns¶
PreprocessingConfig Configured for StandardScaler
Examples¶
config = PreprocessingConfig.standard()
scaler = config.create_scaler()
Source code in src/ml4t/engineer/config/preprocessing_config.py
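The key discipline this config enables is fit-on-train, transform-on-test: statistics come from the training split only, so no test-set information leaks into scaling. A minimal stdlib sketch of that pattern (the class here is a toy, not the library's scaler):

```python
# Toy z-score scaler illustrating the fit/transform split described above.
from statistics import mean, pstdev

class SimpleStandardScaler:
    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0   # guard against zero variance
        return self

    def transform(self, values):
        # Always uses the statistics captured at fit() time.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)
```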
minmax
classmethod
¶
Create MinMaxScaler config.
Scales features to [min, max] range.
Parameters¶
feature_range : tuple[float, float]
    Target range for scaling (default: (0.0, 1.0))
columns : list[str] | None
    Columns to scale (None = all)
Returns¶
PreprocessingConfig Configured for MinMaxScaler
Examples¶
config = PreprocessingConfig.minmax(feature_range=(-1.0, 1.0))
scaler = config.create_scaler()
Source code in src/ml4t/engineer/config/preprocessing_config.py
robust
classmethod
¶
robust(
with_centering=True,
with_scaling=True,
quantile_range=(25.0, 75.0),
columns=None,
**kwargs,
)
Create RobustScaler config.
Uses median and IQR, making it robust to outliers.
Parameters¶
with_centering : bool
    Center features using the median
with_scaling : bool
    Scale features using the IQR
quantile_range : tuple[float, float]
    Quantile range for the IQR (default: (25.0, 75.0))
columns : list[str] | None
    Columns to scale (None = all)
Returns¶
PreprocessingConfig Configured for RobustScaler
Examples¶
config = PreprocessingConfig.robust(quantile_range=(10.0, 90.0))
scaler = config.create_scaler()
Source code in src/ml4t/engineer/config/preprocessing_config.py
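Robust scaling replaces mean/std with median/IQR, so a few extreme returns do not dominate the scale. A stdlib sketch of the computation (illustrative; the quantile estimator here is `statistics.quantiles` with linear interpolation, which may differ slightly from the library's):

```python
# Sketch of robust scaling: (x - median) / IQR with a configurable
# quantile range, fit on `train` and applied to `values`.
from statistics import median, quantiles

def robust_scale(train, values, quantile_range=(25.0, 75.0)):
    center = median(train)
    qs = quantiles(train, n=100, method="inclusive")  # 99 percentile cuts
    lo, hi = quantile_range
    iqr = qs[int(hi) - 1] - qs[int(lo) - 1]           # e.g. qs[74] - qs[24]
    return [(v - center) / iqr for v in values]
```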
none
classmethod
¶
create_scaler
¶
Create scaler instance from stored parameters.
Returns¶
BaseScaler | None Configured scaler, or None if scaler="none"
Examples¶
config = PreprocessingConfig(scaler="standard")
scaler = config.create_scaler()
train_scaled = scaler.fit_transform(train_features)
test_scaled = scaler.transform(test_features)
Source code in src/ml4t/engineer/config/preprocessing_config.py
ExperimentConfig
dataclass
¶
Container for experiment configuration components.
Holds typed configuration objects for all experiment components, loaded from a single YAML file.
Attributes¶
features : list[dict]
    Feature specifications for compute_features()
labeling : LabelingConfig | None
    Labeling configuration (triple barrier, ATR, etc.)
preprocessing : PreprocessingConfig | None
    Preprocessing/scaler configuration
raw : dict
    Raw YAML content for any custom sections
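A single experiment YAML might combine the sections above roughly as follows. This layout is illustrative only: the field names mirror the attributes documented here, but the exact schema should be checked against the library.

```yaml
# experiment.yaml (illustrative layout, not a verbatim schema)
features:
  - name: rsi
    params: {period: 14}
  - name: macd
    params: {fast: 12, slow: 26}
labeling:
  method: triple_barrier
  upper_barrier: 0.02
  lower_barrier: 0.01
  max_holding_period: 20
preprocessing:
  scaler: standard
```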
Alternative Bars¶
bars
¶
Information-driven bars for financial data sampling.
This module implements various bar types that sample data based on information content rather than fixed time intervals:
Standard event-driven bars:
- Tick bars: sample every N ticks
- Volume bars: sample when volume reaches a threshold
- Dollar bars: sample when the dollar value traded reaches a threshold
Advanced information-driven bars:
- Imbalance bars: sample on order-flow imbalance (tick, volume, dollar)
- Run bars: sample on consecutive buy/sell runs (tick, volume, dollar)
The vectorized implementations are used by default for improved performance. Original implementations are available with the 'Original' suffix if needed.
Based on "Advances in Financial Machine Learning" by Marcos López de Prado.
BarSampler
¶
Bases: ABC
Abstract base class for bar samplers.
Bar samplers transform irregularly spaced tick data into regularly sampled bars based on various criteria (ticks, volume, etc).
sample
abstractmethod
¶
Sample bars from tick data.
Parameters¶
data : pl.DataFrame Tick data with columns: timestamp, price, volume, side include_incomplete : bool, default False Whether to include incomplete final bar
Returns¶
pl.DataFrame Sampled bars with OHLCV and additional information
Source code in src/ml4t/engineer/bars/base.py
FixedTickImbalanceBarSampler
¶
Bases: BarSampler
Sample bars using fixed tick imbalance threshold.
Unlike the adaptive AFML algorithm, this uses a fixed threshold that doesn't change during sampling. This avoids the threshold spiral issue that occurs with adaptive algorithms when order flow is imbalanced.
Recommended for production use - more stable and predictable than the adaptive version.
Parameters¶
threshold : int Fixed imbalance threshold. Bar forms when |Σ b_t| >= threshold. Typical values: 50-500 depending on desired bar frequency.
Calibration¶
To calibrate the threshold for N bars per day:
1. Compute the historical |mean imbalance| per tick
2. threshold ≈ ticks_per_day / N × |2P[b=1] - 1|
Or empirically: test a range and pick threshold giving desired bar count.
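Worked numerically, the rule above is a one-liner. With illustrative figures of 100,000 ticks/day, a target of 50 bars/day, and P[b=1] = 0.53 (these numbers are assumptions, not library defaults):

```python
# Worked version of the calibration rule:
# threshold ≈ (ticks_per_day / N) * |2 * P[b=1] - 1|

def calibrate_threshold(ticks_per_day, bars_per_day, p_buy):
    ticks_per_bar = ticks_per_day / bars_per_day
    return ticks_per_bar * abs(2 * p_buy - 1)

# 100,000 / 50 = 2,000 ticks per bar; |2*0.53 - 1| = 0.06; threshold = 120
threshold = calibrate_threshold(100_000, 50, 0.53)
```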
Examples¶
sampler = FixedTickImbalanceBarSampler(threshold=100)
bars = sampler.sample(tick_data)
Notes¶
Advantages over the adaptive (AFML) algorithm:
- No threshold spiral with imbalanced order flow
- Predictable bar count based on imbalance statistics
- No feedback loops: stable by construction
- Works consistently across all market conditions
Initialize fixed tick imbalance bar sampler.
Parameters¶
threshold : int Fixed imbalance threshold (positive integer)
Source code in src/ml4t/engineer/bars/imbalance.py
sample
¶
Sample fixed tick imbalance bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume, side
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame Sampled tick imbalance bars
Source code in src/ml4t/engineer/bars/imbalance.py
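The fixed-threshold rule is simple enough to sketch directly. This illustrative function (not the library's API) returns only the bar boundary indices; the real sampler also aggregates each bar into OHLCV:

```python
# Sketch of the fixed tick imbalance rule: accumulate signed ticks and
# close a bar whenever |sum of signs| reaches the (fixed) threshold.

def fixed_tick_imbalance_boundaries(sides, threshold):
    boundaries, theta = [], 0
    for i, b in enumerate(sides):      # b is +1 (buy) or -1 (sell)
        theta += b
        if abs(theta) >= threshold:
            boundaries.append(i)       # bar closes on this tick
            theta = 0                  # no feedback: the threshold never moves
    return boundaries
```

Because the threshold never adapts, bar frequency tracks the imbalance statistics of the data directly, which is the stability property claimed above.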
FixedVolumeImbalanceBarSampler
¶
Bases: BarSampler
Sample bars using fixed volume imbalance threshold.
Unlike the adaptive AFML algorithm, this uses a fixed threshold that doesn't change during sampling. This avoids instability issues that occur with adaptive algorithms.
Recommended for production use - more stable and predictable than the adaptive version.
Parameters¶
threshold : float Fixed volume imbalance threshold. Bar forms when |Σ b_t × v_t| >= threshold. Typical values: 10,000-1,000,000 depending on stock and desired frequency.
Calibration¶
To calibrate the threshold for N bars per day:
1. Compute the historical |mean signed volume| per tick
2. threshold ≈ ticks_per_day / N × E[|signed_volume|]
Or empirically: test a range and pick threshold giving desired bar count.
Examples¶
sampler = FixedVolumeImbalanceBarSampler(threshold=50000)
bars = sampler.sample(tick_data)
Notes¶
Advantages over the adaptive (AFML) algorithm:
- No threshold spiral or collapse
- Predictable bar count based on volume imbalance statistics
- No feedback loops: stable by construction
- Works consistently across all market conditions
Initialize fixed volume imbalance bar sampler.
Parameters¶
threshold : float Fixed volume imbalance threshold (positive)
Source code in src/ml4t/engineer/bars/imbalance.py
sample
¶
Sample fixed volume imbalance bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume, side
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame Sampled volume imbalance bars
Source code in src/ml4t/engineer/bars/imbalance.py
TickImbalanceBarSampler
¶
TickImbalanceBarSampler(
expected_ticks_per_bar,
alpha=0.1,
initial_p_buy=0.5,
min_bars_warmup=10,
)
Bases: BarSampler
Sample bars based on tick count imbalance (AFML-compliant TIBs).
Tick Imbalance Bars (TIBs) sample when the cumulative signed tick count (number of buys - number of sells) reaches a dynamically adjusted threshold.
AFML Threshold Formula
θ = Σ b_t (sum of trade signs)
E[θ_T] = E[T] × |2P[b=1] - 1|
Where
E[T] = EWMA of bar lengths (ticks per bar)
P[b=1] = probability of a buy
This produces bar counts comparable to tick bars (both count ticks), unlike Volume Imbalance Bars which have thresholds scaled by volume.
Parameters¶
expected_ticks_per_bar : int
    Expected number of ticks per bar (used to initialize E[T])
alpha : float, default 0.1
    EWMA decay factor for updating expectations
initial_p_buy : float, default 0.5
    Initial buy probability P[b=1]
min_bars_warmup : int, default 10
    Number of bars before starting EWMA updates
Examples¶
sampler = TickImbalanceBarSampler(
    expected_ticks_per_bar=1000,
    alpha=0.1,
)
bars = sampler.sample(tick_data)
References¶
.. [1] López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons. Chapter 2.3: Information-Driven Bars.
Initialize tick imbalance bar sampler.
Parameters¶
expected_ticks_per_bar : int
    Expected number of ticks per bar
alpha : float, default 0.1
    EWMA decay factor
initial_p_buy : float, default 0.5
    Initial buy probability P[b=1]
min_bars_warmup : int, default 10
    Number of bars before starting EWMA updates
Source code in src/ml4t/engineer/bars/imbalance.py
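The adaptive mechanism above can be sketched in a few lines. This is a simplified illustration of the EWMA threshold update, not the library's implementation (in particular, the real sampler applies a warmup period before updating, which this sketch omits, and floors the threshold at 1 here only to keep the toy well-defined):

```python
# Sketch of the AFML adaptive threshold: after each bar, EWMA-update E[T]
# and P[b=1]; the next bar closes when |theta| >= E[T] * |2*P[b=1] - 1|.

def ewma(prev, obs, alpha):
    return alpha * obs + (1 - alpha) * prev

def adaptive_tib_boundaries(sides, expected_t=100, p_buy=0.5, alpha=0.1):
    boundaries, theta, ticks, buys = [], 0, 0, 0
    for i, b in enumerate(sides):
        theta += b
        ticks += 1
        buys += (b == 1)
        threshold = expected_t * abs(2 * p_buy - 1)
        if abs(theta) >= max(threshold, 1):          # floor avoids a zero threshold
            boundaries.append(i)
            expected_t = ewma(expected_t, ticks, alpha)   # update E[T]
            p_buy = ewma(p_buy, buys / ticks, alpha)      # update P[b=1]
            theta, ticks, buys = 0, 0, 0
    return boundaries
```

Note how one-sided order flow drags P[b=1] toward 1, inflating the threshold bar after bar; this is the feedback loop that the fixed and window-based samplers above are designed to avoid.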
sample
¶
Sample tick imbalance bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume, side
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame
    Sampled tick imbalance bars with AFML diagnostic columns:
    - expected_t: E[T] at bar formation
    - p_buy: P[b=1] at bar formation
    - expected_imbalance: AFML threshold E[θ_T]
    - cumulative_theta: actual tick imbalance at bar formation
Source code in src/ml4t/engineer/bars/imbalance.py
WindowTickImbalanceBarSampler
¶
Bases: BarSampler
Sample tick imbalance bars using window-based estimation.
Alternative to α-based EWMA that uses rolling windows instead of exponential decay for parameter estimation.
Key differences from the α-based version:
- E[T] is computed from a rolling mean of the last N bar lengths
- P[b=1] is computed from a rolling mean of the last M tick signs
- Old data falls out of the windows → bounded adaptation → no threshold spiral
Parameters¶
initial_expected_t : int
    Initial expected ticks per bar (before the first bar forms)
bar_window : int, default 10
    Number of recent bars to average for E[T] estimation
tick_window : int, default 1000
    Number of recent ticks to average for P[b=1] estimation
Examples¶
sampler = WindowTickImbalanceBarSampler(
    initial_expected_t=1000,
    bar_window=10,     # E[T] from the last 10 bars
    tick_window=5000,  # P[b=1] from the last 5000 ticks
)
bars = sampler.sample(tick_data)
Notes¶
Recommended settings:
- bar_window: 5-20 (small, since bar count is limited)
- tick_window: 1000-10000 (large, for a stable P[b=1] estimate)
- initial_expected_t: rough estimate of ticks per bar
Source code in src/ml4t/engineer/bars/imbalance.py
sample
¶
Sample window-based tick imbalance bars from data.
Source code in src/ml4t/engineer/bars/imbalance.py
WindowVolumeImbalanceBarSampler
¶
Bases: BarSampler
Sample volume imbalance bars using window-based estimation.
Alternative to α-based EWMA that uses rolling windows instead of exponential decay for parameter estimation.
Key differences from the α-based version:
- E[T] is computed from a rolling mean of the last N bar lengths
- The imbalance factor is computed from a rolling mean of the last M signed volumes
- Old data falls out of the windows → bounded adaptation → no threshold spiral
Parameters¶
initial_expected_t : int
    Initial expected ticks per bar (before the first bar forms)
bar_window : int, default 10
    Number of recent bars to average for E[T] estimation
tick_window : int, default 1000
    Number of recent ticks to average for imbalance estimation
Examples¶
sampler = WindowVolumeImbalanceBarSampler(
    initial_expected_t=5000,
    bar_window=10,     # E[T] from the last 10 bars
    tick_window=5000,  # Imbalance from the last 5000 ticks
)
bars = sampler.sample(tick_data)
Source code in src/ml4t/engineer/bars/imbalance.py
sample
¶
Sample window-based volume imbalance bars from data.
Source code in src/ml4t/engineer/bars/imbalance.py
ImbalanceBarSamplerOriginal
¶
ImbalanceBarSamplerOriginal(
expected_ticks_per_bar,
alpha=0.1,
initial_p_buy=0.5,
min_bars_warmup=10,
)
Bases: BarSampler
Sample bars based on order flow imbalance (AFML-compliant).
Imbalance bars sample when the cumulative signed volume (buy - sell) reaches a dynamically adjusted threshold based on AFML Chapter 2.3.
AFML Threshold Formula
E[θ_T] = E[T] × |2v⁺ - E[v]|
Where
E[T] = EWMA of bar lengths (ticks per bar)
v⁺ = P[b=1] × E[v|b=1] = expected buy volume contribution
E[v] = unconditional mean volume per tick
Parameters¶
expected_ticks_per_bar : int
    Expected number of ticks per bar (used to initialize E[T])
alpha : float, default 0.1
    EWMA decay factor for updating expectations
initial_p_buy : float, default 0.5
    Initial buy probability P[b=1]
min_bars_warmup : int, default 10
    Number of bars before starting EWMA updates
Examples¶
sampler = ImbalanceBarSamplerOriginal(
    expected_ticks_per_bar=100,
    alpha=0.1,
)
bars = sampler.sample(tick_data)
References¶
.. [1] López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons. Chapter 2.3: Information-Driven Bars.
Source code in src/ml4t/engineer/bars/imbalance.py
sample
¶
Sample imbalance bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume, side
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame
    Sampled imbalance bars with AFML diagnostic columns:
    - expected_t: E[T] at bar formation
    - p_buy: P[b=1] at bar formation
    - v_plus: v⁺ = P[b=1] × E[v|b=1] at bar formation
    - e_v: E[v] at bar formation
    - expected_imbalance: AFML threshold E[θ_T]
    - cumulative_theta: actual imbalance at bar formation
Source code in src/ml4t/engineer/bars/imbalance.py
DollarRunBarSampler
¶
Bases: BarSampler
Sample bars based on cumulative dollar value runs (AFML-compliant).
AFML Chapter 2.3 formula with dollar weighting: θ_T = max{Σ(buy dollars in bar), Σ(sell dollars in bar)}
Parameters¶
expected_ticks_per_bar : int
    Expected number of ticks per bar
alpha : float, default 0.1
    EWMA decay factor
initial_p_buy : float, default 0.5
    Initial buy probability P[b=1]
min_bars_warmup : int, default 10
    Number of bars before starting EWMA updates
Examples¶
sampler = DollarRunBarSampler(expected_ticks_per_bar=100)
bars = sampler.sample(tick_data)
Source code in src/ml4t/engineer/bars/run.py
sample
¶
Sample dollar run bars from data.
Source code in src/ml4t/engineer/bars/run.py
TickRunBarSampler
¶
Bases: BarSampler
Sample bars based on cumulative tick runs (AFML-compliant).
AFML Chapter 2.3 formula:
θ_T = max{Σ(all buys in bar), Σ(all sells in bar)}
E[θ_T] = E[T] × max{P[b=1], 1 - P[b=1]}
CRITICAL: Uses CUMULATIVE tick counts within the bar. Direction changes DO NOT reset the counts - only bar boundaries do.
Parameters¶
expected_ticks_per_bar : int
    Expected number of ticks per bar (used to initialize E[T])
alpha : float, default 0.1
    EWMA decay factor for updating expectations
initial_p_buy : float, default 0.5
    Initial buy probability P[b=1]
min_bars_warmup : int, default 10
    Number of bars before starting EWMA updates
Examples¶
sampler = TickRunBarSampler(expected_ticks_per_bar=100)
bars = sampler.sample(tick_data)
References¶
.. [1] López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons. Chapter 2.3: Information-Driven Bars.
Source code in src/ml4t/engineer/bars/run.py
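The cumulative-count rule is worth seeing concretely, since it is the detail flagged as CRITICAL above: a direction change does not reset anything, only a bar close does. An illustrative sketch with a fixed run threshold (the adaptive E[θ_T] update is omitted for brevity):

```python
# Sketch of the AFML run statistic: within a bar, count buys and sells
# cumulatively and close the bar when max(buy_count, sell_count) reaches
# the threshold. Counts reset ONLY at bar boundaries.

def tick_run_boundaries(sides, threshold):
    boundaries, buys, sells = [], 0, 0
    for i, b in enumerate(sides):
        if b == 1:
            buys += 1
        else:
            sells += 1
        if max(buys, sells) >= threshold:   # theta_T = max{buys, sells}
            boundaries.append(i)
            buys, sells = 0, 0              # reset at bar close, never mid-bar
    return boundaries
```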
sample
¶
Sample tick run bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume, side
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame Sampled run bars with AFML diagnostic columns
Source code in src/ml4t/engineer/bars/run.py
VolumeRunBarSampler
¶
Bases: BarSampler
Sample bars based on cumulative volume runs (AFML-compliant).
AFML Chapter 2.3 formula with volume weighting:
θ_T = max{Σ(buy volumes in bar), Σ(sell volumes in bar)}
E[θ_T] = E[T] × max{P[b=1], 1 - P[b=1]} × E[v]
Where E[v] is estimated from the data.
Parameters¶
expected_ticks_per_bar : int
    Expected number of ticks per bar
alpha : float, default 0.1
    EWMA decay factor
initial_p_buy : float, default 0.5
    Initial buy probability P[b=1]
min_bars_warmup : int, default 10
    Number of bars before starting EWMA updates
Examples¶
sampler = VolumeRunBarSampler(expected_ticks_per_bar=100)
bars = sampler.sample(tick_data)
Source code in src/ml4t/engineer/bars/run.py
sample
¶
Sample volume run bars from data.
Source code in src/ml4t/engineer/bars/run.py
TickBarSamplerOriginal
¶
Bases: BarSampler
Sample bars based on number of ticks.
Tick bars sample the data every N ticks, providing a more stable sampling rate compared to time bars during varying market activity.
Parameters¶
ticks_per_bar : int Number of ticks per bar
Examples¶
sampler = TickBarSamplerOriginal(ticks_per_bar=100)
bars = sampler.sample(tick_data)
Initialize tick bar sampler.
Parameters¶
ticks_per_bar : int Number of ticks per bar
Source code in src/ml4t/engineer/bars/tick.py
sample
¶
Sample tick bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame Sampled tick bars
Source code in src/ml4t/engineer/bars/tick.py
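Tick-bar aggregation itself is just a grouped OHLCV reduction. An illustrative stdlib sketch (the real samplers operate on Polars DataFrames and also carry timestamps; incomplete trailing groups are dropped here, mirroring include_incomplete=False):

```python
# Sketch of tick-bar aggregation: group every N ticks and reduce each
# group to an OHLCV row (dicts here instead of a Polars DataFrame).

def tick_bars(prices, volumes, ticks_per_bar):
    bars = []
    for start in range(0, len(prices) - ticks_per_bar + 1, ticks_per_bar):
        chunk_p = prices[start : start + ticks_per_bar]
        chunk_v = volumes[start : start + ticks_per_bar]
        bars.append({
            "open": chunk_p[0],
            "high": max(chunk_p),
            "low": min(chunk_p),
            "close": chunk_p[-1],
            "volume": sum(chunk_v),
        })
    return bars
```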
DollarBarSampler
¶
Bases: BarSampler
Vectorized dollar bar sampler using Polars.
Parameters¶
dollars_per_bar : float Target dollar value per bar
Source code in src/ml4t/engineer/bars/vectorized.py
sample
¶
Sample dollar bars using vectorized operations.
Source code in src/ml4t/engineer/bars/vectorized.py
ImbalanceBarSampler
¶
Bases: BarSampler
Vectorized imbalance bar sampler with AFML-compliant adaptive thresholds.
This implementation uses vectorized operations for the main logic while keeping the adaptive threshold calculation efficient.
AFML Threshold Formula
E[θ_T] = E[T] × |2v⁺ - E[v]|
Where
E[T] = EWMA of bar lengths (ticks per bar) v⁺ = P[b=1] × E[v|b=1] = expected buy volume contribution E[v] = unconditional mean volume per tick
Parameters¶
expected_ticks_per_bar : int Expected number of ticks per bar (initializes E[T]) alpha : float, default 0.1 EWMA decay factor for updating expectations initial_p_buy : float, default 0.5 Initial buy probability P[b=1] min_bars_warmup : int, default 10 Number of bars before starting EWMA updates
Source code in src/ml4t/engineer/bars/vectorized.py
sample
¶
Sample imbalance bars using vectorized operations where possible.
Source code in src/ml4t/engineer/bars/vectorized.py
TickBarSampler
¶
Bases: BarSampler
Vectorized tick bar sampler using Polars.
Parameters¶
ticks_per_bar : int Number of ticks per bar
Source code in src/ml4t/engineer/bars/vectorized.py
sample
¶
Sample tick bars using vectorized operations.
Source code in src/ml4t/engineer/bars/vectorized.py
VolumeBarSampler
¶
Bases: BarSampler
Vectorized volume bar sampler using Polars.
This implementation replaces Python loops with vectorized Polars operations for dramatically improved performance on large datasets.
Parameters¶
volume_per_bar : float Target volume per bar
Source code in src/ml4t/engineer/bars/vectorized.py
sample
¶
Sample volume bars using vectorized operations.
Source code in src/ml4t/engineer/bars/vectorized.py
DollarBarSamplerOriginal
¶
Bases: BarSampler
Sample bars based on dollar value traded.
Dollar bars sample when the cumulative dollar value (price * volume) reaches a threshold, providing adaptive sampling based on both price and volume.
Parameters¶
dollars_per_bar : float Target dollar value per bar
Examples¶
sampler = DollarBarSamplerOriginal(dollars_per_bar=1_000_000)
bars = sampler.sample(tick_data)
Initialize dollar bar sampler.
Parameters¶
dollars_per_bar : float Target dollar value per bar
Source code in src/ml4t/engineer/bars/volume.py
sample
¶
Sample dollar bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame Sampled dollar bars with VWAP
Source code in src/ml4t/engineer/bars/volume.py
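Dollar-bar thresholding reduces to accumulating price × volume. An illustrative sketch returning boundary indices only (the real sampler aggregates OHLCV and VWAP per bar):

```python
# Sketch of dollar-bar boundaries: accumulate traded dollar value and
# close a bar when the running total crosses the target.

def dollar_bar_boundaries(prices, volumes, dollars_per_bar):
    boundaries, dollars = [], 0.0
    for i, (p, v) in enumerate(zip(prices, volumes)):
        dollars += p * v                 # dollar value of this tick
        if dollars >= dollars_per_bar:
            boundaries.append(i)
            dollars = 0.0                # reset for the next bar
    return boundaries
```

Volume bars follow the same pattern with `v` alone in place of `p * v`.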
VolumeBarSamplerOriginal
¶
Bases: BarSampler
Sample bars based on volume traded.
Volume bars sample when the cumulative volume reaches a threshold, providing more samples during high activity periods.
Parameters¶
volume_per_bar : float Target volume per bar
Examples¶
sampler = VolumeBarSamplerOriginal(volume_per_bar=10000)
bars = sampler.sample(tick_data)
Initialize volume bar sampler.
Parameters¶
volume_per_bar : float Target volume per bar
Source code in src/ml4t/engineer/bars/volume.py
sample
¶
Sample volume bars from data.
Parameters¶
data : pl.DataFrame
    Tick data with columns: timestamp, price, volume, side
include_incomplete : bool, default False
    Whether to include the incomplete final bar
Returns¶
pl.DataFrame Sampled volume bars with buy/sell volume breakdown
Source code in src/ml4t/engineer/bars/volume.py
Next Steps¶
- Read Features for the main computation workflow.
- Read Labeling for supervised target construction.
- Read Alternative Bars for information-driven sampling.
- Use the Book Guide to map these APIs back to the book and case studies.