Methodology Highlight
Demonstrates how label engineering, preprocessing choices, and point-in-time data management affect results as much as model selection -- with the strongest statistical foundation (10 CV folds) in the book.

This case study applies ML to the canonical factor investing question: can machine learning improve on traditional long-short decile sorts when accounting lags, survivorship bias, and transaction costs are taken seriously? Working with 57 firm-level characteristics spanning valuation, profitability, momentum, and risk across approximately 2,500 US stocks (1996-2016), this is the most feature-rich fundamental dataset in the book.
Students learn point-in-time data management with accounting publication lags, the impact of label engineering choices (classification vs regression), and the effect of preprocessing decisions like winsorization on model performance. The case study provides the strongest statistical foundation with 10 CV folds, enabling students to assess the reliability of walk-forward results.
The fundamental universe is the natural home for latent factor models (IPCA, CAE, SAE, SDF). Students learn how these architectures extract time-varying factor loadings from cross-sectional data, and how capacity constraints in small-cap stocks limit the practical scalability of strategies that appear strong on paper.

Strategy Summary

Long-short decile sort on approximately 2,500 US stocks using ML predictions on winsorized forward returns. Monthly rebalancing with 6-month characteristic lag enforced for point-in-time compliance. Equal-weight within deciles, dollar-neutral. 10 CV folds with 10-year training and 1-year validation windows provide the most robust statistical evaluation in the book.
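The decile mechanics described above can be sketched in a few lines; the function and the equal-weight, dollar-neutral construction are a minimal illustration, not the book's pipeline code:

```python
import numpy as np
import pandas as pd

def decile_long_short(preds: pd.DataFrame, fwd_rets: pd.DataFrame) -> pd.Series:
    """Monthly long-short decile return: long the top prediction decile,
    short the bottom decile, equal-weight within each leg (dollar-neutral)."""
    spreads = []
    for month in preds.index:
        p = preds.loc[month].dropna()
        r = fwd_rets.loc[month].reindex(p.index)
        # 0 = worst prediction decile, 9 = best
        deciles = pd.Series(pd.qcut(p, 10, labels=False), index=p.index)
        spreads.append(r[deciles == 9].mean() - r[deciles == 0].mean())
    return pd.Series(spreads, index=preds.index)
```

With perfect foresight (predictions equal to realized returns) the spread is positive every month, which makes a convenient sanity check.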

Data Sources

NASDAQ Data Link (firm characteristics)

ML Techniques

GBM classification
Latent factor models (IPCA, CAE, SAE, SDF)
Label engineering (classification vs regression)
Cross-sectional prediction

ML Pipeline

Universe & Setup
1 notebook
~2,500 US stocks filtered by price >$5 and ADV >$1M. Monthly decision cadence with 6-month accounting lag enforced for point-in-time compliance. The longest history (1996-2016) and most CV folds (10, with 10Y train / 1Y val) of any case study. Era-dependent costs: pre-decimalization spreads 15-30 bps, post-2001 spreads 5-15 bps. Anonymized NASDAQ Data Link characteristics.
Universe & Protocol Setup Ch 6
Defines the trading universe (US stocks filtered by price >$5 and ADV >$1M), monthly decision cadence, and 6-month accounting lag convention. Validates feasibility against transaction costs and builds walk-forward evaluation splits with purging. Documents the anonymized PERMNO-based identifier system.
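A minimal sketch of how such purged walk-forward splits could be generated. The 10-year train / 1-year validation windows come from the protocol above; the one-month purge gap is an assumption (the text does not specify its length). On 1996-2016 monthly data this scheme yields exactly 10 folds, consistent with the fold count quoted above:

```python
import pandas as pd

def walk_forward_splits(months, train_years=10, val_years=1, purge_months=1):
    """Yield (train, val) month blocks for walk-forward CV. A purge gap of
    purge_months keeps the 1-month forward-return label at the end of each
    training window from overlapping the validation window."""
    train_n, val_n = train_years * 12, val_years * 12
    start = 0
    while start + train_n + purge_months + val_n <= len(months):
        t_end = start + train_n
        v_start = t_end + purge_months
        yield months[start:t_end], months[v_start:v_start + val_n]
        start += val_n

months = list(pd.period_range("1996-01", "2016-12", freq="M"))
folds = list(walk_forward_splits(months))
```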
Labels & Evaluation
2 notebooks
1-month forward return as primary label with a winsorized classification variant that produces the largest label engineering effect in the book: IC jumps ninefold from 0.008 to 0.074. 3-month variant for horizon sensitivity. No temporal features (no Ch9 notebook) -- the 57 firm characteristics are rich enough. Evaluation validates classification dominance across all feature families.
Label Engineering Ch 7
Computes 1-month forward returns and classification labels from the Chen-Pelger-Zhu dataset, where characteristics at month t-1 are paired with returns during month t. Creates winsorized and median-split classification variants. The winsorized classification label (fwd_class_1m) produces a dramatic IC improvement over raw returns -- the largest label engineering effect in the book.
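The label variants can be sketched as follows; fwd_class_1m is named in the text above, but the 1st/99th-percentile winsorization bounds and the median-split threshold shown here are assumptions:

```python
import numpy as np
import pandas as pd

def make_labels(fwd_ret: pd.Series, lo=0.01, hi=0.99) -> pd.DataFrame:
    """Build label variants for one cross-section of 1-month forward returns:
    raw, winsorized (tails clipped at the lo/hi quantiles), and a binary
    classification label from a median split of the winsorized returns."""
    lo_v, hi_v = fwd_ret.quantile([lo, hi])
    wins = fwd_ret.clip(lo_v, hi_v)
    return pd.DataFrame({
        "fwd_ret_1m": fwd_ret,
        "fwd_wins_1m": wins,
        "fwd_class_1m": (wins > wins.median()).astype(int),
    })
```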
Feature Evaluation Ch 7
Evaluates financial features with HAC-adjusted IC against 1-month forward returns. Applies Benjamini-Hochberg FDR to control false discovery across the factor zoo. Assesses quantile monotonicity and fold-level stability across walk-forward folds. Produces triage decisions (PROCEED/REVISE/STOP) for Ch11 model selection.
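The FDR-control step can be sketched with a textbook Benjamini-Hochberg procedure applied to the per-feature IC p-values (the HAC adjustment that produces those p-values is omitted here, and the alpha level is illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.10):
    """Return a boolean mask of discoveries controlling the false discovery
    rate at level alpha: keep every p-value up to the largest rank k such
    that p_(k) <= alpha * k / m."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        keep[order[:k + 1]] = True
    return keep
```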
Feature Engineering
1 notebook
57 pre-computed Chen-Pelger-Zhu firm characteristics spanning five families: fundamentals (book-to-market, earnings yield, cash flow), profitability (ROE, ROA, gross margin), momentum (12-1 month, 6-month, reversal), risk (beta, idiosyncratic vol, leverage), and investment (asset growth, R&D, CapEx). No temporal feature notebook -- the feature set is entirely cross-sectional.
Feature Engineering Ch 8
Maps pre-computed Chen-Pelger-Zhu firm characteristics into five economic factor families: value (BEME, E2P), quality/profitability (ROE, OP), investment (NOA, DPI2A), momentum (r12_2, ST_REV), and risk/liquidity (Beta, IdioVol). Engineers composite factor scores (value+quality, value+momentum) and interaction features (value x quality, momentum x ivol) for downstream modeling.
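The composite and interaction construction can be sketched as follows; the column names come from the characteristic families above, while the per-month percentile ranking to a centered scale is an assumed normalization:

```python
import numpy as np
import pandas as pd

def engineer_composites(chars: pd.DataFrame) -> pd.DataFrame:
    """Composite scores and interactions from cross-sectionally ranked
    characteristics for one month. Ranking puts the raw characteristics on
    a comparable scale before adding or multiplying them."""
    ranked = chars.rank(pct=True) - 0.5  # map each column to (-0.5, 0.5]
    out = pd.DataFrame(index=chars.index)
    out["value_quality"] = ranked["BEME"] + ranked["ROE"]
    out["value_momentum"] = ranked["BEME"] + ranked["r12_2"]
    out["value_x_quality"] = ranked["BEME"] * ranked["ROE"]
    out["mom_x_ivol"] = ranked["r12_2"] * ranked["IdioVol"]
    return out
```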
Modeling
9 notebooks
The natural home for latent factor models with 5 dedicated notebooks: standard latent factors, IPCA (time-varying loadings), CAE (non-linear generalization of IPCA), SAE (return-targeted extraction), and SDF (pricing kernel from characteristics). GBM achieves IC +0.070 on classification labels. Classification dominates regression everywhere. Causal DML tests whether 12-1 momentum causes returns or reflects confounders.
Linear Models Ch 11
Trains Ridge, LASSO, and ElasticNet on the firm characteristics panel with walk-forward CV (10-year training windows). Reveals that classification labels dominate regression: winsorized classification produces dramatically higher IC than raw returns. Establishes the linear baseline that GBM and latent factor models must beat.
Gradient Boosting Ch 12
Trains LightGBM across leaf-count profiles and loss functions on fundamental and price features. Searches for value-quality-momentum interactions following Freyberger, Neuhierl, and Weber (2020). GBM on winsorized returns substantially outperforms the ridge baseline. The IC gap between raw and winsorized labels reinforces that label treatment is more impactful than model architecture.
Tabular Deep Learning (TabM) Ch 12
Trains the TabM rank-1 adapter MLP ensemble (small/medium/large) on the flat characteristic feature matrix as a neural alternative to GBM. Compares ensemble-based tabular DL against tree-based models where GBM already leads on both winsorized returns and classification labels.
Latent Factor Models Ch 14
Runs PCA, IPCA, CAE, SDF, and SAE latent factor extraction on the richest characteristic panel in the book. Compares pricing-theory-motivated (SDF) vs statistical (CAE/SAE) factor extraction and evaluates whether dimensionality reduction improves monthly return prediction over direct GBM/TabM.
IPCA Ch 14
Fits Instrumented PCA (Kelly, Pruitt, and Su 2019) with time-varying factor loadings as linear functions of observable characteristics. The natural model for this dataset: anonymized stock_id structure prevents standard PCA from tracking entities across months, making IPCA's characteristic-conditional approach essential. Estimates and interprets the Gamma matrix.
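The core of the model is r_{i,t+1} = z_{i,t}' Gamma f_{t+1} + e_{i,t+1}: factor loadings are linear in observed characteristics via Gamma. The factor step of the alternating estimation can be sketched as a cross-sectional regression (the Gamma update and the full ALS loop are omitted; this is an illustration, not the notebook's implementation):

```python
import numpy as np

def ipca_factor_step(Z, r, Gamma):
    """Recover factor realizations f given one month's characteristics Z
    (N x L), next-month returns r (N,), and Gamma (L x K): conditional
    loadings are beta = Z @ Gamma, and f solves the cross-sectional
    least-squares problem r ~ beta @ f."""
    beta = Z @ Gamma
    f, *_ = np.linalg.lstsq(beta, r, rcond=None)
    return f
```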
Conditional Autoencoder (CAE) Ch 14
Trains the Conditional Autoencoder (Gu, Kelly, and Xiu 2021), replacing IPCA's linear Gamma mapping with a neural network encoder for characteristic-to-factor-loading estimation. Evaluates whether non-linear interactions between characteristics (value-quality-momentum) improve prediction over the linear IPCA baseline on identical folds.
Supervised Autoencoder (SAE) Ch 14
Trains the Supervised Autoencoder combining reconstruction loss (preserve characteristic structure) with prediction loss (forecast returns) in a multi-task framework. The reconstruction term regularizes latent representations against overfitting to noisy monthly returns. Completes the autoencoder comparison before SDF provides the pricing-theory perspective.
SDF Network Ch 14
Trains the Stochastic Discount Factor network to learn portfolio weights satisfying the no-arbitrage condition E[M_t * R_{i,t}] = 0. Identifies which firms are exposed to systematic risk -- a pricing-theory complement to the supervised prediction objective used by IPCA, CAE, and SAE.
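A numpy sketch of the pricing-error moments such a network minimizes, with the SDF parameterized as M_t = 1 - w'R_t for portfolio weights w. The neural mapping from characteristics to w is omitted; this is the GMM-style objective under that parameterization, not the book's implementation:

```python
import numpy as np

def sdf_pricing_errors(weights, R):
    """Per-asset sample moments E_T[M_t * R_{i,t}] for the candidate SDF
    M_t = 1 - w'R_t, where R is a T x N panel of excess returns. The
    no-arbitrage condition requires these N moments to be zero."""
    M = 1.0 - R @ weights                 # T realizations of the SDF
    return (M[:, None] * R).mean(axis=0)  # N pricing errors
```

In the single-asset case the condition pins down the weight analytically, w = E[R] / E[R^2], which makes a simple check.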
Causal DML Ch 15
Applies DML to estimate the causal effect of 12-month momentum (r12_2) on monthly returns, conditioning on Fama-French confounders (Beta, IdioVol, LME, Variance). Finds a significant causal effect with partial confounding absorption -- supporting the behavioral explanation over the risk-premium explanation.
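The partialling-out estimator can be sketched with two-fold cross-fitting; OLS nuisance models stand in for the flexible ML learners a DML notebook would use, and the variable roles (d = r12_2, X = the confounders, y = monthly returns) follow the description above:

```python
import numpy as np

def dml_effect(X, d, y):
    """Cross-fit partialling-out DML: residualize treatment d and outcome y
    on confounders X with models fit on the other fold, then regress the
    outcome residuals on the treatment residuals."""
    def fit(Xa, ya):
        A = np.column_stack([np.ones(len(Xa)), Xa])
        coef, *_ = np.linalg.lstsq(A, ya, rcond=None)
        return coef

    def pred(Xa, coef):
        return np.column_stack([np.ones(len(Xa)), Xa]) @ coef

    half = len(y) // 2
    idx = np.arange(len(y))
    num = den = 0.0
    for tr, te in [(idx[:half], idx[half:]), (idx[half:], idx[:half])]:
        d_res = d[te] - pred(X[te], fit(X[tr], d[tr]))  # E[d|X] residual
        y_res = y[te] - pred(X[te], fit(X[tr], y[tr]))  # E[y|X] residual
        num += d_res @ y_res
        den += d_res @ d_res
    return num / den
```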
Strategy Pipeline
5 notebooks
Long-short decile strategy simulation. Holdout Sharpe +2.52 (17% decay -- mildest in book). Risk overlay lifts managed Sharpe to +3.92. Era-dependent cost analysis spanning pre- and post-decimalization. Critical capacity finding: 69% of long leg in bottom market-cap quartile where spreads run 100-500 bps.
Model Analysis Ch 11
Compares all model families on the classic factor zoo dataset. TabM leads on raw returns with the only compelling standalone decile spread. GBM on winsorized returns outperforms all family-label combinations. Documents that label design dominates model choice. Four families advance (TabM, GBM, SDF, causal DML), one excluded (linear).
Backtest & Signal Evaluation Ch 16
Runs a plumbing test, a parametric sweep across all prediction-signal combinations, and statistical analysis (DSR, family comparison) for a long-short decile strategy on US stocks. Translates the classification signal into simulated portfolio returns across the full backtest sweep with walk-forward CV.
Portfolio: Allocator Sweep Ch 17
Sweeps top predictions x TOP_K concentration x 6 allocators (equal-weight through HRP) on the stock universe. Evaluates concentration vs diversification tradeoffs: holding too few stocks sacrifices diversification, while holding too many dilutes signal. Compares risk-parity and score-weighted allocation against equal-weight.
Transaction Costs Ch 18
Sweeps the cost grid on top allocation combinations, confirming the most favorable cost profile in the book: monthly rebalancing produces far lower turnover than daily strategies, leaving ample cost headroom. Analyzes era-dependent costs spanning pre- and post-decimalization.
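The per-period cost deduction can be sketched as turnover times the half-spread; a flat spread_bps stands in here for the era-dependent grid (pre-decimalization 15-30 bps, post-2001 5-15 bps) described above:

```python
import numpy as np

def net_returns(gross, weights, spread_bps):
    """Deduct transaction costs from gross portfolio returns. Turnover is
    sum |w_t - w_{t-1}| per rebalance; each unit traded crosses half the
    quoted spread."""
    prev = np.vstack([np.zeros_like(weights[:1]), weights[:-1]])
    turnover = np.abs(weights - prev).sum(axis=1)
    cost = turnover * (spread_bps / 2) / 1e4
    return gross - cost
```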
Risk Management Ch 19
Sweeps position-level (stop-loss, trailing stops) and portfolio-level (drawdown breakers, daily loss limits) risk controls on the strongest strategy in the book. Evaluates whether risk overlays add robustness to an already stable monthly-cadence strategy with modest holdout decay.
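A minimal sketch of one portfolio-level control, a drawdown circuit breaker that goes flat for a cool-off period after a breach; the 10% threshold and 3-period cool-off are illustrative defaults, not the swept values:

```python
import numpy as np

def drawdown_breaker(returns, max_dd=0.10, cooloff=3):
    """Return the overlay-managed return stream: once the running drawdown
    breaches max_dd, replace the next `cooloff` returns with zero (flat)."""
    equity, peak, halt, out = 1.0, 1.0, 0, []
    for r in returns:
        if halt > 0:
            out.append(0.0)      # sidelined during cool-off
            halt -= 1
            continue
        equity *= 1.0 + r
        peak = max(peak, equity)
        out.append(r)
        if equity / peak - 1.0 <= -max_dd:
            halt = cooloff       # breach: go flat
    return np.array(out)
```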
Synthesis & Verdict
1 notebook
Strongest integrated result in the book. Verdict: Advance -- but filter universe to top three market-cap quartiles and re-evaluate under realistic capacity constraints. Factor attribution confirms genuine alpha (t=7.09).
Strategy Analysis Ch 20
Synthesizes the end-to-end results for the book's strongest case study. GBM champion on fwd_class_1m produces strong holdout Sharpe before capacity constraints. Produces the "advance" verdict with a capacity caveat: the long leg concentrates in small-cap stocks, requiring universe filtering for realistic deployment.