NASDAQ-100 Microstructure

Intraday microstructure signals across 114 stocks at 15-minute frequency

Equities 15-Minute Microstructure

Methodology Highlight

Shows how rebalance frequency and cost regime — not model quality — determine whether a statistically significant signal translates into a viable strategy.

This is the highest-frequency case study in the book, using AlgoSeek TAQ-derived 15-minute bars for 114 NASDAQ-100 constituents. Students learn to build features from order flow, quote staleness, relative spreads, and other microstructure indicators. With 88 features (66 microstructure features from four families plus 22 temporal features), this is the richest feature space in the book.
The case study demonstrates how rebalance frequency and transaction costs interact to determine strategy viability. The same prediction signal can be worthless or profitable depending on how frequently positions are adjusted and what cost model is applied. Students learn to use per-share cost models rather than basis-point approximations for intraday strategies.
With the largest training set in the book (13M+ observations across 114 stocks), this case study also illustrates the practical challenges of working with high-frequency data: storage, computation time, and the difference between statistical significance and economic significance.

Strategy Summary

Intraday rank-and-trade across 114 NASDAQ-100 stocks using 15-minute bars. Dollar-neutral long-short with equal-weight sizing within legs. Cost model includes spread, market impact, and commission components. Walk-forward evaluation uses 2 folds over 2020-2021. The critical design choice is rebalance cadence — the strategy is evaluated at multiple frequencies to understand the cost-signal tradeoff.
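The rank-and-trade step described above can be sketched as follows. This is a minimal illustration, not the case study's implementation: the function name `dollar_neutral_weights`, the parameter `k`, and the ticker symbols are hypothetical, and the gross exposure of 1.0 (0.5 per leg) is an assumption.

```python
def dollar_neutral_weights(scores, k):
    """Equal-weight long the k highest scores and short the k lowest.

    scores: dict mapping symbol -> model score for the current bar.
    Returns dict symbol -> weight; the long leg sums to +0.5 and the
    short leg to -0.5 (assumed gross exposure of 1.0).
    """
    if k < 1 or 2 * k > len(scores):
        raise ValueError("need 1 <= k <= len(scores) // 2")
    ranked = sorted(scores, key=scores.get)  # ascending by score
    shorts, longs = ranked[:k], ranked[-k:]
    weights = {sym: 0.0 for sym in scores}
    for sym in longs:
        weights[sym] = 0.5 / k
    for sym in shorts:
        weights[sym] = -0.5 / k
    return weights


# Hypothetical scores for six symbols at one bar.
scores = {"AAA": 0.9, "BBB": -0.4, "CCC": 0.1, "DDD": -0.7, "EEE": 0.3, "FFF": 0.0}
w = dollar_neutral_weights(scores, k=2)
net_exposure = sum(w.values())  # zero up to float rounding
```

Because both legs carry the same gross weight, net exposure is zero by construction; only the ranking, not the score magnitude, affects sizing.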

Data Sources

AlgoSeek (TAQ-derived bars)

ML Techniques

Order flow features, microstructure indicators, classification vs regression labels, GBM with large-scale data

ML Pipeline

Universe & Setup

1 notebook

114 NASDAQ-100 constituents from AlgoSeek TAQ minute bars, aggregated to 15-minute intervals. Highest-frequency case in the book. Dollar-neutral constraint, 5+ bps friction floor that dominates the cost model. Only 2 CV folds (6-month train/val) due to limited 2020-2021 history. Execution delay is 1 bar (15 minutes).

Universe & Protocol Setup Ch 6

Defines the NASDAQ-100 universe with 15-minute bar aggregation, close definition (last trade in bar), 1-bar execution delay, and next-bar-open fill assumption. Performs friction hurdle analysis comparing per-bar return distributions against typical large-cap spreads. Builds 2-fold walk-forward splits with purge/embargo calibrated for 15-minute horizon.
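One way to picture the purge/embargo logic is the sketch below. It works on integer bar indices only; the function name `purged_walk_forward`, its parameters, and the toy sizes are hypothetical and are not the notebook's actual configuration. The assumption is that `label_span` covers both the 1-bar execution delay and the forecast horizon.

```python
def purged_walk_forward(n_bars, train_size, val_size, label_span, embargo=0):
    """Return (train_range, val_range) index pairs in time order.

    label_span: bars a label looks ahead (execution delay + forecast horizon).
    embargo:    extra bars dropped between training and validation.
    """
    gap = label_span + embargo
    if train_size <= gap:
        raise ValueError("train_size must exceed label_span + embargo")
    splits = []
    start = 0
    while start + train_size + val_size <= n_bars:
        val_start = start + train_size
        # Purge: drop the last `gap` training bars so that no training
        # label is computed from prices inside the validation window.
        train = range(start, val_start - gap)
        val = range(val_start, val_start + val_size)
        splits.append((train, val))
        start += val_size  # roll forward by one validation block
    return splits


# Toy sizes chosen so that exactly two folds fit.
splits = purged_walk_forward(n_bars=100, train_size=40, val_size=30,
                             label_span=2, embargo=1)
```

With these toy sizes the last training bar of the first fold is index 36, whose label ends at bar 38, strictly before validation starts at bar 40.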

Labels & Evaluation

2 notebooks

15-minute forward return as primary label, computed from midprice (not last trade) to reduce microstructure noise. 5-minute and 60-minute variants for cadence analysis. 88 features (66 financial + 22 temporal) evaluated. Classification labels outperform regression labels, a pattern first observed here and repeated across case studies.

Label Engineering Ch 7

Computes midprice-based forward returns at 15-minute (primary), 5-minute, and 60-minute horizons using NBBO quotes to avoid bid-ask bounce contamination. Applies session-bounded forward returns to prevent overnight gap leakage. Builds a ternary classification label (fwd_dir_15m) with a threshold and generates walk-forward CV configuration.
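The session-bounded label construction can be illustrated with a small sketch. The bar layout (dicts with `session`, `bid`, `ask`), the function names, and the 5 bps threshold are assumptions for illustration; the source does not state the threshold actually used for fwd_dir_15m.

```python
def forward_mid_returns(bars, horizon=1):
    """Forward midprice return per bar for one symbol, in time order.

    Returns None when the bar `horizon` steps ahead belongs to a
    different session (or does not exist), so no label spans the
    overnight gap.
    """
    mids = [(b["bid"] + b["ask"]) / 2.0 for b in bars]
    out = []
    for i, bar in enumerate(bars):
        j = i + horizon
        if j < len(bars) and bars[j]["session"] == bar["session"]:
            out.append(mids[j] / mids[i] - 1.0)
        else:
            out.append(None)
    return out


def ternary_label(ret, threshold):
    """Map a forward return to -1 / 0 / +1; None stays None."""
    if ret is None:
        return None
    if ret > threshold:
        return 1
    if ret < -threshold:
        return -1
    return 0


# Made-up quotes: three bars in one session, one bar in the next.
bars = [
    {"session": "2020-01-02", "bid": 99.99, "ask": 100.01},
    {"session": "2020-01-02", "bid": 100.09, "ask": 100.11},
    {"session": "2020-01-02", "bid": 100.09, "ask": 100.12},
    {"session": "2020-01-03", "bid": 101.49, "ask": 101.51},
]
rets = forward_mid_returns(bars, horizon=1)
labels = [ternary_label(r, threshold=0.0005) for r in rets]  # hypothetical 5 bps band
```

The last bar of the first session gets no label even though the next bar shows a large move, because that move is an overnight gap rather than an intraday return.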

Feature Evaluation & Triage Ch 7

Evaluates all features (financial + temporal) against 15-minute forward midprice returns using HAC-adjusted IC (session-length Newey-West for overlap) with BH-FDR correction. Screens for coverage and staleness, assesses quantile monotonicity and cross-feature redundancy. Triages features into PROCEED / REVISE / STOP categories.
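The BH-FDR step can be sketched in a few lines. This is a plain implementation of the Benjamini-Hochberg step-up procedure; the feature names and p-values below are invented for illustration and are not results from the case study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below the cutoff.
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical per-feature p-values from an IC significance test.
names = ["rel_spread", "depth_imb", "signed_vol", "tick_imb", "rv_5"]
p_vals = [0.001, 0.008, 0.039, 0.041, 0.20]
keep = benjamini_hochberg(p_vals, alpha=0.05)
survivors = [n for n, k in zip(names, keep) if k]
```

In this toy example two features with raw p-values below 0.05 (0.039 and 0.041) do not survive the correction, which is the kind of filtering the triage step relies on when many features are tested at once.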

Feature Engineering

2 notebooks

66 microstructure features across four families: quote-based liquidity (midprice, spread dynamics, depth imbalance), trade-based flow (signed volume share, trade intensity), intraday volatility (Garman-Klass, realized variance), and regime proxies. 22 temporal features including HAR(5,15,60) volatility decomposition unique to intraday data. The richest per-bar feature set in the book.

Microstructure Feature Engineering Ch 8

Engineers features across four microstructure families from AlgoSeek minute bars: quote-based liquidity (midprice, spread, depth imbalance), order flow proxies (signed volume, tick imbalance, microprice deviation), volatility/impact (realized vol, Kyle's lambda, Amihud), and hidden liquidity (FINRA dark pool share). Computes multi-resolution features at 1-bar, 5-bar, 15-bar, and 60-bar lookbacks with 5-bar staleness caps on quote-derived fields.
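A few of the quote-based and impact features named above have standard textbook definitions, sketched here for a single quote snapshot. The function names and the input values are hypothetical, and the notebook's exact definitions (for example, how sizes are aggregated within a bar) may differ.

```python
def quote_features(bid, ask, bid_size, ask_size):
    """Quote-based liquidity features from one top-of-book snapshot."""
    mid = (bid + ask) / 2.0
    depth = bid_size + ask_size
    # Microprice: size-weighted price that leans toward the thinner side.
    microprice = (ask * bid_size + bid * ask_size) / depth
    return {
        "mid": mid,
        "rel_spread_bps": (ask - bid) / mid * 1e4,
        "depth_imbalance": (bid_size - ask_size) / depth,
        "microprice_dev_bps": (microprice - mid) / mid * 1e4,
    }


def amihud(bar_returns, dollar_volumes):
    """Amihud illiquidity: mean of |return| / dollar volume over bars."""
    ratios = [abs(r) / v for r, v in zip(bar_returns, dollar_volumes) if v > 0]
    return sum(ratios) / len(ratios) if ratios else None


# Made-up snapshot: 4-cent spread on a $100 stock, heavier bid side.
f = quote_features(bid=99.98, ask=100.02, bid_size=300, ask_size=100)
illiq = amihud([0.001, -0.002], [1_000_000.0, 2_000_000.0])
```

With more size on the bid than the ask, depth imbalance is positive and the microprice sits above the midprice, which is the direction an order-flow signal would read as buying pressure.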

Temporal Features Ch 9

Constructs temporal features using three approaches tailored to intraday data: HAR(5,15,60) volatility decomposition for multi-scale realized volatility, rolling FFT spectral features on volume and volatility profiles, and depth-2 path signatures on (price, signed_vol, trades) trajectories using causal rolling windows.
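The HAR components can be sketched as trailing averages of realized variance over three window lengths. The sketch assumes the windows (5, 15, 60) are counted in units of the input series' sampling interval; the source does not say whether they are minutes or bars. The function name `har_features` is hypothetical.

```python
def har_features(rv, windows=(5, 15, 60)):
    """Trailing-mean realized variance at each window length.

    rv: list of per-interval realized variances, in time order.
    For time t only rv[t-w+1 .. t] is used, so the features are causal.
    Returns None until the longest window is filled.
    """
    longest = max(windows)
    out = []
    for t in range(len(rv)):
        if t + 1 < longest:
            out.append(None)
            continue
        out.append(tuple(sum(rv[t + 1 - w: t + 1]) / w for w in windows))
    return out


# Toy input 1, 2, ..., 60 so the averages are easy to check by hand.
rv = [float(i) for i in range(1, 61)]
feats = har_features(rv)
short_avg, mid_avg, long_avg = feats[-1]
```

In a HAR regression these three averages serve as regressors for next-period volatility; here they are simply exposed as features for the downstream models.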

Modeling

7 notebooks

4 deep learning architectures tested (most of any case study): NLinear as minimal baseline, LSTM for order-flow memory, TCN for dilated causal convolutions, PatchTST for multi-scale attention. GBM trains on 13M+ samples (largest training set). Best IC only +0.008 (weakest in book). Causal DML tests whether signed volume share causes 15-minute returns.

Linear Models Ch 11

Trains Ridge, LASSO, and ElasticNet via walk-forward CV on microstructure features across NASDAQ-100 symbols at 15-minute frequency. Establishes the linear IC baseline for downstream model comparison and registers predictions for backtesting.

Gradient Boosting Ch 12

Trains LightGBM at 15-minute frequency across regularization profiles and loss functions (MSE, MAE, Huber). Evaluates IC at 50-iteration checkpoints and compares against the linear baseline.

NLinear (DL Baseline) Ch 13

Trains NLinear (last-value normalization plus single linear map) as the minimal temporal DL baseline for the NASDAQ-100 microstructure case. Establishes clean provenance for architecture comparison against LSTM, TCN, and PatchTST.
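The NLinear idea (subtract the last value, apply one linear map, add the last value back) fits in a few lines. This is a univariate, single-step, pure-Python sketch of the forward pass only; the function name and weights are hypothetical, and a real implementation would learn the weights in a deep learning framework.

```python
def nlinear_forecast(window, weights, bias=0.0):
    """One-step NLinear forward pass on a single series.

    window:  lookback values, oldest first.
    weights: one coefficient per lookback position.
    """
    if len(window) != len(weights):
        raise ValueError("window and weights must have the same length")
    last = window[-1]
    centered = [x - last for x in window]        # last-value normalization
    linear = sum(w * x for w, x in zip(weights, centered)) + bias
    return linear + last                         # undo the normalization


# With all-zero weights the model reduces to a last-value (naive) forecast.
naive = nlinear_forecast([1.0, 2.0, 4.0], [0.0, 0.0, 0.0])
tilted = nlinear_forecast([1.0, 2.0, 4.0], [0.5, 0.0, 0.0])
```

The zero-weight case shows why NLinear is a useful minimal baseline: it starts from the naive forecast and learns only deviations from it.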

LSTM Ch 13

Trains LSTM with gated memory on 60-bar lookback windows of microstructure features at 15-minute frequency. Tests whether recurrent memory captures short-lived order-flow and spread dynamics beyond memoryless models. Compares against NLinear, Ridge, and GBM results.

TCN Ch 13

Trains dilated causal TCN on 15-minute microstructure features. Tests whether multi-scale convolutional receptive fields capture intraday temporal dynamics that recurrent memory misses. Compares against LSTM, NLinear, and prior baselines on the same 15-minute label.

PatchTST Ch 13

Trains PatchTST with multi-scale patch attention on 15-minute bar sequences. Tests whether patching plus self-attention captures the microstructure edge more effectively than NLinear, LSTM, and TCN. Produces the final DL candidate for downstream model analysis and backtesting.

Causal DML Estimation Ch 15

Applies DML to signed volume share (buyer-initiated volume fraction via Lee-Ready) as treatment across NASDAQ-100 stocks at 15-minute frequency, with confounders: relative spread, 5-minute realized volatility, and 1-month cumulative return. Runs placebo refutation tests and quantifies confounding bias.
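The partialling-out logic behind DML, and what "quantifies confounding bias" means, can be shown on synthetic data. This sketch makes strong simplifying assumptions: one confounder, linear nuisance models in place of flexible ML learners, i.i.d. simulated data with interleaved folds (time-ordered folds would be needed on real bars), and a true effect of 2.0 chosen for the simulation. All names are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def dml_partial_out(y, d, x, n_folds=2):
    """Cross-fitted partialling-out estimate of the effect of d on y given x."""
    n = len(y)
    y_res, d_res = [0.0] * n, [0.0] * n
    for k in range(n_folds):
        test = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        xs = [x[i] for i in train]
        ay, by = fit_line(xs, [y[i] for i in train])  # nuisance model E[y | x]
        ad, bd = fit_line(xs, [d[i] for i in train])  # nuisance model E[d | x]
        for i in test:  # residualize out of fold
            y_res[i] = y[i] - (ay + by * x[i])
            d_res[i] = d[i] - (ad + bd * x[i])
    # Final stage: regress outcome residuals on treatment residuals.
    return sum(a * b for a, b in zip(d_res, y_res)) / sum(a * a for a in d_res)


rng = random.Random(0)
n = 4000
x = [rng.gauss(0, 1) for _ in range(n)]                 # confounder
d = [0.5 * xi + rng.gauss(0, 1) for xi in x]            # treatment depends on x
y = [2.0 * di + 3.0 * xi + rng.gauss(0, 0.1) for di, xi in zip(d, x)]

theta_dml = dml_partial_out(y, d, x)    # close to the simulated effect of 2.0
_, theta_naive = fit_line(d, y)         # biased upward by the confounder
confounding_bias = theta_naive - theta_dml
```

In this simulation the naive slope overstates the effect because the confounder drives both treatment and outcome; the gap between the two estimates is the confounding bias.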

Strategy Pipeline

5 notebooks

Despite the weakest IC, this case produces the highest Sharpe (+4.22) through frequency multiplication: 26 decisions per day (one per 15-minute bar in a 6.5-hour session) × 114 stocks gives very large effective breadth. The same signal is worthless when traded every bar and profitable when rebalanced hourly. Per-share cost models (not basis-point sweeps) reveal the relevant cost budget. This section includes the book's flagship cost analysis notebook.
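The breadth argument can be made concrete with the textbook form of the Fundamental Law, IR ≈ IC × sqrt(breadth). The sketch below uses the IC and universe size quoted on this page; the 252-day annualization and the assumption that every bet is independent are idealizations, so the result is best read as a rough upper bound rather than a forecast.

```python
import math

BARS_PER_DAY = 26     # decisions per session, from the text
N_STOCKS = 114        # universe size, from the text
TRADING_DAYS = 252    # assumption: conventional annualization
IC = 0.008            # best IC reported for this case study


def ideal_information_ratio(ic, breadth):
    """Fundamental Law of Active Management with fully independent bets."""
    return ic * math.sqrt(breadth)


daily_breadth = BARS_PER_DAY * N_STOCKS           # 2964 stock-bar decisions
annual_breadth = daily_breadth * TRADING_DAYS
ir_upper = ideal_information_ratio(IC, annual_breadth)  # roughly 6.9
```

Intraday bets on overlapping bars and correlated stocks are far from independent, and the bound ignores costs, so a realized Sharpe below this figure is what one would expect.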

Cross-Model Analysis Ch 11

Compares all model families (linear, GBM, DL, causal) on the highest-frequency case study using registry metrics, fold stability diagnostics, prediction bucket monotonicity, and regime-conditional IC. Produces per-family advancement recommendations for Ch 16 backtesting across 2 walk-forward folds.

Signal-Stage Backtest Ch 16

Runs plumbing test (random signal verification), then sweeps all predictions across signal methods using the ml4t-backtest engine with 15-minute OHLCV bars from AlgoSeek TAQ. Computes DSR, family comparison, and IC-to-Sharpe translation statistics. Registers results for downstream allocation and cost analysis.

Portfolio Allocator Sweep Ch 17

Sweeps top signal-stage predictions across TOP_K concentration levels and allocators (equal-weight, score-weighted, inverse-vol, risk-parity) for the NASDAQ-100 universe. Skips MVO/HRP due to computational cost at 15-minute cadence. Dollar-neutral constraint throughout.

Transaction Cost Analysis (Cadence Sweep) Ch 18

Runs two cost analyses: a standard bps cost grid on top allocation combos, and a cadence-by-per-share cost sweep that varies rebalance frequency from 15-minute to 4-hourly. The sweep uses a per-share cost model ($/share), which is more realistic for equities than a bps approximation. Identifies the viable implementation regime where the signal survives execution friction.
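The interaction between per-share costs and rebalance cadence can be sketched with simple arithmetic. Everything numeric here is hypothetical: the $0.005 per-share cost, the $100 price, the 390-minute session, and especially the assumption that turnover per rebalance stays constant across cadences (in practice it would tend to rise as the interval lengthens).

```python
SESSION_MINUTES = 390  # assumption: regular US session, 09:30-16:00


def per_share_to_bps(cost_per_share, price):
    """Convert a $/share cost into basis points of traded notional."""
    return cost_per_share / price * 1e4


def rebalances_per_day(interval_minutes):
    """Whole rebalance intervals that fit in one session."""
    return SESSION_MINUTES // interval_minutes


def daily_cost_bps(cost_per_share, price, interval_minutes, turnover_per_rebalance):
    """Cost drag in bps of capital per day at a given cadence."""
    return (per_share_to_bps(cost_per_share, price)
            * turnover_per_rebalance
            * rebalances_per_day(interval_minutes))


# Sweep cadence from every bar to 4-hourly with full turnover each time.
sweep = {
    interval: daily_cost_bps(0.005, 100.0, interval, turnover_per_rebalance=1.0)
    for interval in (15, 60, 240)
}
```

The same per-share cost is a different number of basis points at every price level, which is why a flat bps grid can misstate the budget; and the daily drag scales with the number of rebalances, which is why cadence rather than model quality ends up as the binding constraint.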

Risk Controls Ch 19

Sweeps position-level (stop-loss, trailing stop, time exit) and portfolio-level (drawdown breaker, daily loss limit) risk controls on top allocation combos with 15-minute OHLCV bars. Measures how each overlay modifies the equity curve and drawdown profile at intraday granularity.

Synthesis & Verdict

1 notebook

Weakest IC, highest Sharpe: the Fundamental Law of Active Management in action. Holdout Sharpe +2.05 (51% decay from +4.22, still strong). Verdict: Advance, but cadence optimization is the priority, not model improvement.

Strategy Synthesis & Verdict Ch 20

Assembles the full NASDAQ-100 pipeline verdict by tracing the DL regression champion through signal, allocation, cost, and risk stages via BacktestExplorer. Computes holdout performance, search risk accounting, and factor attribution. Produces a structured deployment verdict with cadence optimization as binding constraint.

Quick Info

Asset Class: Equities

Frequency: 15-Minute

Data Type: Microstructure

Notebooks: 18

Chapters: 13

Libraries Used

ML4T Data, ML4T Engineer

Chapters

6 Strategy Research Framework
7 Defining the Learning Task
8 Financial Feature Engineering
9 Model-Based Feature Extraction
11 The ML Pipeline
12 Advanced Models for Tabular Data
13 Deep Learning for Time Series
15 Causal Machine Learning
16 Strategy Simulation
17 Portfolio Construction
18 Transaction Costs
19 Risk Management
20 Strategy Synthesis