Learning Objectives
- Explain why recurrent sequence models became a computational and optimization bottleneck for long-context forecasting tasks
- Compare the main temporal modeling philosophies — decomposition-based, attention-based, state-space, and strong linear baselines — and explain when each is most appropriate
- Use strong baselines and diagnostics, including linear models and walk-forward evaluation, to judge whether sequence-model complexity is warranted
- Distinguish the design logic of modern time-series Transformer variants, including PatchTST, iTransformer, and TFT, and relate those choices to multivariate structure, covariates, and forecast horizon
- Decide when a financial prediction problem should be framed as direct panel regression with sequential inputs rather than multi-step time-series forecasting
- Evaluate time-series foundation model adaptation modes for financial applications, including the implications of transfer mismatch and pretraining contamination
- Apply practical uncertainty estimation methods, including MC Dropout and deep ensembles, to support risk-aware trading decisions
The Recurrent Paradigm and Its Discontents
The section examines LSTMs and GRUs as the historical default for time series forecasting, explaining the gating mechanisms that mitigate vanishing gradients while identifying two fundamental limitations that motivated the search for alternatives: the sequential computation bottleneck (an O(T) dependency that prevents temporal parallelism) and gradient flow degradation over very long sequences. It argues that financial data's long-memory properties — volatility clustering, persistent order flow, calendar effects spanning hundreds of trading days — push beyond what recurrent architectures can reliably capture through gradient-based learning.
1 notebook
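The sequential bottleneck is visible directly in the recurrence: each hidden state depends on the previous one, so the time loop cannot be parallelized. A minimal numpy sketch of a GRU step (toy dimensions and random, untrained weights — an illustration, not the chapter's notebook code):

```python
import numpy as np

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: gates mitigate vanishing gradients, but h_t still
    depends on h_{t-1}, forcing O(T) sequential computation."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x_t + Uz @ h_prev)          # update gate
    r = sig(Wr @ x_t + Ur @ h_prev)          # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))
    return (1 - z) * h_prev + z * h_tilde    # convex blend preserves gradient paths

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 16
params = [rng.standard_normal((d_h, d)) * 0.1 for d in (d_in, d_h) * 3]
h = np.zeros(d_h)
for t in range(T):                           # the unavoidable sequential loop
    h = gru_step(rng.standard_normal(d_in), h, *params)
```

The convex blend `(1 - z) * h_prev + z * h_tilde` is what lets gradients flow farther than in a vanilla RNN, yet the `for` loop over `T` is exactly the O(T) dependency the section describes.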
The Decomposition Philosophy: N-BEATS
N-BEATS encodes the inductive bias that time series are compositions of interpretable components (trend and seasonality) that neural networks should decompose explicitly rather than discover implicitly, using a hierarchical architecture of blocks, stacks, and doubly-residual connections that parallels boosting in tree-based methods. The section explains how basis expansion — predicting through learned coefficients weighting polynomial (trend) or Fourier (seasonality) basis functions — generalizes classical decomposition while enabling transparency rare among deep learning models. It covers the interpretable configuration (constrained trend and seasonality stacks) and extensions (N-BEATSx for exogenous variables, N-HiTS for multi-resolution forecasting), with practical caveats about lookback-to-horizon sensitivity and cases where financial regime changes violate both polynomial and Fourier assumptions.
1 notebook
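The basis-expansion idea can be sketched in a few lines: a block predicts coefficients theta, and the forecast is those coefficients applied to fixed polynomial or Fourier bases. In N-BEATS the theta values come from a fully connected block; here they are arbitrary constants chosen only for illustration:

```python
import numpy as np

def trend_basis(theta, horizon):
    """Polynomial trend basis: forecast = sum_i theta[i] * t**i on t in [0, 1)."""
    t = np.arange(horizon) / horizon
    return sum(theta[i] * t**i for i in range(len(theta)))

def seasonality_basis(theta, horizon):
    """Fourier seasonality basis: paired cos/sin harmonics weighted by theta."""
    t = np.arange(horizon) / horizon
    n_h = len(theta) // 2
    cos = np.stack([np.cos(2 * np.pi * (k + 1) * t) for k in range(n_h)])
    sin = np.stack([np.sin(2 * np.pi * (k + 1) * t) for k in range(n_h)])
    return theta[:n_h] @ cos + theta[n_h:] @ sin

trend = trend_basis(np.array([0.5, 1.0, -0.3]), horizon=12)
season = seasonality_basis(np.array([0.2, 0.0, 0.1, 0.0]), horizon=12)
forecast = trend + season                     # additive decomposition
```

The additive output makes each stack's contribution inspectable, which is the source of the transparency the section highlights — and also why a regime change that breaks the polynomial or Fourier assumption degrades the whole forecast.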
The Attention Revolution for Time Series
The section covers the adaptation of the Transformer architecture from NLP to time series through three key modifications: patching (segmenting input windows into subsequences to reduce computational cost), positional encoding (injecting temporal order into the permutation-invariant attention mechanism), and encoder-decoder structure with the choice between iterative and direct multi-horizon forecasting. It presents the self-attention mechanism's ability to create direct connections between any two positions regardless of distance, and surveys early efficiency-focused variants (Informer, Autoformer, FEDformer) that shared the assumption of tokenizing time series like language — an assumption that Section 13.4 would challenge.
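Patching is the simplest of the three modifications to show concretely: segment the input window into (possibly overlapping) subsequences so attention operates over far fewer tokens. A minimal sketch with assumed toy sizes (96-step window, length-16 patches, stride 8):

```python
import numpy as np

def make_patches(x, patch_len=16, stride=8):
    """Segment a length-L series into overlapping patches; attention then
    operates over N = (L - patch_len) // stride + 1 tokens instead of L."""
    n = (len(x) - patch_len) // stride + 1
    return np.stack([x[i * stride : i * stride + patch_len] for i in range(n)])

x = np.sin(np.linspace(0, 10, 96))       # toy input window, L = 96
patches = make_patches(x)                # 11 tokens of length 16
```

Quadratic attention cost now scales with 11 tokens rather than 96 time steps, which is the computational motivation the section gives for patching.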
The Great Debate: When Simple Outperforms Complex
Zeng et al. (2022) demonstrated that single-layer linear models (Linear, DLinear, NLinear) outperformed Transformer architectures by 20-50% across all nine LTSF benchmark datasets, with diagnostic experiments showing that Transformers were largely ignoring temporal order — performance was remarkably unaffected when input sequences were randomly shuffled. The section explains the core critique that self-attention's permutation-invariance is a fundamental mismatch for time series where temporal ordering itself carries the predictive signal. While acknowledging counterpoints (multivariate settings, tuning sensitivity, scaling potential), it establishes that any new architecture must now beat LTSF-Linear baselines, shifting the field's goal from making Transformers bigger to making them smarter.
1 notebook
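The NLinear baseline is small enough to write out in full: subtract the window's last value, apply one linear layer, add the last value back. A sketch on synthetic random-walk data (toy lookback, horizon, and sample counts chosen here for illustration), with the linear layer fit by least squares rather than gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, N = 24, 6, 500                         # lookback, horizon, samples
series = np.cumsum(rng.standard_normal(N + L + H) * 0.1)

# Build (window, target) pairs; NLinear removes the window's last value
# as a simple normalization before the single linear map.
X = np.stack([series[i : i + L] for i in range(N)])
Y = np.stack([series[i + L : i + L + H] for i in range(N)])
last = X[:, -1:]                             # per-window normalizer
W, *_ = np.linalg.lstsq(X - last, Y - last, rcond=None)  # one linear layer

pred = (X - last) @ W + last                 # NLinear forecast, shape (N, H)
```

A model this small is the bar the section describes: if a proposed architecture cannot beat it out of sample, the added complexity is not earning its keep.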
The Transformer's Evolution
The section presents three post-critique Transformer architectures that address the temporal inductive bias problem through different design choices: PatchTST (treating patches as tokens with channel-independence as regularization and RevIN for non-stationarity), iTransformer (inverting the attention dimension so it operates across variables rather than time steps, sidestepping the temporal order critique entirely), and TFT (learned variable selection for covariate-rich settings with native multi-horizon quantile output). It provides a comparative summary showing how each architecture targets different forecasting scenarios, while noting that on pre-engineered features, IC differences between architectures are often modest — confirming that the marginal benefit of temporal modeling depends on how much signal the feature pipeline has already extracted.
1 notebook
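RevIN, the non-stationarity fix mentioned for PatchTST, is easy to sketch: standardize each input instance, keep the statistics, and reverse the transform on the model's output. This sketch omits RevIN's learnable affine parameters and uses plain per-instance mean and standard deviation:

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """RevIN forward pass (sketch): standardize one input instance and
    keep its statistics so the forecast can be de-normalized later."""
    mu, sigma = x.mean(), x.std() + eps
    return (x - mu) / sigma, (mu, sigma)

def revin_denormalize(y, stats):
    """Reverse the normalization on the model's output."""
    mu, sigma = stats
    return y * sigma + mu

window = np.cumsum(np.random.default_rng(0).standard_normal(48))
norm, stats = revin_normalize(window)
restored = revin_denormalize(norm, stats)    # round-trips to the input
```

Because the statistics are per-instance rather than global, a level shift between training and inference windows is absorbed before the model ever sees the data — the property that makes RevIN useful under non-stationarity.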
The Full Toolkit: Alternative Architectures and Foundation Models
The section surveys TCNs, TSMixer, CNN-based approaches, hybrid statistical-neural models, and state space models (Mamba), positioning SSMs as particularly promising for very long sequences (>10K steps) due to their linear O(L) complexity with selective state spaces that filter noise contextually. The extensive treatment of time series foundation models covers the evolution from first-generation univariate-only models to second-generation multivariate systems (Chronos-2, Moirai-MoE), four adaptation modes (zero-shot, in-context learning, PEFT, model selection/ensembling), and the efficiency frontier challenging the scaling paradigm. The critical finance transfer gap evidence shows that off-the-shelf TSFMs underperform tree-based ensembles on return prediction but show genuine promise for volatility and VaR forecasting where target structure is more transferable.
5 notebooks
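The O(L) claim for state-space models follows from their recurrence: one pass over the sequence with constant work per step. The sketch below shows only the plain (non-selective) linear recurrence — it omits Mamba's input-dependent parameters and discretization, so it illustrates the complexity argument, not the full architecture:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Linear state-space recurrence: x_k = A x_{k-1} + B u_k, y_k = C x_k.
    One pass over the sequence gives O(L) cost, vs O(L^2) for full attention."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

A = np.diag([0.9, 0.5])                  # stable diagonal state transition
B = np.array([1.0, 1.0])
C = np.array([0.5, -0.5])
u = np.sin(np.arange(64) * 0.1)          # toy input sequence
y = ssm_scan(u, A, B, C)
```

Mamba's contribution is making A and B functions of the input, which is what lets the state "filter noise contextually" as the section puts it, while preserving this linear scan structure.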
A Practitioner's Framework
This section synthesizes the chapter's evidence into a three-step model selection process: establish strong baselines (seasonal naive through GBMs, with sample-size thresholds and foundation model tiers), diagnose the problem along five axes (univariate vs multivariate, known vs unknown relationships, forecast horizon, interpretability needs, sequence length), and apply a selection matrix matching problem characteristics to recommended architectures. The cross-dataset evidence shows the split is roughly even between tabular and DL models across eight case studies, with the practical decision rule being to test whether temporal ordering adds incremental signal beyond lag-feature engineering. The section also surveys the library landscape (Darts, NeuralForecast, PyTorch Forecasting, GluonTS, sktime) with guidance on when each is appropriate.
2 notebooks
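The first rung of the baseline ladder, seasonal naive, is worth seeing in code because it is so often the hardest forecast to beat. A minimal sketch (the season length of 5 here is an assumed weekly pattern in daily trading data):

```python
import numpy as np

def seasonal_naive(history, horizon, season=5):
    """Repeat the last full season forward -- the first baseline to beat."""
    last_season = history[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]

fcst = seasonal_naive(np.arange(10.0), horizon=7, season=5)
# [5, 6, 7, 8, 9, 5, 6]
```

Any candidate from the selection matrix should be scored against this (and the stronger Ridge/GBM rungs) under walk-forward evaluation before its complexity is accepted.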
Quantifying Prediction Uncertainty
The section develops two practical approaches for uncertainty estimation from deep learning models: MC Dropout (keeping dropout active at inference and running multiple forward passes to approximate Bayesian posterior sampling) and Deep Ensembles (training models with different initializations and using their disagreement as an uncertainty measure with formal epistemic-aleatoric decomposition). It demonstrates that MC Dropout uncertainty correlates meaningfully with actual prediction errors, making it practically useful for filtering unreliable forecasts, while deep ensembles of 5 members often provide the best tradeoff between accuracy, calibration, and compute cost. The section notes that foundation models exhibit systematic miscalibration under financial regime shifts, recommending conformal calibration from Chapter 11 as a distribution-free correction.
1 notebook
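MC Dropout reduces to a small change at inference time: leave dropout on and aggregate many stochastic passes. A one-hidden-layer numpy sketch with random, untrained toy weights (in practice you would reuse the trained network's own dropout layers):

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.2, n_samples=200, rng=None):
    """MC Dropout: keep dropout active at inference, run many stochastic
    forward passes, and use their spread as an uncertainty estimate."""
    if rng is None:
        rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
        mask = rng.random(h.shape) > p       # dropout stays on at inference
        h = h * mask / (1.0 - p)             # inverted-dropout scaling
        preds.append(h @ W2)
    preds = np.asarray(preds)
    return preds.mean(), preds.std()         # point forecast, uncertainty

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 16)) * 0.3      # toy weights, not trained
W2 = rng.standard_normal(16) * 0.3
mean, unc = mc_dropout_predict(rng.standard_normal(4), W1, W2, rng=rng)
```

The `std` output is what the section proposes thresholding on: discard or down-weight trades where the model's own sampled disagreement is high.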
Cross-Dataset Insights
Aggregating walk-forward results across eight case studies against Ridge and GBM baselines produces a sobering headline: DL rarely outperforms strong tabular baselines. The clearest DL-positive case is crypto perpetuals funding (LSTM IC +0.030 vs GBM +0.023), where the 8-hourly frequency generates temporal structure not fully captured by cross-sectional features. No single DL architecture dominates — LSTM, PatchTST, TSMixer, and NLinear each win at least one case study. The primary takeaway is methodological: the cross-sectional feature engineering in Chapters 8-9, which constructs lagged, windowed, and differenced inputs, already encodes much of the temporal information that DL architectures would need to learn from raw sequences, so DL adds complexity without adding signal when baseline features are strong.
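The takeaway about feature engineering is concrete: lagged, windowed, and differenced inputs hand a tabular model the same temporal information a sequence model would have to learn. A minimal sketch of such a feature block (the lag set and 21-day volatility window are assumed values for illustration):

```python
import numpy as np

def lag_features(prices, lags=(1, 5, 21), vol_window=21):
    """Lagged log returns and rolling volatility: tabular features that
    already encode much of the temporal structure a sequence model
    would otherwise have to learn from raw inputs."""
    ret = np.diff(np.log(prices))
    T = len(ret)
    start = max(max(lags), vol_window)           # first fully defined row
    feats = {f"ret_lag_{k}": ret[start - k : T - k] for k in lags}
    feats["vol_21"] = np.array([ret[i - vol_window : i].std()
                                for i in range(start, T)])
    return feats

rng = np.random.default_rng(0)
prices = np.exp(np.cumsum(rng.standard_normal(101) * 0.01))
feats = lag_features(prices)                     # aligned feature columns
```

Once a GBM or Ridge model sees columns like these, the marginal signal left for an LSTM or PatchTST to extract from the raw sequence is small — which is the mechanism behind the roughly even tabular/DL split across the eight case studies.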
Related Case Studies
See where these chapter concepts get applied in end-to-end trading workflows.
ETF Cross-Asset Exposures
All six model families compared across 100 ETFs spanning 9 asset classes
Crypto Perpetuals Funding
Alternative data and non-standard frequencies in 24/7 crypto markets
NASDAQ-100 Microstructure
Intraday microstructure signals across 114 stocks at 15-minute frequency
S&P 500 Equity + Option Analytics
Combining options-derived features with equity data for multi-source prediction
FX Spot Pairs
Momentum and carry factors in the world's most liquid market
CME Futures
Carry signals across 30 products — data quality as the critical variable
S&P 500 Options (Straddles)
Direct options trading and why equity-style cost models fail for options
US Equities Panel
Large-scale cross-sectional prediction across 3,200 stocks with 16 walk-forward folds