Chapter 13

Deep Learning for Time Series

9 sections · 18 notebooks · 28 references

Learning Objectives

  • Explain why recurrent sequence models became a computational and optimization bottleneck for long-context forecasting tasks
  • Compare the main temporal modeling philosophies — decomposition-based, attention-based, state-space, and strong linear baselines — and explain when each is most appropriate
  • Use strong baselines and diagnostics, including linear models and walk-forward evaluation, to judge whether sequence-model complexity is warranted
  • Distinguish the design logic of modern time-series Transformer variants, including PatchTST, iTransformer, and TFT, and relate those choices to multivariate structure, covariates, and forecast horizon
  • Decide when a financial prediction problem should be framed as direct panel regression with sequential inputs rather than multi-step time-series forecasting
  • Evaluate time-series foundation model adaptation modes for financial applications, including the implications of transfer mismatch and pretraining contamination
  • Apply practical uncertainty estimation methods, including MC Dropout and deep ensembles, to support risk-aware trading decisions
Figure 13.4
13.1

The Recurrent Paradigm and Its Discontents

The section examines LSTMs and GRUs as the historical default for time series forecasting, explaining the gating mechanism that addresses vanishing gradients but identifying two fundamental limitations that motivated the search for alternatives: the sequential computation bottleneck (O(T) dependency preventing temporal parallelism) and gradient flow degradation over very long sequences. It argues that financial data's long-memory properties — volatility clustering, persistent order flow, calendar effects spanning hundreds of trading days — push beyond what recurrent architectures can reliably capture through gradient-based learning.
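The O(T) sequential bottleneck is easiest to see in code: each hidden state depends on the previous one, so the time loop cannot be parallelized across steps. A minimal numpy sketch of a GRU-style cell (toy weights and dimensions, not the chapter's notebook code):

```python
import numpy as np

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: gating mitigates vanishing gradients,
    but h_t still depends on h_{t-1}, forcing a serial loop."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x + Uz @ h)                 # update gate
    r = sig(Wr @ x + Ur @ h)                 # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d, T = 8, 500                                # hidden size, sequence length
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
xs = rng.normal(size=(T, d))

h = np.zeros(d)
for x in xs:                                 # O(T) serial chain: no temporal parallelism
    h = gru_step(h, x, *params)
```

The loop is the point: unlike attention, which computes all pairwise interactions at once, the recurrence must visit the T steps in order, and gradients must flow back through the same chain.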

1 notebook

13.2

The Decomposition Philosophy: N-BEATS

N-BEATS encodes the inductive bias that time series are compositions of interpretable components (trend and seasonality) that neural networks should decompose explicitly rather than discover implicitly, using a hierarchical architecture of blocks, stacks, and doubly-residual connections that parallels boosting in tree-based methods. The section explains how basis expansion — predicting through learned coefficients weighting polynomial (trend) or Fourier (seasonality) basis functions — generalizes classical decomposition while enabling transparency rare among deep learning models. It covers the interpretable configuration (constrained trend and seasonality stacks) and extensions (N-BEATSx for exogenous variables, N-HiTS for multi-resolution forecasting), with practical caveats about lookback-to-horizon sensitivity and cases where financial regime changes violate both polynomial and Fourier assumptions.

1 notebook

13.3

The Attention Revolution for Time Series

The section covers the adaptation of the Transformer architecture from NLP to time series through three key modifications: patching (segmenting input windows into subsequences to reduce computational cost), positional encoding (injecting temporal order into the permutation-invariant attention mechanism), and encoder-decoder structure with the choice between iterative and direct multi-horizon forecasting. It presents the self-attention mechanism's ability to create direct connections between any two positions regardless of distance, and surveys early efficiency-focused variants (Informer, Autoformer, FEDformer) that shared the assumption of tokenizing time series like language — an assumption that Section 13.4 would challenge.
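Two of the three modifications are mechanical enough to sketch: patching reduces the number of attention tokens, and sinusoidal positional encoding injects order that attention alone would ignore. A minimal numpy illustration (toy window, standard Transformer PE formula):

```python
import numpy as np

def patchify(x, patch_len=16, stride=8):
    """Split a univariate window into overlapping patches (the 'tokens').
    Fewer tokens means cheaper attention: O((T/stride)^2) vs O(T^2)."""
    starts = range(0, len(x) - patch_len + 1, stride)
    return np.stack([x[s:s + patch_len] for s in starts])

def sinusoidal_pe(n_pos, d_model):
    """Sinusoidal positional encoding: injects temporal order into the
    otherwise permutation-invariant attention mechanism."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / (10000 ** (2 * (i // 2) / d_model))
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

x = np.sin(np.linspace(0, 10, 96))     # toy input window, T=96
patches = patchify(x)                  # 11 patches of length 16
tokens = patches + sinusoidal_pe(len(patches), 16)   # order-aware tokens
```

Without the positional term, shuffling the rows of `patches` would leave the attention computation unchanged, which is exactly the permutation-invariance that Section 13.4's critique targets.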

13.4

The Great Debate: When Simple Outperforms Complex

Zeng et al. (2022) demonstrated that single-layer linear models (Linear, D-Linear, N-Linear) outperformed Transformer architectures by 20-50% across all nine LTSF benchmark datasets, with diagnostic experiments showing that Transformers were largely ignoring temporal order — performance was remarkably unaffected when input sequences were randomly shuffled. The section explains the core critique that self-attention's permutation-invariance is a fundamental mismatch for time series where temporal ordering itself carries the predictive signal. While acknowledging counterpoints (multivariate settings, tuning sensitivity, scaling potential), it establishes that any new architecture must now beat LTSF-Linear baselines, shifting the field's goal from making Transformers bigger to making them smarter.
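The D-Linear baseline itself is small enough to sketch in full: decompose the lookback into trend (moving average) and remainder, apply one linear map per component from lookback length L to horizon H, and sum. The weights below are random stand-ins for trained parameters:

```python
import numpy as np

def moving_avg(x, kernel=25):
    """Trend extraction by moving average with edge padding, as in D-Linear."""
    pad = kernel // 2
    xp = np.pad(x, (pad, kernel - 1 - pad), mode='edge')
    return np.convolve(xp, np.ones(kernel) / kernel, mode='valid')

def dlinear_forecast(x, W_trend, W_seas):
    """D-Linear forward pass: one linear layer per decomposed component.
    Each W_* maps the length-L lookback directly to a length-H forecast."""
    trend = moving_avg(x)
    seasonal = x - trend
    return W_trend @ trend + W_seas @ seasonal

rng = np.random.default_rng(1)
L, H = 96, 24
x = np.cumsum(rng.normal(size=L))             # toy random-walk lookback
W_t = rng.normal(scale=1 / L, size=(H, L))    # stand-ins for trained weights
W_s = rng.normal(scale=1 / L, size=(H, L))
yhat = dlinear_forecast(x, W_t, W_s)
```

That this two-matrix model beat far larger Transformers is the substance of the critique; it also makes the baseline essentially free to run, which is why the section treats beating it as a minimum bar.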

1 notebook

13.5

The Transformer's Evolution

The section presents three post-critique Transformer architectures that address the temporal inductive bias problem through different design choices: PatchTST (treating patches as tokens with channel-independence as regularization and RevIN for non-stationarity), iTransformer (inverting the attention dimension so it operates across variables rather than time steps, sidestepping the temporal order critique entirely), and TFT (learned variable selection for covariate-rich settings with native multi-horizon quantile output). It provides a comparative summary showing how each architecture targets different forecasting scenarios, while noting that on pre-engineered features, IC differences between architectures are often modest — confirming that the marginal benefit of temporal modeling depends on how much signal the feature pipeline has already extracted.
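Of PatchTST's ingredients, RevIN is the most broadly reusable: normalize each input window by its own statistics so the backbone sees a stationarized series, then restore the statistics on the forecast. A minimal single-series sketch (the stand-in "forecast" is a naive last-value carry-forward, just to show the round trip):

```python
import numpy as np

class RevIN:
    """Reversible instance normalization (sketch): per-window statistics
    are removed before the model and restored on its output."""
    def __init__(self, eps=1e-5):
        self.eps = eps
    def normalize(self, x):
        self.mu, self.sigma = x.mean(), x.std() + self.eps
        return (x - self.mu) / self.sigma
    def denormalize(self, y):
        return y * self.sigma + self.mu

rev = RevIN()
x = 100 + np.cumsum(np.random.default_rng(2).normal(size=96))  # drifting level
x_norm = rev.normalize(x)               # what the backbone model sees
yhat_norm = np.full(24, x_norm[-1])     # stand-in forecast in normalized space
yhat = rev.denormalize(yhat_norm)       # restored to the original scale
```

Because the statistics are recomputed per window, a level shift between training and inference windows is absorbed by the normalization rather than forcing the model to extrapolate, which is why RevIN helps under the non-stationarity typical of financial series.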

1 notebook

13.6

The Full Toolkit: Alternative Architectures and Foundation Models

The section surveys TCNs, TSMixer, CNN-based approaches, hybrid statistical-neural models, and state space models (Mamba), positioning SSMs as particularly promising for very long sequences (>10K steps) due to their linear O(L) complexity, with selective state spaces that filter noise contextually. The extensive treatment of time series foundation models covers the evolution from first-generation univariate-only models to second-generation multivariate systems (Chronos-2, Moirai-MoE), four adaptation modes (zero-shot, in-context learning, PEFT, model selection/ensembling), and the efficiency frontier challenging the scaling paradigm. Critically, the evidence on the finance transfer gap shows that off-the-shelf TSFMs underperform tree-based ensembles on return prediction, though they show genuine promise for volatility and VaR forecasting, where target structure is more transferable.
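The O(L) claim for SSMs follows from their form: a discrete linear recurrence whose per-step cost is constant, so a 10K-step sequence costs 10K steps (and, for fixed dynamics, the same computation can be re-expressed as a convolution and parallelized). A toy numpy sketch of the recurrence, with made-up stable dynamics and no selectivity:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Discrete linear state-space recurrence:
        x_t = A x_{t-1} + B u_t,   y_t = C x_t
    Cost is O(L) in sequence length with a constant-size state."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(3)
n, L = 4, 10_000
A = 0.9 * np.eye(n)                # toy stable dynamics (|eigenvalues| < 1)
B = rng.normal(size=n)
C = rng.normal(size=n)
u = rng.normal(size=L)
y = ssm_scan(A, B, C, u)           # a 10K-step sequence in linear time
```

Selective SSMs like Mamba make A and B input-dependent, which is what lets the state "filter noise contextually" rather than applying the same dynamics to every step; that extension is beyond this sketch.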

5 notebooks

13.7

A Practitioner's Framework

This section synthesizes the chapter's evidence into a three-step model selection process: establish strong baselines (seasonal naive through GBMs, with sample-size thresholds and foundation model tiers), diagnose the problem along five axes (univariate vs multivariate, known vs unknown relationships, forecast horizon, interpretability needs, sequence length), and apply a selection matrix matching problem characteristics to recommended architectures. The cross-dataset evidence shows the split is roughly even between tabular and DL models across eight case studies, with the practical decision rule being to test whether temporal ordering adds incremental signal beyond lag-feature engineering. The section also surveys the library landscape (Darts, NeuralForecast, PyTorch Forecasting, GluonTS, sktime) with guidance on when each is appropriate.
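The walk-forward evaluation underlying step one is simple to get wrong (leaking future data into training) and simple to get right. A minimal expanding-window splitter, written here as an illustration rather than taken from any of the listed libraries:

```python
import numpy as np

def walk_forward_splits(n, train_size, test_size, step=None):
    """Expanding-window walk-forward splits: the training window always
    ends strictly before the test window begins, so no future information
    leaks into model fitting."""
    step = step or test_size
    start = train_size
    while start + test_size <= n:
        yield np.arange(0, start), np.arange(start, start + test_size)
        start += step

y = np.arange(100, dtype=float)        # toy target series
splits = list(walk_forward_splits(len(y), train_size=60, test_size=10))
for train_idx, test_idx in splits:
    assert train_idx.max() < test_idx.min()   # temporal ordering preserved
```

Running every candidate, from seasonal naive up through GBMs and sequence models, over the same splits is what makes the "does complexity pay?" comparison in the selection matrix fair.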

2 notebooks

13.8

Quantifying Prediction Uncertainty

The section develops two practical approaches for uncertainty estimation from deep learning models: MC Dropout (keeping dropout active at inference and running multiple forward passes to approximate Bayesian posterior sampling) and Deep Ensembles (training models with different initializations and using their disagreement as an uncertainty measure with formal epistemic-aleatoric decomposition). It shows that MC Dropout uncertainty correlates meaningfully with actual prediction errors, making it practically useful for filtering unreliable forecasts, while deep ensembles of five members often provide the best tradeoff among accuracy, calibration, and compute cost. The section notes that foundation models exhibit systematic miscalibration under financial regime shifts, recommending conformal calibration from Chapter 11 as a distribution-free correction.
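Both ideas reduce to "many stochastic predictions, then look at the spread." A toy numpy sketch using a linear model in place of a trained network (random weights, inverted-dropout scaling; the ensemble perturbation is a stand-in for independently trained models):

```python
import numpy as np

rng = np.random.default_rng(4)

def predict_with_dropout(x, W, p=0.5, n_samples=100):
    """MC Dropout sketch: keep dropout active at inference and treat the
    spread across stochastic forward passes as predictive uncertainty."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(W.shape) > p           # random unit dropout
        preds.append(((W * mask) / (1 - p)) @ x) # inverted-dropout rescaling
    preds = np.array(preds)
    return preds.mean(), preds.std()             # point forecast, uncertainty

W = rng.normal(size=(1, 16))                     # stand-in for trained weights
x = rng.normal(size=16)
mu, sigma = predict_with_dropout(x, W)

# Deep-ensemble analogue: disagreement across independently trained models
# (here faked by perturbing W; in practice, retrain from different seeds)
ensemble_preds = np.array([((W + rng.normal(scale=0.1, size=W.shape)) @ x)[0]
                           for _ in range(5)])
epistemic = ensemble_preds.std()                 # between-model disagreement
```

The filtering rule the section describes is then a threshold on `sigma` (or `epistemic`): trade only the forecasts whose uncertainty falls below it, and hand the intervals to conformal calibration when regime shifts make the raw spread miscalibrated.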

1 notebook

13.9

Cross-Dataset Insights

Aggregating walk-forward results across eight case studies against Ridge and GBM baselines produces a sobering headline: DL rarely outperforms strong tabular baselines. The clearest DL-positive case is crypto perpetuals funding (LSTM IC +0.030 vs GBM +0.023), where the 8-hourly frequency generates temporal structure not fully captured by cross-sectional features. No single DL architecture dominates — LSTM, PatchTST, TSMixer, and NLinear each win at least one case study. The primary takeaway is methodological: the cross-sectional feature engineering in Chapters 8-9, which constructs lagged, windowed, and differenced inputs, already encodes much of the temporal information that DL architectures would need to learn from raw sequences. When baseline features are strong, DL therefore adds complexity without adding signal.
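The claim that lag features "already encode" temporal structure is concrete: each row of a tabular design matrix carries lagged and windowed history, so a GBM sees much of what a sequence model would read from the raw series. A minimal illustration of that construction (illustrative lag and window choices, not the book's Chapter 8-9 pipeline):

```python
import numpy as np

def lag_features(y, lags=(1, 2, 5, 21), windows=(5, 21)):
    """Build lagged and rolling-window features aligned with a target.
    Each row packs recent history into a flat vector, handing temporal
    structure to a tabular model without a sequence architecture."""
    max_lag = max(max(lags), max(windows))
    cols = {}
    for l in lags:
        cols[f"lag_{l}"] = y[max_lag - l: len(y) - l]
    for w in windows:   # rolling means use only data strictly before t
        cols[f"roll_mean_{w}"] = np.array(
            [y[t - w:t].mean() for t in range(max_lag, len(y))])
    X = np.column_stack(list(cols.values()))
    target = y[max_lag:]
    return X, target

y = np.cumsum(np.random.default_rng(5).normal(size=300))  # toy price path
X, target = lag_features(y)        # (279, 6) design matrix, aligned target
```

The chapter's decision rule follows directly: if a sequence model cannot beat a tabular baseline trained on `X`, the temporal ordering beyond these engineered lags is not adding incremental signal.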