Chapter 13

Deep Learning for Time Series

9 sections · 18 notebooks · 28 references

Learning Objectives

  • Explain why recurrent sequence models became a computational and optimization bottleneck for long-context forecasting tasks
  • Compare the main temporal modeling philosophies — decomposition-based, attention-based, state-space, and strong linear baselines — and explain when each is most appropriate
  • Use strong baselines and diagnostics, including linear models and walk-forward evaluation, to judge whether sequence-model complexity is warranted
  • Distinguish the design logic of modern time-series Transformer variants, including PatchTST, iTransformer, and TFT, and relate those choices to multivariate structure, covariates, and forecast horizon
  • Decide when a financial prediction problem should be framed as direct panel regression with sequential inputs rather than multi-step time-series forecasting
  • Evaluate time-series foundation model adaptation modes for financial applications, including the implications of transfer mismatch and pretraining contamination
  • Apply practical uncertainty estimation methods, including MC Dropout and deep ensembles, to support risk-aware trading decisions
Figure 13.4
13.1

The Recurrent Paradigm and Its Discontents

The section examines LSTMs and GRUs as the historical default for time series forecasting, explaining the gating mechanism that addresses vanishing gradients but identifying two fundamental limitations that motivated the search for alternatives: the sequential computation bottleneck (O(T) dependency preventing temporal parallelism) and gradient flow degradation over very long sequences. It argues that financial data's long-memory properties — volatility clustering, persistent order flow, calendar effects spanning hundreds of trading days — push beyond what recurrent architectures can reliably capture through gradient-based learning.
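The O(T) sequential bottleneck is easiest to see in code: each hidden state depends on the previous one, so the time loop cannot be parallelized across steps. A minimal numpy sketch of a GRU-style cell (toy weights and dimensions, not the chapter's notebook code):

```python
import numpy as np

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: gating mitigates vanishing gradients,
    but h_t still depends on h_{t-1}, forcing a serial loop."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x + Uz @ h)                 # update gate
    r = sig(Wr @ x + Ur @ h)                 # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d, T = 8, 500                                # hidden size, sequence length
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
xs = rng.normal(size=(T, d))

h = np.zeros(d)
for x in xs:                                 # O(T) serial chain: no temporal parallelism
    h = gru_step(h, x, *params)
```

The loop is the point: unlike attention, which computes all pairwise interactions at once, the recurrence must visit the T steps in order, and gradients must flow back through the same chain.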

1 notebook

13.2

The Decomposition Philosophy: N-BEATS

N-BEATS encodes the inductive bias that time series are compositions of interpretable components (trend and seasonality) that neural networks should decompose explicitly rather than discover implicitly, using a hierarchical architecture of blocks, stacks, and doubly-residual connections that parallels boosting in tree-based methods. The section explains how basis expansion — predicting through learned coefficients weighting polynomial (trend) or Fourier (seasonality) basis functions — generalizes classical decomposition while enabling transparency rare among deep learning models. It covers the interpretable configuration (constrained trend and seasonality stacks) and extensions (N-BEATSx for exogenous variables, N-HiTS for multi-resolution forecasting), with practical caveats about lookback-to-horizon sensitivity and cases where financial regime changes violate both polynomial and Fourier assumptions.

1 notebook

13.3

The Attention Revolution for Time Series

The section covers the adaptation of the Transformer architecture from NLP to time series through three key modifications: patching (segmenting input windows into subsequences to reduce computational cost), positional encoding (injecting temporal order into the permutation-invariant attention mechanism), and encoder-decoder structure with the choice between iterative and direct multi-horizon forecasting. It presents the self-attention mechanism's ability to create direct connections between any two positions regardless of distance, and surveys early efficiency-focused variants (Informer, Autoformer, FEDformer) that shared the assumption of tokenizing time series like language — an assumption that Section 13.4 would challenge.
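Two of the three modifications are mechanical enough to sketch: patching reduces the number of attention tokens, and sinusoidal positional encoding injects order that attention alone would ignore. A minimal numpy illustration (toy window, standard Transformer PE formula):

```python
import numpy as np

def patchify(x, patch_len=16, stride=8):
    """Split a univariate window into overlapping patches (the 'tokens').
    Fewer tokens means cheaper attention: O((T/stride)^2) vs O(T^2)."""
    starts = range(0, len(x) - patch_len + 1, stride)
    return np.stack([x[s:s + patch_len] for s in starts])

def sinusoidal_pe(n_pos, d_model):
    """Sinusoidal positional encoding: injects temporal order into the
    otherwise permutation-invariant attention mechanism."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / (10000 ** (2 * (i // 2) / d_model))
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

x = np.sin(np.linspace(0, 10, 96))     # toy input window, T=96
patches = patchify(x)                  # 11 patches of length 16
tokens = patches + sinusoidal_pe(len(patches), 16)   # order-aware tokens
```

Without the positional term, shuffling the rows of `patches` would leave the attention computation unchanged, which is exactly the permutation-invariance that Section 13.4's critique targets.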

13.4

The Great Debate: When Simple Outperforms Complex

Zeng et al. (2022) demonstrated that single-layer linear models (Linear, D-Linear, N-Linear) outperformed Transformer architectures by 20-50% across all nine LTSF benchmark datasets, with diagnostic experiments showing that Transformers were largely ignoring temporal order — performance was remarkably unaffected when input sequences were randomly shuffled. The section explains the core critique that self-attention's permutation-invariance is a fundamental mismatch for time series where temporal ordering itself carries the predictive signal. While acknowledging counterpoints (multivariate settings, tuning sensitivity, scaling potential), it establishes that any new architecture must now beat LTSF-Linear baselines, shifting the field's goal from making Transformers bigger to making them smarter.
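The D-Linear baseline itself is small enough to sketch in full: decompose the lookback into trend (moving average) and remainder, apply one linear map per component from lookback length L to horizon H, and sum. The weights below are random stand-ins for trained parameters:

```python
import numpy as np

def moving_avg(x, kernel=25):
    """Trend extraction by moving average with edge padding, as in D-Linear."""
    pad = kernel // 2
    xp = np.pad(x, (pad, kernel - 1 - pad), mode='edge')
    return np.convolve(xp, np.ones(kernel) / kernel, mode='valid')

def dlinear_forecast(x, W_trend, W_seas):
    """D-Linear forward pass: one linear layer per decomposed component.
    Each W_* maps the length-L lookback directly to a length-H forecast."""
    trend = moving_avg(x)
    seasonal = x - trend
    return W_trend @ trend + W_seas @ seasonal

rng = np.random.default_rng(1)
L, H = 96, 24
x = np.cumsum(rng.normal(size=L))             # toy random-walk lookback
W_t = rng.normal(scale=1 / L, size=(H, L))    # stand-ins for trained weights
W_s = rng.normal(scale=1 / L, size=(H, L))
yhat = dlinear_forecast(x, W_t, W_s)
```

That this two-matrix model beat far larger Transformers is the substance of the critique; it also makes the baseline essentially free to run, which is why the section treats beating it as a minimum bar.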

1 notebook

13.5

The Transformer's Evolution

The section presents three post-critique Transformer architectures that address the temporal inductive bias problem through different design choices: PatchTST (treating patches as tokens with channel-independence as regularization and RevIN for non-stationarity), iTransformer (inverting the attention dimension so it operates across variables rather than time steps, sidestepping the temporal order critique entirely), and TFT (learned variable selection for covariate-rich settings with native multi-horizon quantile output). It provides a comparative summary showing how each architecture targets different forecasting scenarios, while noting that on pre-engineered features, IC differences between architectures are often modest — confirming that the marginal benefit of temporal modeling depends on how much signal the feature pipeline has already extracted.
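Of PatchTST's ingredients, RevIN is the most broadly reusable: normalize each input window by its own statistics so the backbone sees a stationarized series, then restore the statistics on the forecast. A minimal single-series sketch (the stand-in "forecast" is a naive last-value carry-forward, just to show the round trip):

```python
import numpy as np

class RevIN:
    """Reversible instance normalization (sketch): per-window statistics
    are removed before the model and restored on its output."""
    def __init__(self, eps=1e-5):
        self.eps = eps
    def normalize(self, x):
        self.mu, self.sigma = x.mean(), x.std() + self.eps
        return (x - self.mu) / self.sigma
    def denormalize(self, y):
        return y * self.sigma + self.mu

rev = RevIN()
x = 100 + np.cumsum(np.random.default_rng(2).normal(size=96))  # drifting level
x_norm = rev.normalize(x)               # what the backbone model sees
yhat_norm = np.full(24, x_norm[-1])     # stand-in forecast in normalized space
yhat = rev.denormalize(yhat_norm)       # restored to the original scale
```

Because the statistics are recomputed per window, a level shift between training and inference windows is absorbed by the normalization rather than forcing the model to extrapolate, which is why RevIN helps under the non-stationarity typical of financial series.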

1 notebook

13.6

The Full Toolkit: Alternative Architectures and Foundation Models

The section surveys TCNs, TSMixer, CNN-based approaches, hybrid statistical-neural models, and state space models (Mamba), positioning SSMs as particularly promising for very long sequences (>10K steps) due to their linear O(L) complexity, with selective state spaces that filter noise contextually. The extensive treatment of time series foundation models covers the evolution from first-generation univariate-only models to second-generation multivariate systems (Chronos-2, Moirai-MoE), four adaptation modes (zero-shot, in-context learning, PEFT, model selection/ensembling), and the efficiency frontier challenging the scaling paradigm. Critically, the evidence on the finance transfer gap shows that off-the-shelf TSFMs underperform tree-based ensembles on return prediction, though they show genuine promise for volatility and VaR forecasting, where target structure is more transferable.
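The O(L) claim for SSMs follows from their form: a discrete linear recurrence whose per-step cost is constant, so a 10K-step sequence costs 10K steps (and, for fixed dynamics, the same computation can be re-expressed as a convolution and parallelized). A toy numpy sketch of the recurrence, with made-up stable dynamics and no selectivity:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Discrete linear state-space recurrence:
        x_t = A x_{t-1} + B u_t,   y_t = C x_t
    Cost is O(L) in sequence length with a constant-size state."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(3)
n, L = 4, 10_000
A = 0.9 * np.eye(n)                # toy stable dynamics (|eigenvalues| < 1)
B = rng.normal(size=n)
C = rng.normal(size=n)
u = rng.normal(size=L)
y = ssm_scan(A, B, C, u)           # a 10K-step sequence in linear time
```

Selective SSMs like Mamba make A and B input-dependent, which is what lets the state "filter noise contextually" rather than applying the same dynamics to every step; that extension is beyond this sketch.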

5 notebooks

13.7

A Practitioner's Framework

This section synthesizes the chapter's evidence into a three-step model selection process: establish strong baselines (seasonal naive through GBMs, with sample-size thresholds and foundation model tiers), diagnose the problem along five axes (univariate vs multivariate, known vs unknown relationships, forecast horizon, interpretability needs, sequence length), and apply a selection matrix matching problem characteristics to recommended architectures. The cross-dataset evidence shows the split is roughly even between tabular and DL models across eight case studies, with the practical decision rule being to test whether temporal ordering adds incremental signal beyond lag-feature engineering. The section also surveys the library landscape (Darts, NeuralForecast, PyTorch Forecasting, GluonTS, sktime) with guidance on when each is appropriate.
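The walk-forward evaluation underlying step one is simple to get wrong (leaking future data into training) and simple to get right. A minimal expanding-window splitter, written here as an illustration rather than taken from any of the listed libraries:

```python
import numpy as np

def walk_forward_splits(n, train_size, test_size, step=None):
    """Expanding-window walk-forward splits: the training window always
    ends strictly before the test window begins, so no future information
    leaks into model fitting."""
    step = step or test_size
    start = train_size
    while start + test_size <= n:
        yield np.arange(0, start), np.arange(start, start + test_size)
        start += step

y = np.arange(100, dtype=float)        # toy target series
splits = list(walk_forward_splits(len(y), train_size=60, test_size=10))
for train_idx, test_idx in splits:
    assert train_idx.max() < test_idx.min()   # temporal ordering preserved
```

Running every candidate, from seasonal naive up through GBMs and sequence models, over the same splits is what makes the "does complexity pay?" comparison in the selection matrix fair.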

2 notebooks

13.8

Quantifying Prediction Uncertainty

The section develops two practical approaches for uncertainty estimation from deep learning models: MC Dropout (keeping dropout active at inference and running multiple forward passes to approximate Bayesian posterior sampling) and Deep Ensembles (training models with different initializations and using their disagreement as an uncertainty measure with formal epistemic-aleatoric decomposition). It shows that MC Dropout uncertainty correlates meaningfully with actual prediction errors, making it practically useful for filtering unreliable forecasts, while deep ensembles of five members often provide the best tradeoff among accuracy, calibration, and compute cost. The section notes that foundation models exhibit systematic miscalibration under financial regime shifts, recommending conformal calibration from Chapter 11 as a distribution-free correction.
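Both ideas reduce to "many stochastic predictions, then look at the spread." A toy numpy sketch using a linear model in place of a trained network (random weights, inverted-dropout scaling; the ensemble perturbation is a stand-in for independently trained models):

```python
import numpy as np

rng = np.random.default_rng(4)

def predict_with_dropout(x, W, p=0.5, n_samples=100):
    """MC Dropout sketch: keep dropout active at inference and treat the
    spread across stochastic forward passes as predictive uncertainty."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(W.shape) > p           # random unit dropout
        preds.append(((W * mask) / (1 - p)) @ x) # inverted-dropout rescaling
    preds = np.array(preds)
    return preds.mean(), preds.std()             # point forecast, uncertainty

W = rng.normal(size=(1, 16))                     # stand-in for trained weights
x = rng.normal(size=16)
mu, sigma = predict_with_dropout(x, W)

# Deep-ensemble analogue: disagreement across independently trained models
# (here faked by perturbing W; in practice, retrain from different seeds)
ensemble_preds = np.array([((W + rng.normal(scale=0.1, size=W.shape)) @ x)[0]
                           for _ in range(5)])
epistemic = ensemble_preds.std()                 # between-model disagreement
```

The filtering rule the section describes is then a threshold on `sigma` (or `epistemic`): trade only the forecasts whose uncertainty falls below it, and hand the intervals to conformal calibration when regime shifts make the raw spread miscalibrated.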

1 notebook

13.9

Cross-Dataset Insights

Aggregating walk-forward results across eight case studies against Ridge and GBM baselines produces a sobering headline: DL rarely outperforms strong tabular baselines. The clearest DL-positive case is crypto perpetuals funding (LSTM IC +0.030 vs GBM +0.023), where the 8-hourly frequency generates temporal structure not fully captured by cross-sectional features. No single DL architecture dominates — LSTM, PatchTST, TSMixer, and NLinear each win at least one case study. The primary takeaway is methodological: the cross-sectional feature engineering in Chapters 8-9, which constructs lagged, windowed, and differenced inputs, already encodes much of the temporal information that DL architectures would need to learn from raw sequences. When baseline features are strong, DL therefore adds complexity without adding signal.
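The claim that lag features "already encode" temporal structure is concrete: each row of a tabular design matrix carries lagged and windowed history, so a GBM sees much of what a sequence model would read from the raw series. A minimal illustration of that construction (illustrative lag and window choices, not the book's Chapter 8-9 pipeline):

```python
import numpy as np

def lag_features(y, lags=(1, 2, 5, 21), windows=(5, 21)):
    """Build lagged and rolling-window features aligned with a target.
    Each row packs recent history into a flat vector, handing temporal
    structure to a tabular model without a sequence architecture."""
    max_lag = max(max(lags), max(windows))
    cols = {}
    for l in lags:
        cols[f"lag_{l}"] = y[max_lag - l: len(y) - l]
    for w in windows:   # rolling means use only data strictly before t
        cols[f"roll_mean_{w}"] = np.array(
            [y[t - w:t].mean() for t in range(max_lag, len(y))])
    X = np.column_stack(list(cols.values()))
    target = y[max_lag:]
    return X, target

y = np.cumsum(np.random.default_rng(5).normal(size=300))  # toy price path
X, target = lag_features(y)        # (279, 6) design matrix, aligned target
```

The chapter's decision rule follows directly: if a sequence model cannot beat a tabular baseline trained on `X`, the temporal ordering beyond these engineered lags is not adding incremental signal.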