Chapter 13: Deep Learning for Time Series

Making Transformers Time-Aware

A vanilla Transformer is good at flexible token interaction. Time-series forecasting needs more than that: it needs temporal and structural inductive bias.

The Intuition

The Chapter 13 critique is not that Transformers are useless for time series. It is that plain self-attention is too generic.

In language, that generality is a strength. In time series, it creates a problem:

  • order is the structure
  • horizon matters
  • covariates arrive on different clocks
  • channels can be correlated or spuriously entangled

That is why successful time-series Transformer variants are really answers to one question:

how do we inject time awareness into an architecture that does not get it for free?

Why Vanilla Attention Is Not Enough

Self-attention over time-step tokens lets every time step attend to every other one. That provides long-range interaction, but it does not by itself encode:

  • local continuity
  • recentness bias
  • seasonal scale structure
  • variable-specific roles

Positional encodings help, but they do not fully solve the problem. They tell the model where a token sits; they do not give it the stronger inductive biases that many forecasting tasks need.
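
As a minimal sketch of what positional encodings do and do not provide, here is the standard sinusoidal scheme in NumPy (illustrative only; dimensions are arbitrary). It marks where a token sits, but nothing in it expresses recentness bias or seasonal structure:

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Standard sinusoidal positional encoding: tells attention WHERE a
    token sits, but encodes no recentness bias or seasonal structure."""
    pos = np.arange(T)[:, None]              # (T, 1) token positions
    i = np.arange(d // 2)[None, :]           # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe = sinusoidal_pe(T=96, d=32)
print(pe.shape)   # (96, 32)
```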

The Main Repair Strategies

Modern time-series Transformers usually modify one or more of these:

  1. tokenization and local structure
  2. channel structure
  3. variable-versus-time token structure
  4. time-aware covariate design
  5. causal forecasting design

Each change is a way of saying: "do not make the model rediscover obvious temporal structure from scratch."

Strategy 1: Patching

PatchTST-style models group adjacent observations into short patches before attention.

Instead of tokens being individual time steps, they become local segments.

Why this helps:

  • fewer tokens, so attention is cheaper
  • each token already contains short-range local structure
  • the model focuses on interactions among segments rather than raw pointwise noise

For a lookback window of length \(L\), patch length \(p\), and stride \(s\), the token count is:

$$ N_{\text{tokens}} = \left\lfloor \frac{L-p}{s} \right\rfloor + 1. $$

With the common non-overlapping default \(s=p\), this is about \(L/p\) tokens. That is a computational change and an inductive-bias change at the same time.
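
The token-count formula above can be checked directly with a small patching helper (a NumPy sketch with illustrative values, not any library's implementation):

```python
import numpy as np

def patchify(x, p, s):
    """Split a 1-D series of length L into patches of length p with stride s.

    Token count follows floor((L - p) / s) + 1.
    """
    L = len(x)
    n_tokens = (L - p) // s + 1
    return np.stack([x[i * s : i * s + p] for i in range(n_tokens)])

x = np.arange(96)                  # lookback window L = 96
tokens = patchify(x, p=16, s=16)   # non-overlapping default: s = p
print(tokens.shape)                # (6, 16): roughly L/p tokens
```

Attention over 6 patch tokens instead of 96 time-step tokens is both the computational saving and the inductive-bias change described above.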

Strategy 2: Channel Independence

One recurring problem in multivariate forecasting is that attention can overfit noisy cross-series correlations.

PatchTST's channel-independent design addresses this by:

  • applying a shared encoder to each channel
  • letting each channel keep its own local temporal representation
  • largely avoiding heavy cross-channel attention in the encoder itself

This is a form of regularization. It says:

shared temporal patterns are useful, but each variable should earn its own representation first.

That can be especially helpful in finance, where a few strong factors coexist with many unstable cross-series relationships.
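
In miniature, channel independence means one set of encoder weights applied to each channel separately, with no cross-channel mixing in the encoder. A hedged NumPy sketch (shapes are illustrative; a real encoder would be nonlinear):

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_independent_encode(x, W):
    """Apply one SHARED weight matrix to every channel separately.

    x has shape (channels, lookback). Each channel keeps its own
    representation, but the encoder parameters W are shared: the
    PatchTST-style channel-independence idea in miniature.
    """
    return x @ W   # each channel row is encoded without seeing the others

x = rng.normal(size=(5, 96))    # 5 channels, lookback 96
W = rng.normal(size=(96, 32))   # shared encoder weights
z = channel_independent_encode(x, W)
print(z.shape)                  # (5, 32): one embedding per channel
```

Because no row of `x` influences another row of `z`, noisy cross-series correlations simply cannot leak into the encoder representation.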

Strategy 3: Variate-as-Token Attention

iTransformer flips the usual setup. Instead of treating time steps as tokens, it treats variables as tokens and lets attention operate across variables.

That sounds strange at first, but it targets a real problem. In many multivariate panels, the hard part is not only temporal extrapolation. It is learning which variables matter for each other.

So the design becomes:

  • temporal information is first encoded within each variable stream
  • attention models inter-variable relationships

This usually means each variable's recent history is compressed into an embedding before attention operates across variables. The point is not a full per-variable temporal Transformer stage; it is a representation inversion that changes where attention is spent.

This can work well when cross-variable structure is more stable than long-range time-step-to-time-step attention.
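
A minimal NumPy sketch of the inverted setup (single head, illustrative dimensions; not iTransformer's actual implementation): each variable's history is compressed into one token, and attention runs across variables rather than time steps.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
V, L, d = 4, 96, 16                 # variables, lookback, embedding dim

history = rng.normal(size=(V, L))   # each variable's recent history
W_embed = rng.normal(size=(L, d))   # compress history -> one token per variable
tokens = history @ W_embed          # (V, d): variates are the tokens

# single-head self-attention across VARIABLES, not time steps
scores = tokens @ tokens.T / np.sqrt(d)   # (V, V) inter-variable affinities
attn = softmax(scores, axis=-1)
out = attn @ tokens                 # (V, d)
print(out.shape)                    # (4, 16)
```

The attention matrix here is V-by-V: capacity is spent on which variables matter for each other, not on pointwise time-step interactions.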

Strategy 4: Time-Aware Covariate Design

Some architectures, such as the Temporal Fusion Transformer (TFT), are really about covariates as much as about attention.

They distinguish:

  • static covariates
  • known future covariates
  • observed historical covariates

That distinction matters because forecasting is partly a timing problem. A covariate that is known at prediction time is different from one only observed after the fact.

TFT matters not only because it labels covariates differently, but because variable-selection and gating components actually use those distinctions inside the architecture.

The time-aware lesson is not "use TFT." It is:

the architecture should reflect what information arrives when.
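
A hedged sketch of that lesson as code (the column names and groupings are illustrative assumptions, not TFT's actual API): covariates are typed by arrival timing, and only known-future covariates may appear in the decoder block.

```python
import numpy as np

STATIC   = ["sector"]                       # fixed per series
KNOWN    = ["day_of_week", "holiday_flag"]  # known for future dates too
OBSERVED = ["realized_vol", "volume"]       # only available after the fact

def build_inputs(past, future):
    """past/future map covariate name -> array. Observed covariates must
    not appear in the future block: they are unknown at prediction time."""
    leaked = [k for k in future if k in OBSERVED]
    if leaked:
        raise ValueError(f"leakage: {leaked} are observed, not known-future")
    enc = {k: past[k] for k in KNOWN + OBSERVED}   # full history is usable
    dec = {k: future[k] for k in KNOWN}            # future: known covariates only
    return enc, dec

past = {k: np.zeros(30) for k in KNOWN + OBSERVED}
future = {k: np.zeros(5) for k in KNOWN}
enc, dec = build_inputs(past, future)
print(sorted(dec))   # ['day_of_week', 'holiday_flag']
```

Making the split explicit in input construction is cheap insurance: a covariate in the wrong block fails loudly instead of silently leaking.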

Strategy 5: Causal Forecasting Design

Time awareness in forecasting is not only about tokenization. It is also about respecting the fact that future targets are unknown.

Common tools:

  • causal masking in decoder-style attention
  • horizon-specific output heads
  • direct multi-horizon prediction setups
  • encoder-style designs with leakage-safe inputs

Not all forecasting Transformers are autoregressive decoders. Many remain encoder-style forecasters and enforce time safety through input construction and horizon-specific outputs rather than token-by-token masking. These choices determine whether the model is solving a proper forecasting problem or quietly peeking through a badly specified sequence setup.
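
The first tool in that list, causal masking, is easy to see in a NumPy sketch (single head, uniform scores for clarity; illustrative, not any library's API):

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular mask: position t may attend only to positions <= t."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_attention(scores, mask):
    scores = np.where(mask, scores, -np.inf)   # future positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                         # exp(-inf) -> 0 weight
    return e / e.sum(axis=-1, keepdims=True)

T = 4
w = masked_attention(np.zeros((T, T)), causal_mask(T))
print(np.round(w, 2))
# row t spreads weight uniformly over the first t+1 positions;
# no weight ever flows from the future
```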

A Worked ETF Panel Example

Imagine forecasting next-week returns for a panel of sector ETFs using:

  • past returns
  • realized volatility
  • macro covariates
  • calendar features

Three tokenizations:

  1. Vanilla Transformer tokens are time steps containing all channels
  2. PatchTST tokens are short time patches within each channel
  3. iTransformer tokens are variables, with temporal information encoded inside each stream

Interpretation:

  • vanilla attention may overfit noisy time-step interactions
  • PatchTST may work better if local temporal motifs matter
  • iTransformer may work better if the key signal lies in cross-asset relationships

The architecture choice is therefore a hypothesis about where the structure lives.
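
The three tokenizations above differ purely in how the same panel is reshaped before attention. A NumPy sketch with illustrative dimensions (9 sector ETFs, 52 weeks of lookback; the patch length and embedding size are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
C, L, p = 9, 52, 4        # channels (ETFs), lookback (weeks), patch length

panel = rng.normal(size=(C, L))

# 1) vanilla: one token per time step, carrying all channels
vanilla_tokens = panel.T                     # (L, C) = (52, 9)

# 2) PatchTST-style: non-overlapping patches within each channel
patch_tokens = panel.reshape(C, L // p, p)   # (C, L/p, p) = (9, 13, 4)

# 3) iTransformer-style: one token per variable
W = rng.normal(size=(L, 16))
variate_tokens = panel @ W                   # (C, d) = (9, 16)

print(vanilla_tokens.shape, patch_tokens.shape, variate_tokens.shape)
```

Each reshape commits the model to a different hypothesis about where the exploitable structure lives.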

This is also where the DLinear/NLinear critique matters: if a strong linear baseline already wins, the burden is on PatchTST, iTransformer, or TFT to show that their time-aware structure adds something real.
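
That baseline is cheap to run. An NLinear-style forecaster, sketched in NumPy (untrained weights here, shapes illustrative): subtract the last observation, apply one linear map over the lookback, add it back.

```python
import numpy as np

def nlinear_forecast(x, W, b):
    """NLinear-style baseline: normalize by the last observation,
    apply a single linear layer over the lookback, then de-normalize."""
    last = x[..., -1:]               # anchor each series at its last value
    return (x - last) @ W + b + last

rng = np.random.default_rng(0)
L, H = 96, 8                         # lookback, forecast horizon
W = rng.normal(size=(L, H)) * 0.01   # untrained weights, illustrative only
b = np.zeros(H)
x = rng.normal(size=(5, L))          # 5 series
yhat = nlinear_forecast(x, W, b)
print(yhat.shape)                    # (5, 8)
```

If a model this simple is competitive on your panel, the time-aware Transformer machinery has not yet justified itself.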

Patching, Lookback, and Overfitting

These models are sensitive to design choices:

  • patch size
  • stride
  • lookback length
  • positional encoding scheme
  • forecast horizon

Too small a patch and the model behaves almost like noisy pointwise attention. Too large a patch and you smear out useful local structure.

Likewise, very long lookbacks can make the architecture look sophisticated while mostly feeding it stale information.

What "Time-Aware" Should Mean in Practice

A time-aware Transformer design should do at least one of these explicitly:

  • encode local temporal structure before global interaction
  • distinguish variable structure from time structure
  • represent the arrival timing of covariates correctly
  • constrain complexity so the model cannot win only by memorizing benchmark quirks

That is why Chapter 13 treats these models as architectural responses to a valid critique rather than as isolated innovations. Real financial timing issues such as session boundaries, missing bars, and macro-release calendars need to be represented more explicitly than positional encoding alone.

In Practice

Good practical habits:

  • benchmark against strong linear and decomposition baselines
  • treat tokenization as a modeling decision, not an implementation detail
  • stress-test lookback sensitivity
  • verify covariate timing carefully
  • prefer the simplest architecture that captures the relevant structure

If a fancier Transformer only wins on one benchmark setting and loses to simple baselines elsewhere, the time awareness is probably not doing what you hoped.

Common Mistakes

  • Treating positional encoding as if it fully solves temporal structure.
  • Choosing patch size or lookback length by aesthetic preference.
  • Ignoring whether cross-channel attention is actually stable enough to learn.
  • Mixing known-future and observed-late covariates carelessly.
  • Comparing architectures without a serious linear baseline.

Connections

This primer supports Chapter 13's Transformer-evolution material and connects directly to the self-attention primer in Chapter 10, long-document tokenization logic, strong linear baselines, and the broader forecasting debate over when deep architectures actually add value beyond simpler models.
