Chapter 13: Deep Learning for Time Series

Uncertainty Estimation and Calibration for Deep Time-Series Models

A forecasting model is not uncertainty-aware because it emits a variance. It is uncertainty-aware only if that variance tracks future error under the validation protocol you actually trade.

The Intuition

Chapter 13 connects deep forecasts to position sizing and risk control. That only works if the uncertainty estimate means something operational.

Three distinctions matter.

  • A point forecast can be accurate on average yet dangerous if its large errors cluster in stressed regimes.
  • A wide predictive interval can look conservative while still being badly calibrated if realized errors exceed it too often.
  • Model uncertainty and data uncertainty are different objects, and most practical estimators mix them only approximately.

The goal is not "Bayesian purity." The goal is to know whether the model becomes less trustworthy exactly when the environment becomes harder to forecast.

Two Kinds of Uncertainty

Write the target as

$$ y_t = f(x_t) + \varepsilon_t. $$

The decomposition readers usually need is:

  • epistemic uncertainty: uncertainty about the function \(f\), often high when data are sparse, regimes shift, or the model class is unstable
  • aleatoric uncertainty: irreducible noise in \(\varepsilon_t\), often high when the target is intrinsically volatile even if the model is well specified

In trading terms:

  • epistemic uncertainty rises when the model is extrapolating, such as a crisis regime that was rare in training
  • aleatoric uncertainty rises when even a correct model faces a noisy target, such as short-horizon returns around macro announcements

This distinction matters because the actions differ. Epistemic uncertainty argues for smaller bets, more caution, or model fallback. Aleatoric uncertainty argues for wider risk buffers even if the model is still learning the right relationship.

MC Dropout as Approximate Model Uncertainty

Dropout during training randomly masks units to regularize the network. MC Dropout keeps dropout active at inference and treats repeated forward passes as draws from an approximate posterior over network functions.

If \(\hat{y}_t^{(m)}\) is the prediction from stochastic pass \(m=1,\dots,M\), then

$$ \bar{y}_t = \frac{1}{M}\sum_{m=1}^M \hat{y}_t^{(m)} $$

is the predictive mean and

$$ \widehat{\mathrm{Var}}_{\text{MC}}(y_t \mid x_t) = \frac{1}{M}\sum_{m=1}^M \left(\hat{y}_t^{(m)} - \bar{y}_t\right)^2 $$

is the Monte Carlo variance.
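The mean and MC variance above can be sketched in a few lines. This is a minimal NumPy simulation, not a full training pipeline: the "trained" one-layer network and its weights are illustrative stand-ins, and the key point is that the dropout mask stays active at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "trained" one-layer network: y_hat = relu(x @ W1) @ w2 (weights illustrative).
W1 = rng.normal(size=(8, 32))
w2 = rng.normal(size=32) / 32

def stochastic_forward(x, p_drop=0.2):
    """One forward pass with dropout kept active at inference."""
    h = np.maximum(x @ W1, 0.0)
    mask = rng.random(h.shape) > p_drop   # fresh Bernoulli dropout mask per pass
    h = h * mask / (1.0 - p_drop)         # inverted-dropout scaling
    return h @ w2

def mc_dropout_predict(x, M=200):
    """Predictive mean and MC variance over M stochastic forward passes."""
    draws = np.array([stochastic_forward(x) for _ in range(M)])
    return draws.mean(axis=0), draws.var(axis=0)

x = rng.normal(size=(5, 8))               # 5 feature vectors
mean, mc_var = mc_dropout_predict(x)
```

The same pattern applies to any framework: keep the dropout layers in "train" mode at prediction time and aggregate across repeated passes.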

This variance should be read as disagreement across stochastic forward passes. It is a practical proxy for model uncertainty, not a full predictive-variance decomposition.

This is useful because it is cheap and easy to retrofit. But it has limits:

  • the posterior approximation is crude
  • the result depends strongly on dropout rate and architecture
  • low MC variance does not prove the model is well specified

So MC Dropout should be treated as a practical diagnostic for predictive instability, not as a proof that you have solved Bayesian inference.

Deep Ensembles

Deep ensembles train several models with different initializations, minibatch orderings, or data subsamples. Let model \(k\) output mean \(\mu_t^{(k)}\) and, if available, predictive variance \((\sigma_t^{2})^{(k)}\). Then the ensemble mean is

$$ \bar{\mu}_t = \frac{1}{K}\sum_{k=1}^K \mu_t^{(k)} $$

and the predictive variance can be decomposed as

$$ \widehat{\mathrm{Var}}(y_t \mid x_t) = \underbrace{\frac{1}{K}\sum_{k=1}^K (\sigma_t^2)^{(k)}}_{\text{aleatoric part}} + \underbrace{\frac{1}{K}\sum_{k=1}^K \left(\mu_t^{(k)} - \bar{\mu}_t\right)^2}_{\text{epistemic part}}. $$

This decomposition is not exact in a philosophical sense, but it is operationally useful. The first term says how noisy each model thinks the target is. The second says how much trained models disagree about the mean. If the ensemble members output only point forecasts rather than predictive variances, the aleatoric term is absent and the disagreement term is often used as a rough total-uncertainty proxy.
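The two-term decomposition above is a direct computation once each member reports a mean and a variance. A minimal sketch, with illustrative numbers for \(K=3\) models over two time steps:

```python
import numpy as np

def ensemble_uncertainty(mus, sigma2s):
    """Combine K per-model means and predictive variances.

    mus, sigma2s: arrays of shape (K, T) holding each ensemble member's
    predicted mean and variance at each time step.
    """
    mu_bar = mus.mean(axis=0)                       # ensemble mean
    aleatoric = sigma2s.mean(axis=0)                # average per-model noise
    epistemic = ((mus - mu_bar) ** 2).mean(axis=0)  # disagreement of means
    return mu_bar, aleatoric, epistemic

# Illustrative values: K=3 members, T=2 time steps.
mus = np.array([[1.0, 2.0],
                [1.2, 2.4],
                [0.8, 1.6]])
sigma2s = np.array([[0.10, 0.20],
                    [0.12, 0.25],
                    [0.08, 0.15]])
mu_bar, alea, epis = ensemble_uncertainty(mus, sigma2s)
total_var = alea + epis                             # predictive variance
```

If members emit only point forecasts, drop the `sigma2s` argument and use the `epis` term alone as the rough total-uncertainty proxy described above.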

In practice, ensembles are often better calibrated than a single network with MC Dropout, at the cost of more training time.

Calibration Is a Separate Problem

A model can rank uncertainty correctly and still be miscalibrated in level. That is why raw uncertainty outputs should be evaluated against realized forecast errors.

For point forecasts, a simple diagnostic is whether predicted standard deviation tracks realized absolute error. If \(\hat{\sigma}_t\) is the model's uncertainty score, sort predictions into deciles by \(\hat{\sigma}_t\) and check whether higher-decile bins actually contain larger realized errors.
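The decile diagnostic is simple to implement. A sketch on synthetic data, where the uncertainty score is a noisy version of the true noise level, so a useful ranking should produce rising realized error across bins:

```python
import numpy as np

def error_by_uncertainty_decile(sigma_hat, abs_err, n_bins=10):
    """Mean realized |error| within each predicted-uncertainty bin."""
    order = np.argsort(sigma_hat)            # sort by predicted sigma
    bins = np.array_split(order, n_bins)     # roughly equal-size decile bins
    return np.array([abs_err[b].mean() for b in bins])

# Synthetic check: realized errors scale with a true sigma the score tracks.
rng = np.random.default_rng(1)
true_sigma = rng.uniform(0.5, 3.0, size=5000)
sigma_hat = true_sigma + rng.normal(0, 0.1, size=5000)  # noisy score
abs_err = np.abs(rng.normal(0, true_sigma))
decile_err = error_by_uncertainty_decile(sigma_hat, abs_err)
# A useful ranking shows decile_err increasing from the low to the high bins.
```

A flat profile across bins means the score has no ranking power, regardless of how well its level is calibrated.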

For interval forecasts, if the model reports

$$ [\hat{y}_t^{L}, \hat{y}_t^{U}], $$

then a nominal 90% interval should contain the realized \(y_t\) about 90% of the time under a proper walk-forward evaluation, not just in-sample.
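Empirical coverage is just the hit rate of realized values inside the predicted bounds. A sketch assuming Gaussian intervals \(\mu_t \pm 1.645\,\sigma_t\) for the nominal 90% level, checked on synthetic data where the model is well specified:

```python
import numpy as np

def interval_coverage(y, lower, upper):
    """Fraction of realized values falling inside the predicted interval."""
    return np.mean((y >= lower) & (y <= upper))

# A nominal 90% Gaussian interval is mu +/- 1.645 * sigma.
rng = np.random.default_rng(2)
mu = rng.normal(size=20000)
sigma = np.full(20000, 1.0)
y = mu + rng.normal(0, 1.0, size=20000)   # well-specified noise
z = 1.645
cov = interval_coverage(y, mu - z * sigma, mu + z * sigma)
# cov should land near 0.90 when the model is calibrated
```

In a real evaluation, `y`, `mu`, and `sigma` would come from walk-forward out-of-sample forecasts, and coverage should be computed separately for calm and stressed periods.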

Two practical scoring rules matter:

  • negative log-likelihood if the model outputs a full predictive distribution
  • CRPS if you want a proper scoring rule that rewards both calibration and sharpness

Sharper intervals are not automatically better. A too-narrow interval that misses often is worse than a wider one that is honestly calibrated.

A Worked Trading Scenario

Suppose you forecast next-day realized volatility for an options overlay.

You train:

  • one network with MC Dropout
  • one 5-member deep ensemble

Both produce similar mean-squared error on the point forecast. But now look at uncertainty deciles.

Bad interpretation

You use the raw variance output from the single model as if it were a risk forecast. Position sizes shrink only slightly in the top uncertainty decile, because the variance estimates are too stable.

Better interpretation

You evaluate:

  1. coverage of 80% and 95% predictive intervals
  2. realized absolute error by uncertainty decile
  3. performance stability in calm versus stressed periods

You find these illustrative values:

Method              80% coverage, calm weeks    80% coverage, stressed weeks
MC Dropout          0.79                        0.61
5-model ensemble    0.81                        0.72

  • MC Dropout uncertainty rises modestly in volatile weeks but underreacts in crisis windows
  • the ensemble disagreement term spikes strongly around regime breaks
  • both methods are overconfident before recalibration

Now the uncertainty output becomes useful. The ensemble variance is not just a decorative extra column. It identifies when the model is extrapolating and when forecast-driven sizing should back off.

What Good Validation Looks Like

For Chapter 13, a good uncertainty evaluation block should answer four questions:

  1. Does higher predicted uncertainty correspond to larger realized error?
  2. Are nominal intervals close to empirical coverage under walk-forward testing?
  3. Does uncertainty rise in distribution shift or regime-break periods?
  4. Does using uncertainty improve downstream trading decisions, not just forecast diagnostics?

That fourth question is the real one. If uncertainty-aware sizing does not reduce drawdowns, stabilize leverage, or improve risk-adjusted performance, then the uncertainty estimate may be statistically interesting but economically unhelpful.

In Practice

Use these rules:

  • evaluate uncertainty under the same walk-forward protocol as the point forecast
  • separate ranking quality from calibration quality
  • compare MC Dropout against a small ensemble rather than assuming one method dominates
  • use enough stochastic passes that the disagreement estimate is numerically stable
  • recalibrate if coverage is persistently wrong
  • inspect uncertainty by regime, not only on the full-sample average
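The recalibration rule above can be implemented with a single conformal-style scale factor fitted on a held-out calibration window. This sketch assumes the model emits Gaussian-style \(\hat{\sigma}_t\) and rescales it so the nominal 80% interval attains its target coverage; the function name and setup are illustrative:

```python
import numpy as np

def calibration_scale(y, mu, sigma, nominal=0.80):
    """Scale factor s so intervals mu +/- z * s * sigma hit nominal coverage.

    Conformal-style recipe: take the nominal-level quantile of standardized
    absolute residuals on a held-out calibration window, then divide by the
    Gaussian z-value the model's intervals implicitly assumed.
    """
    z_nominal = 1.2816                  # Gaussian z for 80% two-sided coverage
    r = np.abs(y - mu) / sigma          # standardized absolute residuals
    q = np.quantile(r, nominal)         # empirical 80% quantile
    return q / z_nominal                # s > 1 means the model was overconfident

rng = np.random.default_rng(3)
mu = np.zeros(10000)
sigma = np.full(10000, 0.5)             # model reports half the true sigma
y = rng.normal(0, 1.0, size=10000)      # true noise sd is 1.0
s = calibration_scale(y, mu, sigma)     # expect s near 2.0
```

Because the scale is fitted on past data, it should itself be refitted inside the walk-forward loop, and checked by regime, for the same reasons given above.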

Common Mistakes

  • Treating raw dropout variance as a trustworthy risk forecast without backtesting calibration.
  • Treating MC Dropout variance as purely epistemic when it can mix model disagreement with learned noise structure.
  • Judging uncertainty only by negative log-likelihood and never checking empirical coverage.
  • Using random cross-validation to validate calibration for a time-dependent forecasting task.
  • Assuming uncertainty-aware models are useful even if downstream position sizing never changes.

Connections

This primer supports Chapter 13's treatment of MC Dropout, deep ensembles, and risk-aware forecasting. It connects directly to conformal prediction in Chapter 11, risk forecasting in Chapter 19, and the broader rule that probabilistic outputs are only valuable when they remain calibrated under time-aware validation.
