State-Space Models and the Kalman Filter
Kalman filter outputs are widely used as trading features. This primer covers the deeper machinery underneath them: the innovation representation, Riccati recursion, and identification choices that determine what those features actually mean.
The Intuition
A rolling average only sees a window of observations. It does not have an explicit idea of what the hidden process is doing, and it treats the entire window through a fixed rule. A state-space model starts from a different question: what unobserved quantity is generating the data, and how should our belief about that quantity change as new observations arrive?
That hidden quantity might be a latent trend, a smooth level, or a time-varying hedge ratio. The observed series is noisy. Prices jump for idiosyncratic reasons, spreads widen, and measurements arrive with error. The model therefore separates two objects:
- the hidden state we care about,
- the noisy measurement we actually observe.
The Kalman filter is the recursive algorithm that keeps updating the hidden state estimate over time. It first predicts where the state should be if the previous estimate were correct, then updates that prediction after seeing the new observation. The update is not all-or-nothing. If the observation looks noisy, the filter trusts its prior estimate more. If the observation looks informative, the filter moves more aggressively toward the new data.
In practice, feature pipelines extract filtered states, innovations, and uncertainty estimates from the Kalman recursion. The technical layer a skeptical quant will care about lies underneath: why the gain takes the value it does, how the Riccati recursion governs responsiveness, and when the latent state interpretation is actually identified rather than just numerically convenient.
The Math
In a linear Gaussian state-space model, the hidden state $x_t$ evolves as
$$ x_t = F x_{t-1} + w_t, $$
and the observed data $y_t$ are generated by
$$ y_t = H x_t + v_t, $$
where:
- $x_t$ is the latent state,
- $y_t$ is the observed series,
- $F$ is the state-transition matrix,
- $H$ maps the state to the observation,
- $w_t \sim \mathcal{N}(0, Q)$ is process noise,
- $v_t \sim \mathcal{N}(0, R)$ is observation noise.
The two covariance matrices do most of the conceptual work:
- $Q$ controls how much the hidden state is allowed to move between periods,
- $R$ controls how noisy the measurement is assumed to be.
Large $Q$ means the state itself can change quickly, so the filter should adapt faster. Large $R$ means the observations are unreliable, so the filter should smooth more aggressively.
The Kalman filter maintains the posterior mean and covariance of the state. Let $\hat x_{t|t-1}$ denote the predicted state before seeing $y_t$, and let $\hat x_{t|t}$ denote the updated state after incorporating $y_t$. The recursion is:
$$ \hat x_{t|t-1} = F \hat x_{t-1|t-1}, $$
$$ P_{t|t-1} = F P_{t-1|t-1} F^\top + Q, $$
$$ \nu_t = y_t - H \hat x_{t|t-1}, $$
$$ S_t = H P_{t|t-1} H^\top + R, $$
$$ K_t = P_{t|t-1} H^\top S_t^{-1}, $$
$$ \hat x_{t|t} = \hat x_{t|t-1} + K_t \nu_t, $$
$$ P_{t|t} = (I - K_t H) P_{t|t-1}. $$
These objects map directly to the feature-engineering language:
- $\hat x_{t|t}$ is the filtered state estimate,
- $\nu_t$ is the innovation, or one-step prediction error,
- $P_{t|t}$ is the posterior uncertainty,
- $K_t$ is the adaptation weight that decides how much the estimate moves.
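The recursion translates almost line-for-line into NumPy. The sketch below is our own minimal implementation of the predict-update cycle above (the function name and argument order are ours, not from any particular library):

```python
import numpy as np

def kalman_step(x, P, y, F, H, Q, R):
    """One predict-update cycle of the linear Kalman filter.

    x, P : posterior mean and covariance from the previous step
    y    : new observation vector
    """
    # Predict: propagate the state and inflate uncertainty by Q
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: weigh the innovation by the Kalman gain
    nu = y - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ nu
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, nu, S
```

Returning $\nu_t$ and $S_t$ alongside the state is deliberate: those are exactly the objects the feature pipeline consumes.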
In the scalar local-level model, where the latent state is just a smooth underlying level,
$$ x_t = x_{t-1} + w_t, \qquad y_t = x_t + v_t, $$
the update becomes
$$ \hat x_{t|t} = \hat x_{t|t-1} + K_t (y_t - \hat x_{t|t-1}). $$
This looks like exponential smoothing, but with a time-varying weight $K_t$, and that is the simplest way to understand the filter: the Kalman gain is a data-driven smoothing weight.
- high $K_t$ means "trust the new observation,"
- low $K_t$ means "treat the new observation as mostly noise."
Riccati Geometry and Steady-State Gain
The most important object in the filter is not the state estimate itself but the covariance recursion
$$ P_{t|t-1} = F P_{t-1|t-1} F^\top + Q, \qquad P_{t|t} = (I-K_tH)P_{t|t-1}. $$
That recursion is a discrete Riccati equation. It tells you how uncertainty evolves under the competing forces of state drift and measurement information. In the scalar local-level model with $F=H=1$, the gain is
$$ K_t = \frac{P_{t-1|t-1}+Q}{P_{t-1|t-1}+Q+R}. $$
So the smoothing behavior is not a hand-tuned moving-average window. It is the endogenous consequence of uncertainty propagation. If $Q/R$ is small, $K_t$ converges to a low steady-state value and the filter behaves like a very smooth exponential smoother. If $Q/R$ is large, the fixed point implies a high gain and the filter becomes fast but noisy.
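The fixed point can be found directly by iterating the scalar recursion until the gain settles. This is our own quick sketch (the parameter values are illustrative, not calibrated to any market):

```python
def steady_state_gain(Q, R, P0=1.0, tol=1e-12):
    """Iterate the scalar local-level Riccati recursion to its fixed point."""
    P = P0
    while True:
        P_pred = P + Q                  # predict: uncertainty grows by Q
        K = P_pred / (P_pred + R)       # gain at this level of uncertainty
        P_new = (1 - K) * P_pred        # update: observation shrinks uncertainty
        if abs(P_new - P) < tol:
            return K
        P = P_new

# Low Q/R: small steady-state gain, heavy smoothing.
# High Q/R: gain near 1, fast but noisy tracking.
print(steady_state_gain(Q=0.01, R=1.0))
print(steady_state_gain(Q=1.0, R=0.01))
```

The same fixed point is available in closed form for the local-level model, which makes a useful sanity check on any implementation.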
This is also why filtered uncertainty is itself a feature. A large innovation $\nu_t$ means something different when $S_t$ is tight than when the predictive distribution is already wide. The right normalization is the standardized innovation
$$ z_t = \nu_t / \sqrt{S_t}, $$
which is the quantity the model says should be approximately standard normal under correct specification.
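A quick simulation makes this concrete. In the sketch below (our own, with arbitrary illustrative parameters), the filter is run with the same $Q$ and $R$ that generated the data, so the standardized innovations should come out with roughly unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, R = 0.05, 0.64
n = 5000
x = np.cumsum(rng.normal(0, np.sqrt(Q), n))   # latent random-walk level
y = x + rng.normal(0, np.sqrt(R), n)          # noisy observations

x_hat, P = y[0], R                            # posterior after the first observation
z = []
for t in range(1, n):
    P_pred = P + Q
    S = P_pred + R
    nu = y[t] - x_hat
    z.append(nu / np.sqrt(S))                 # standardized innovation z_t
    K = P_pred / S
    x_hat += K * nu
    P = (1 - K) * P_pred

print(np.std(z))  # close to 1 under correct specification
```

Deviations of `np.std(z)` from 1, or visible autocorrelation in `z`, are exactly the misspecification diagnostics discussed below.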
Likelihood, Identification, and Causal Estimation
A serious implementation issue is that many different parameter quadruples $(F,H,Q,R)$ can generate similar filtered paths in finite samples. The model is not identified just because the optimizer returned a number. In practice, quants often impose structure for exactly this reason: local-level, local-linear-trend, or random-walk coefficient dynamics are not merely convenient; they reduce the parameter space to something interpretable.
Under Gaussian assumptions, the one-step innovations produce the log-likelihood
$$ \ell(\theta) = -\tfrac12 \sum_{t=1}^T\left(\log|S_t| + \nu_t^\top S_t^{-1}\nu_t\right) + C, $$
where $\theta$ denotes the model parameters and $C$ absorbs constants. This is the estimation route behind maximum likelihood and EM-style state-space fitting. It also reveals the correct diagnostic boundary: if the standardized innovations still show serial correlation, remaining heteroskedasticity, or obvious non-Gaussian structure, then the state-space specification is not just imperfect, it is misallocating variation between latent state and measurement noise.
For feature engineering, the practical consequence is strict. Estimate the structural parameters on the training window, carry them forward, and only then update the state online. Re-estimating on the full sample produces cleaner-looking state histories at the cost of silently changing the gain with future information.
Worked Example
Suppose a latent signal drifts gradually, but the observed series includes large day-to-day noise. A trailing moving average will either be too slow or too jittery depending on window length. The Kalman filter adjusts the weight dynamically using its uncertainty estimate.
The small example below simulates a drifting latent level and applies a scalar Kalman filter:
```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
true_level = np.cumsum(rng.normal(0.05, 0.15, size=n))
observed = true_level + rng.normal(0, 0.8, size=n)

Q = 0.05  # assumed process variance (deliberately approximate)
R = 0.64  # observation variance (matches simulation)

x_hat = np.zeros(n)
P = np.zeros(n)
x_hat[0] = observed[0]
P[0] = 1.0

for t in range(1, n):
    x_pred = x_hat[t - 1]
    P_pred = P[t - 1] + Q
    innovation = observed[t] - x_pred
    S = P_pred + R
    K = P_pred / S
    x_hat[t] = x_pred + K * innovation
    P[t] = (1 - K) * P_pred
```
If you plot observed, true_level, and x_hat, the filtered estimate will be smoother than the raw observations but
faster than a long moving average. The important object is not only the smoothed line. The innovation
observed[t] - x_pred becomes a feature in its own right. Large innovations indicate that the incoming observation is
surprising relative to the model's prior expectation. In trading terms, that can mark either a temporary dislocation or
the start of a new regime.
Figure Specification
Use a three-panel conceptual figure:
- Panel A: noisy observations around a smooth latent level.
- Panel B: predict-update flowchart showing prior state, new observation, innovation, and updated state.
- Panel C: two filters on the same series, one with low $Q/R$ and one with high $Q/R$, to show the responsiveness trade-off visually.
Caption: The Kalman filter estimates a hidden state by combining a model-based prediction with a noisy observation, with the gain determined by relative uncertainty.
Common Mistakes
WRONG: Treat the Kalman filter as just a complicated moving average.
CORRECT: It is a probabilistic state estimator. Smoothing is a consequence of the latent-state model and the noise assumptions, not the whole story.
WRONG: Use smoothed state estimates from a two-sided pass when building production features.
CORRECT: Use filtered estimates that only condition on information available up to time $t$. Smoothers use future observations and create look-ahead bias.
WRONG: Tune $Q$ and $R$ by eye on the full sample, then treat the resulting features as point-in-time correct.
CORRECT: Estimate or calibrate the noise parameters inside each training fold and carry them forward causally.
WRONG: Interpret a large innovation as automatically tradable alpha.
CORRECT: A large innovation only means the observation was surprising under the current model. It may signal mean reversion, breakout behavior, or model misspecification.
Connections
- Book chapters: Ch09 Model-Based Feature Extraction; Ch13 Deep Learning for Time Series
- Related primers: hidden-markov-models-and-regime-detection.md, garch-family-models.md, fractional-differencing.md
- Why it matters next: this is the statistical backbone for kalman_level, kalman_trend, kalman_innovation, uncertainty-aware hedge ratios, and other fitted-object features used downstream in ML pipelines.