State Space Models: From Kalman Intuition to Mamba
State space models compress the past into a latent state that is updated recursively, turning long-context sequence processing from a quadratic attention problem into a controlled linear dynamical system — and selective variants like Mamba let the model decide which inputs deserve to update that memory and which should be forgotten.
Supports chapters: 13, 9
Book coverage recap: Chapter 13 positions SSMs (S4, Mamba) alongside Transformers in the architecture decision space for long-context time-series modeling. Chapter 9 introduces Kalman filtering as a model-based feature extraction tool.
This primer adds: The conceptual bridge from classical state-space filtering to modern deep SSMs, with emphasis on what selectivity means economically (not all market events deserve the same memory persistence), how discretization connects continuous dynamics to sampled data, and when SSMs are the right architecture choice versus Transformers or simpler alternatives.
Prerequisites: Basic linear algebra (matrix multiplication), recurrent neural network concept (optional)
Related primers: Making Transformers Time-Aware (Ch 13), State-Space Models and Kalman Filtering (Ch 9), Regime Models (Ch 9)
The intuition
Transformers relate every token to every other token — powerful but expensive for long sequences. State space models take a different route: compress the past into a latent state that is updated one step at a time.
The bridge readers often miss is that modern SSMs descend from classical filtering:
a hidden state stores what the past implies about the present, and each new observation updates that state through learned dynamics.
The difference is that modern models learn far richer transition rules and efficient parallel scan implementations, rather than committing to a small, fixed Kalman specification.
Classical State Space Form
A linear state space model is
$$ h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t + D x_t, $$
where:
- \(x_t\) is the input
- \(h_t\) is the latent state
- \(y_t\) is the output
- \(A,B,C,D\) define the dynamics and readout
This already contains the core trade-off.
- If the state dimension is small, the model is cheap and stable.
- If the state dynamics are too simple, the model forgets important long-range structure.
Kalman filtering adds probabilistic assumptions and optimal inference under linear-Gaussian noise. The full filtering machinery is not needed here. What matters is the intuition that a sequence model can be built by learning how information accumulates in a hidden dynamical system.
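The recurrence and readout above can be written directly in NumPy. This is a toy sketch with illustrative random matrices, not a trained or calibrated model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: scalar input/output, 4-dimensional latent state.
d_state = 4
A = 0.9 * np.eye(d_state)             # stable transition (eigenvalues < 1)
B = rng.standard_normal((d_state, 1))
C = rng.standard_normal((1, d_state))
D = np.zeros((1, 1))

def ssm_step(h, x):
    """One step of h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t."""
    h = A @ h + B @ x
    y = C @ h + D @ x
    return h, y

h = np.zeros((d_state, 1))
xs = rng.standard_normal((10, 1, 1))  # sequence of 10 scalar inputs
ys = []
for x in xs:
    h, y = ssm_step(h, x)
    ys.append(y.item())
```

The entire past is summarized in the 4 numbers of `h`: the per-step cost is constant in sequence length, which is exactly the property the rest of this primer builds on.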
Why discretization matters
Modern SSMs typically start from a continuous-time dynamical system and convert it to discrete steps for sampled data. The key idea is simple: between consecutive observations, the hidden state evolves according to the continuous dynamics, and discretization determines how the model's internal clock relates to the data's sampling rate.
The discretization step $\Delta$ controls how the continuous dynamics map to discrete updates. A large $\Delta$ means each step covers more time — the model "sees" coarser dynamics. A small $\Delta$ means finer resolution but potentially more steps.
This matters for three reasons:
- Stability. The discrete transition matrix $\bar{A}$ must have eigenvalues inside the unit circle, or the state explodes. The discretization method determines whether this is guaranteed.
- Memory decay. How quickly the model forgets old inputs depends on $\bar{A}$, which depends on both the learned continuous dynamics $A$ and the step size $\Delta$. The same continuous system discretized at different rates produces different effective memory horizons.
- Frequency resolution. The Nyquist limit applies: the discrete model cannot represent frequencies above $1/(2\Delta)$. For financial data with mixed-frequency inputs (tick data alongside daily macro), the discretization must match the relevant timescale.
This is why SSMs are more than "another recurrent net." The dynamical-system view constrains how memory is stored and propagated, and the discretization links the model's internal timescale to the data's sampling structure.
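The stability and memory-decay points can be seen in a minimal sketch of zero-order-hold (ZOH) discretization for a diagonal system. The matrices here are hand-chosen illustrations, not values from any trained model:

```python
import numpy as np

# Continuous-time diagonal dynamics dh/dt = A h + B x, with negative
# eigenvalues so the continuous system is stable (values illustrative).
A = np.diag([-0.1, -1.0, -10.0])   # slow, medium, fast channels
B = np.ones((3, 1))

def zoh_discretize(A, B, delta):
    """Zero-order hold: A_bar = exp(delta A), B_bar = A^{-1}(A_bar - I) B."""
    A_bar = np.diag(np.exp(delta * np.diag(A)))   # diagonal A: elementwise exp
    B_bar = np.linalg.inv(A) @ (A_bar - np.eye(A.shape[0])) @ B
    return A_bar, B_bar

for delta in (0.01, 1.0):
    A_bar, _ = zoh_discretize(A, B, delta)
    lam = np.diag(A_bar)
    # Stable continuous dynamics give |lambda| < 1 after ZOH: no explosion.
    assert np.all(np.abs(lam) < 1.0)
    # Effective memory horizon: half-life in steps = ln 2 / (-ln |lambda|).
    print(f"delta={delta}: half-lives = {np.log(2) / -np.log(np.abs(lam))}")
```

The same continuous $A$ produces very different half-lives (in steps) under the two step sizes, which is the memory-decay point above in numbers. For non-diagonal $A$, `scipy.linalg.expm` would replace the elementwise exponential.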
From Recurrence to Efficient Sequence Mixing
A naive recurrence still has a sequential bottleneck:
$$ h_t \leftarrow \bar{A} h_{t-1} + \bar{B} x_t. $$
So why do SSMs help?
Because many structured SSMs can be rewritten as convolutions or scan operations with efficient parallel implementation. The same linear dynamical system can often be viewed in two ways:
- recurrent mode: update state one step at a time
- convolutional / scan mode: apply an equivalent kernel or associative update efficiently across the sequence
That gives the key payoff for time-series applications: long-context modeling with \(O(L)\) or near-linear cost rather than \(O(L^2)\) attention cost.
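The equivalence of the two modes can be checked numerically on a small system (all matrices are illustrative random values; real implementations use FFT-based convolution or parallel scans rather than this naive loop):

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 3, 16
A_bar = 0.8 * np.eye(d) + 0.05 * rng.standard_normal((d, d))
B_bar = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))
x = rng.standard_normal(L)

# Recurrent mode: one state update per step.
h = np.zeros((d, 1))
y_rec = np.empty(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional mode: y_t = sum_k (C A_bar^k B_bar) x_{t-k},
# i.e. convolve the input with the kernel k_j = C A_bar^j B_bar.
kernel = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item()
                   for k in range(L)])
y_conv = np.array([kernel[:t + 1][::-1] @ x[:t + 1] for t in range(L)])

assert np.allclose(y_rec, y_conv)
```

Because the recurrence is linear and time-invariant, its impulse response is a fixed kernel, and that is what lets structured SSMs trade the sequential loop for a parallelizable convolution or scan.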
Structured State Spaces and Long Memory
Models such as S4 constrain the transition matrix so the dynamics remain expressive but numerically tractable. Not every practitioner needs to derive HiPPO matrices from first principles. The important lessons are that:
- carefully structured dynamics can preserve information over long horizons
- those dynamics can still be evaluated efficiently
- this creates a different scaling frontier from both RNNs and Transformers
For financial sequences, that matters when the lookback window becomes very long:
- minute bars over months
- order-flow histories
- multi-resolution volatility and regime information
Simple RNNs forget too aggressively; attention can become too expensive; structured SSMs offer a third option.
What "Selective" Means in Mamba
The classical linear SSM uses fixed \(A,B,C\). In Mamba, the effective discrete-time dynamics become input dependent through learned discretization and input-conditioned \(B\) and \(C\) terms. Schematically, the model behaves like
$$ h_t = \bar{A}(x_t) h_{t-1} + \bar{B}(x_t) x_t, \qquad y_t = C(x_t) h_t, $$
The notation is schematic: in the original Mamba architecture, the continuous-time matrix \(A\) stays fixed, and selectivity enters mainly through the input-dependent step size \(\Delta(x_t)\) and the input-conditioned \(B\) and \(C\) projections, rather than by making \(A\) itself an arbitrary function of \(x_t\).
This is the crucial move. A fixed transition rule treats all inputs as if they should be absorbed by the same memory rule. A selective SSM lets the model decide:
- which inputs should update memory strongly
- which should be suppressed as noise
- which timescales matter in the current context
That is why "selective" is not just marketing language. It means the model no longer stores the past with one stationary recurrence. It gates the state dynamics using the current input.
For finance, this is economically meaningful. Not every observation deserves the same persistence:
- A routine intraday fluctuation should decay from memory quickly.
- A central bank rate decision should persist for weeks or months.
- A regime shift should update the state strongly and reset expectations.
A fixed-dynamics SSM treats all inputs the same. A selective SSM can learn to gate: absorb policy shocks deeply into the state while letting microstructure noise pass through without altering long-term memory.
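A deliberately simplified scalar sketch makes the gating idea concrete. The softplus step-size parameterization echoes Mamba's, but the weights `w_delta` and `b_delta` are made-up illustrative values, not a real trained layer:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

# Toy selective update for a scalar state: the step size (and hence the
# write strength and decay) depends on the current input's magnitude.
a = -1.0        # continuous decay rate
w_delta = 2.0   # hypothetical learned weight: input -> step size
b_delta = -4.0  # hypothetical bias so small inputs produce tiny steps

def selective_step(h, x):
    delta = softplus(w_delta * abs(x) + b_delta)  # big inputs -> big step
    a_bar = np.exp(delta * a)                     # input-dependent retention
    b_bar = (a_bar - 1.0) / a                     # ZOH input scaling
    return a_bar * h + b_bar * x

h = 1.0
h_noise = selective_step(h, 0.01)  # routine fluctuation: state barely moves
h_shock = selective_step(h, 5.0)   # large event: state updates strongly
assert abs(h_noise - h) < abs(h_shock - h)
```

A large input both overwrites the state more strongly and decays the old memory faster, while a small input leaves the state almost untouched: one mechanism implements "absorb shocks, ignore noise."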
Multi-scale selective dynamics
Mamba-2 and related architectures extend selectivity to operate across multiple timescales simultaneously. The intuition: financial markets generate structure at multiple frequencies — tick-level order flow, intraday volatility patterns, daily momentum, weekly sector rotation, monthly macro cycles. A single state-update rule cannot capture all of these efficiently.
Multi-scale SSMs address this by maintaining state components with different effective decay rates. Some dimensions of the state vector evolve slowly (capturing long-horizon structure), while others update rapidly (capturing short-lived patterns). The selective mechanism then determines not just how strongly to update, but which timescale channels to route information into.
This is analogous to the HAR (Heterogeneous Autoregressive) model's insight: realized volatility at daily, weekly, and monthly frequencies captures different aspects of the volatility process. A multi-scale SSM learns a similar decomposition but without pre-specifying the relevant horizons.
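The HAR analogy can be made concrete with fixed, hand-chosen decay channels; a real multi-scale SSM would learn these rates rather than pre-specifying them:

```python
import numpy as np

# Three state channels with half-lives of roughly 1, 5, and 21 steps
# (daily / weekly / monthly in trading-day units; illustrative choice).
half_lives = np.array([1.0, 5.0, 21.0])
decay = 0.5 ** (1.0 / half_lives)   # per-step retention per channel

def multiscale_step(h, x):
    """Each channel keeps its own exponentially weighted view of the input."""
    return decay * h + (1.0 - decay) * x

rng = np.random.default_rng(2)
x_seq = rng.standard_normal(500)
h = np.zeros(3)
hs = np.empty((500, 3))
for t, x in enumerate(x_seq):
    h = multiscale_step(h, x)
    hs[t] = h

# For i.i.d. inputs, longer memory means a smoother channel (lower variance).
assert hs[:, 0].std() > hs[:, 2].std()
```

The selective mechanism described above then sits on top of such channels, routing each input into the timescales where it belongs instead of updating all channels uniformly.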
A Worked Example
Suppose you model a long sequence of intraday ETF returns and volumes.
Transformer view
The model asks which past timestamps each current position should attend to. This is flexible, but memory and compute grow quickly with sequence length.
RNN view
The model compresses all past information into one hidden state updated sequentially. This is cheap per step but can forget long-range structure and cannot parallelize well.
SSM view
The model treats the sequence as the output of a learned dynamical system. The latent state evolves linearly or selectively, and the full sequence can be processed with a fast scan.
If the strategy hypothesis depends on long volatility-memory effects or multi-scale order-flow patterns, the SSM route can preserve long context without the quadratic attention bill.
That does not make it automatically superior. If the real edge comes from cross-asset interactions at modest context lengths, an attention model may still be the better choice. Hybrid architectures that mix attention with SSM layers are an active compromise, but finance evidence there is still thin.
When SSMs Are Attractive in Finance
SSMs are most attractive when:
- context length is genuinely large
- temporal order matters more than broad cross-token interaction
- latency and memory constraints are real
- the signal may live across multiple timescales
They are less obviously attractive when:
- context lengths are short enough that attention cost is still manageable
- the dominant structure is cross-sectional rather than temporal
- a strong linear or boosted baseline already captures the available signal
So the practical question is not "are SSMs better than Transformers?" It is:
is my bottleneck long sequential memory, or is it something else?
In Practice
Use these rules:
- start from the state-space intuition, not the architecture hype
- ask whether long context is actually part of the economic hypothesis
- benchmark against strong simpler models before paying the implementation cost
- separate temporal-memory advantages from cross-sectional modeling advantages
- treat single-paper finance gains as provisional until they survive walk-forward evaluation
Common Mistakes
- Thinking SSMs are just recurrent nets with new branding.
- Ignoring the discretization step and then treating stability as a tuning nuisance.
- Using long-context architectures when the sequence length is not the real problem.
- Assuming linear-time complexity automatically means better forecasting.
- Forgetting that most published finance evidence for modern SSMs is still thin.
Where it fits in ML4T
Chapter 9 introduces classical state-space models (Kalman filtering) for feature extraction — the primer on State-Space Models and Kalman Filtering covers that machinery. This primer builds the bridge to Chapter 13's modern deep SSMs, where the same latent-state intuition is scaled up with learned dynamics, selective updates, and efficient parallel scans. The Transformers primer (Ch 13) covers the alternative attention-based paradigm and the decision framework for choosing between paradigms. For the broader question of when deep architectures add value over simpler baselines, the linear baseline discussion in Chapter 13 is the honest starting point.