Chapter 9: Model-Based Feature Extraction

Autoregressive, Moving-Average, and ARIMA Foundations for Feature Engineering

ARIMA is rarely the star predictor in liquid markets, but it is still one of the cleanest ways to separate level, persistence, shock, and forecast uncertainty before downstream models take over.

The Intuition

Chapter 9 treats ARIMA as a supporting tool rather than a grand forecasting solution. That is the right stance for finance, but readers still need the underlying language:

  • AR says the present depends on its own past
  • MA says the present depends on past shocks
  • ARIMA adds differencing to handle nonstationary levels

The reason this matters for feature engineering is not "ARIMA beats modern ML." It is that these models create interpretable objects:

  • residuals after simple linear structure is removed
  • one-step forecasts for ancillary series like realized volatility or spreads
  • forecast errors and interval widths
  • persistence summaries and mean-reversion strength

Those are often more useful than the raw fitted model itself.

For daily returns, the natural starting point is often ARMA on an already approximately stationary series. ARIMA becomes more plausible for levels or other ancillary series that only look stable after differencing.

AR(p): Persistence in the Series Itself

An autoregressive model of order \(p\) is

$$ x_t = c + \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + \varepsilon_t. $$

The current value depends on lagged values plus a new innovation \(\varepsilon_t\).

Interpretation:

  • positive \(\phi_1\) suggests persistence
  • negative \(\phi_1\) suggests mean reversion or oscillation
  • higher-order terms allow richer decay patterns

For stationarity, the AR dynamics must not explode. In AR(1), that means

$$ |\phi_1| < 1. $$

For general AR(\(p\)), the same idea is stated through the lag polynomial \(\phi(z)=1-\phi_1 z-\cdots-\phi_p z^p\): all roots of \(\phi(z)=0\) must lie outside the unit circle. That condition ensures shocks die out rather than building up indefinitely.

When a root lies exactly on the unit circle, you are in unit-root territory. When a root lies inside the unit circle, the process is explosive. Those are different failure modes, even though both break the stationary AR story.
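
The root condition is easy to check numerically. A minimal sketch, using numpy only (`ar_is_stationary` is an illustrative helper, not a library function):

```python
import numpy as np

def ar_is_stationary(phis):
    """Check whether an AR(p) model with coefficients phis is stationary.

    The lag polynomial is phi(z) = 1 - phi_1 z - ... - phi_p z^p;
    stationarity requires every root of phi(z) = 0 to lie strictly
    outside the unit circle.
    """
    # np.roots expects the highest-degree coefficient first:
    # [-phi_p, ..., -phi_1, 1]
    coeffs = np.r_[-np.asarray(phis, dtype=float)[::-1], 1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(ar_is_stationary([0.6]))       # AR(1), |phi| < 1 -> True
print(ar_is_stationary([1.0]))       # unit root -> False
print(ar_is_stationary([0.5, 0.3]))  # AR(2), both roots outside -> True
```

The same check, with the sign conventions flipped, applies to the MA invertibility condition discussed below.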

MA(q): Persistence in the Shock Process

A moving-average model of order \(q\) is

$$ x_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}. $$

This is not "a moving average" in the rolling-window sense. It is a model where today's value is a weighted sum of recent innovations.

The intuition:

  • AR captures persistence in the observed level
  • MA captures persistence in the shock transmission mechanism

For identification, the MA polynomial must also be invertible: the roots of \(\theta(z)=1+\theta_1 z+\cdots+\theta_q z^q\) should lie outside the unit circle. Otherwise different parameter values can describe the same observed process.

That distinction is useful in finance because a series can look serially structured either because the state evolves slowly or because the impact of shocks is spread over several periods.
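
One way to see the MA signature directly is that an MA(q) process has autocorrelation that cuts off exactly after lag \(q\), whereas AR autocorrelation decays geometrically. A small simulation sketch of MA(1), numpy only, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 50_000, 0.8

# MA(1): x_t = eps_t + theta * eps_{t-1}
eps = rng.standard_normal(n + 1)
x = eps[1:] + theta * eps[:-1]

def acf(series, lag):
    """Sample autocorrelation of a series at a given lag."""
    s = series - series.mean()
    return float(np.dot(s[lag:], s[:-lag]) / np.dot(s, s))

rho1, rho2 = acf(x, 1), acf(x, 2)
# theory: rho1 = theta / (1 + theta^2) ~= 0.49, rho2 = 0 beyond lag q = 1
print(f"lag-1 acf ~= {rho1:.2f}, lag-2 acf ~= {rho2:.2f}")
```

The lag-2 autocorrelation collapsing to zero while lag 1 stays strong is exactly the shock-transmission story, not the slowly-evolving-state story.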

ARMA(p, q): Both Kinds of Memory

For a stationary and invertible series, AR and MA terms are combined as

$$ x_t = c + \sum_{i=1}^p \phi_i x_{t-i} + \varepsilon_t + \sum_{j=1}^q \theta_j \varepsilon_{t-j}. $$

ARMA is often the cleanest baseline for approximately stationary inputs such as spreads, realized volatility, funding rates, or basis measures. The point is not to search for elaborate high-order structures. In practice, low orders often suffice:

  • ARMA(1,1)
  • ARMA(2,1)
  • ARMA(1,2)

Those already give you model-based residuals and one-step forecasts.
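
The residual and one-step-forecast extraction can be sketched with the ARMA(1,1) innovation recursion. This toy version assumes the parameters are already known; in practice they would be estimated inside each training window:

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, phi, theta = 10_000, 0.1, 0.5, 0.3

# simulate an ARMA(1,1): x_t = c + phi x_{t-1} + eps_t + theta eps_{t-1}
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = c + phi * x[t - 1] + eps[t] + theta * eps[t - 1]

# innovation recursion: recover residuals and one-step forecasts,
# assuming (c, phi, theta) are known (illustrative, not estimated)
eps_hat = np.zeros(n)
fcast = np.zeros(n)  # fcast[t] = E[x_t | information through t-1]
for t in range(1, n):
    fcast[t] = c + phi * x[t - 1] + theta * eps_hat[t - 1]
    eps_hat[t] = x[t] - fcast[t]
```

After a short burn-in, `eps_hat` recovers the true innovations: the startup error at \(t=0\) decays geometrically at rate \(\theta\), which is the invertibility condition doing its job.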

Why ARIMA Adds Differencing

Financial levels are often nonstationary. Prices can behave like unit-root processes. Realized volatility can shift across regimes. Macro-like ancillary series may move in levels rather than around a fixed mean.

ARIMA handles that by applying differencing before the ARMA dynamics:

$$ \phi(L)(1-L)^d x_t = c + \theta(L)\varepsilon_t, $$

where:

  • \(L\) is the lag operator
  • \(d\) is the differencing order
  • \(\phi(L)\) and \(\theta(L)\) are AR and MA lag polynomials

This is not just a procedural trick. It means the original series is assumed to become approximately stationary after differencing order \(d\), and the ARMA structure is placed on that transformed series.

ARIMA(1,1,1), for example, means:

  • difference once to induce approximate stationarity
  • fit AR(1) and MA(1) dynamics to the differenced series

This is where the "I" comes from: integrated means the model is applied after differencing.

A Worked ARIMA(1,1,1) Interpretation

Suppose \(x_t\) is log realized volatility. Define the first difference

$$ \Delta x_t = x_t - x_{t-1}. $$

An ARIMA(1,1,1) model is

$$ \Delta x_t = c + \phi \Delta x_{t-1} + \varepsilon_t + \theta \varepsilon_{t-1}. $$

This is the scalar form of \((1-\phi L)(1-L)x_t = c + (1+\theta L)\varepsilon_t\).

This says:

  • the level is nonstationary enough to difference once
  • changes in the series may still be serially structured
  • recent shocks may echo into the next observation

Here the constant \(c\) is a drift term in the differenced equation, not a stationary mean for the level itself. With \(d=1\), a nonzero drift implies a deterministic trend component in the level path, so misreading the constant leads directly to bad multi-step forecasts.
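
The equivalence between the operator form and the scalar differenced equation can be verified numerically. A short numpy check, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(5)
n, c, phi, theta = 500, 0.02, 0.4, 0.25

# build Delta x_t = c + phi Delta x_{t-1} + eps_t + theta eps_{t-1}
eps = rng.standard_normal(n)
dx = np.zeros(n)
for t in range(1, n):
    dx[t] = c + phi * dx[t - 1] + eps[t] + theta * eps[t - 1]
x = np.cumsum(dx)  # integrate once to get the level series

# check the operator form (1 - phi L)(1 - L) x_t = c + (1 + theta L) eps_t
d = x[1:] - x[:-1]           # (1 - L) x_t
lhs = d[1:] - phi * d[:-1]   # (1 - phi L) applied to the differences
rhs = c + eps[2:] + theta * eps[1:-1]
print(np.allclose(lhs, rhs))  # True: the two forms are identical
```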

From a walk-forward fit in each training window, you can extract:

  • arima_forecast: the one-step-ahead level or change forecast
  • arima_residual: the innovation after simple linear dynamics are removed
  • forecast_se: the one-step forecast standard error
  • residual whiteness diagnostics such as Ljung-Box p-values

These are exactly the kinds of objects Chapter 9 wants.

What ARIMA Is Good For in Finance

For many liquid daily return series, low-order ARIMA mean equations are weak standalone predictors. That is not a bug. It is a stylized fact.

ARIMA still earns its keep in three places:

  1. Prewhitening. Remove simple linear dependence before fitting volatility or regime models.
  2. Ancillary-series forecasting. Forecast series that are more predictable than returns themselves, such as spreads, funding, or realized volatility.
  3. Residual features. Use deviations from the model's expected path as surprise variables.

This is why Chapter 9 places ARIMA in the feature-engineering stack rather than selling it as a complete alpha engine.

Order Selection Without Fooling Yourself

The temptation is to over-search on \(p,d,q\). That is exactly where ARIMA pipelines go bad.

Rules that keep the model useful:

  • start with low orders
  • difference only when diagnostics justify it
  • search orders inside each walk-forward training window
  • treat AIC/BIC as guides, not automatic truth

If you select \((p,d,q)\) on the full sample and then backtest the resulting residuals, the lag structure itself has leaked future information.

When Not to Difference

Differencing is not free. It can destroy slow-moving but still informative structure in persistent ancillary series.

That is why persistence alone is not enough to justify \(d=1\). Some volatility-like inputs are highly persistent without being well described as integrated level processes. If diagnostics suggest the series is already stationary or nearly so, start with ARMA and keep the lower-frequency content intact.

Diagnostics That Matter

A compact ARIMA diagnostic block should check:

  • residual autocorrelation: did the model actually remove the simple linear structure at individual lags?
  • residual whiteness: do joint tests such as Ljung-Box still reject?
  • coefficient stability across refits: are the dynamics changing too quickly for the model to be meaningful?
  • forecast error variance: is uncertainty itself informative?
  • whether the input should have been differenced at all
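
The whiteness check can be done with a hand-rolled Ljung-Box statistic; a numpy-only sketch (in practice statsmodels' acorr_ljungbox does this, including the p-value). The 18.31 threshold below is the chi-squared 95% critical value with 10 degrees of freedom:

```python
import numpy as np

def ljung_box_q(resid, h=10):
    """Ljung-Box Q statistic over lags 1..h for a residual series."""
    r = np.asarray(resid, dtype=float)
    n = len(r)
    r = r - r.mean()
    denom = np.dot(r, r)
    rho = np.array([np.dot(r[k:], r[:-k]) / denom for k in range(1, h + 1)])
    return float(n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, h + 1))))

rng = np.random.default_rng(4)
white = rng.standard_normal(1000)
ar1 = np.zeros(1000)
for t in range(1, 1000):
    ar1[t] = 0.5 * ar1[t - 1] + white[t]

q_white, q_ar1 = ljung_box_q(white), ljung_box_q(ar1)
# q_white is typically near 10 (its mean under whiteness), well below 18.31;
# q_ar1 is far above 18.31, so whiteness is rejected: structure remains
```

Residuals that fail this test mean the ARIMA layer left simple serial structure on the table, which is the first thing to fix before blaming downstream models.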

The goal is not a perfect Box-Jenkins textbook workflow. It is a useful, causal summary of linear dependence that leaves cleaner inputs for later models.

In Practice

Use ARIMA sparingly and deliberately:

  • for liquid returns, expect residuals to matter more than forecasts
  • for volatility-like or spread-like series, forecasts may be useful features
  • for downstream GARCH or regime models, ARIMA is often a preprocessing layer
  • keep orders small and refits walk-forward

If a low-order ARIMA cannot remove the most obvious serial dependence, that failure is itself a signal that the series needs a different class of model.

Common Mistakes

  • Treating MA as a rolling-window smoother instead of a shock process.
  • Letting automated order search run on the full sample.
  • Differencing automatically without asking what information is destroyed.
  • Using ARIMA forecasts on liquid returns as if they were strong standalone alphas.
  • Forgetting that residuals and forecast uncertainty are often the main feature outputs.

Connections

This primer supports Chapter 9's volatility and uncertainty sections. It connects directly to stationarity diagnostics, fractional differencing, GARCH/HAR feature extraction, and the broader question of when a simple linear temporal baseline is enough to clean the series before more flexible models are layered on top.
