Chapter 11: The ML Pipeline

Loss Functions, Error Metrics, and What They Hide

A model is trained to optimize one quantity, selected on another, and traded on a third. Most confusion in predictive modeling starts when those three layers are blurred together.

The Intuition

Chapter 11 introduces MSE, MAE, Huber loss, cross-entropy, IC, RMSE, and turnover. Those names can look like a grab bag. They are not. They belong to three different layers:

| Layer | Question | Typical objects |
| --- | --- | --- |
| Training loss | What quantity is the model fit to? | MSE, MAE, Huber, log-loss |
| Evaluation metric | How do we compare models on held-out data? | RMSE, MAE, log-loss, AUC, IC |
| Trading objective | How do predictions become PnL after frictions? | Turnover, net spread, Sharpe |

Serious mistakes happen when a metric from one layer is treated as if it answered a question from another.

A model can reduce MSE while making a worse trading strategy. A classifier can improve accuracy while becoming less useful for ranking. A forecast can have attractive RMSE but terrible tail control.

The right question is not "what metric is best?" It is:

what statistical functional does this loss target, and does that functional match the decision?

Losses Target Different Objects

Suppose \(Y\) is the future return and \(\hat{y}(x)\) is a point forecast conditional on features \(x\).

Squared Error Targets the Conditional Mean

$$ L_{\text{MSE}}(y,\hat{y}) = (y-\hat{y})^2. $$

The Bayes-optimal forecast under squared loss is

$$ \hat{y}^*(x) = \mathbb{E}[Y \mid X=x]. $$

So MSE is not "the default regression loss." It is the loss for the conditional mean.
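A minimal numerical check of that claim, assuming only numpy: scan constant forecasts over a grid and confirm that the squared-error minimizer sits at the sample mean, dragged upward by the one large return.

```python
import numpy as np

# Toy return sample; the 0.15 is one large move that pulls the mean upward.
y = np.array([-0.02, 0.01, 0.03, 0.15, 0.00])

# Evaluate every constant forecast on a fine grid.
candidates = np.linspace(-0.05, 0.20, 2501)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)

best = candidates[mse.argmin()]
print(best, y.mean())  # the MSE minimizer coincides with the sample mean, 0.034
```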

Absolute Error Targets the Conditional Median

$$ L_{\text{MAE}}(y,\hat{y}) = |y-\hat{y}|. $$

Its Bayes-optimal target is the conditional median:

$$ \hat{y}^*(x) = \operatorname{Median}(Y \mid X=x). $$

That is why MAE is more robust to a few extreme moves: medians move less than means.
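The same grid-search sketch (numpy assumed, same toy sample) shows the absolute-error minimizer landing on the median, barely moved by the extreme observation:

```python
import numpy as np

y = np.array([-0.02, 0.01, 0.03, 0.15, 0.00])  # same toy sample; median is 0.01

candidates = np.linspace(-0.05, 0.20, 2501)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best = candidates[mae.argmin()]
print(best, np.median(y))  # the MAE minimizer sits at the median, not the mean
```

The 0.15 outlier moved the mean to 0.034 but left the median at 0.01, which is the robustness claim in miniature.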

Huber Clips Tail Influence

Huber loss is quadratic near zero and linear in the tails:

$$ L_\delta(y,\hat{y}) = \begin{cases} \frac{1}{2}(y-\hat{y})^2, & |y-\hat{y}| \le \delta \\ \delta |y-\hat{y}| - \frac{1}{2}\delta^2, & |y-\hat{y}| > \delta. \end{cases} $$

It is a compromise: keep MSE's smoothness for typical residuals, cap the influence of outliers. Huber loss is best understood as a robust M-estimation objective with clipped tail influence, not as cleanly targeting the conditional mean or conditional median in the same way MSE and MAE do.
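The piecewise formula above translates directly into code (a sketch, numpy assumed). Evaluating it on one small and one large residual shows the tail being charged linearly rather than quadratically:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * r - 0.5 * delta**2)

r = np.array([0.5, 5.0])
print(huber(r))     # [0.125, 4.5]: the large residual costs 4.5 under Huber
print(0.5 * r**2)   # [0.125, 12.5]: the same residual under (half) squared error
```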

Log-Loss Targets Calibrated Probabilities

For binary labels \(y \in \{0,1\}\) and predicted probability \(p\),

$$ L_{\log}(y,p) = -y \log p - (1-y)\log(1-p). $$

The optimum is the true conditional event probability:

$$ p^*(x) = \mathbb{P}(Y=1 \mid X=x). $$

That is why log-loss is a proper scoring rule: it rewards honest probabilities, not just correct class labels.
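A quick numerical check of properness (numpy assumed, true probability hypothetical): fix a true event probability and scan reported probabilities; expected log-loss bottoms out exactly at the truth.

```python
import numpy as np

p_true = 0.3  # hypothetical true conditional event probability

grid = np.linspace(0.01, 0.99, 981)  # candidate reported probabilities
expected_loss = -(p_true * np.log(grid) + (1 - p_true) * np.log(1 - grid))

best = grid[expected_loss.argmin()]
print(best)  # the expected-log-loss minimizer is the true probability, 0.30
```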

Pinball Loss Targets a Conditional Quantile

For quantile level \(\tau \in (0,1)\),

$$ L_\tau(y,\hat{y}) = \begin{cases} \tau (y-\hat{y}), & y \ge \hat{y} \\ (1-\tau)(\hat{y}-y), & y < \hat{y}. \end{cases} $$

Its Bayes-optimal target is the \(\tau\)-th conditional quantile. That is why pinball loss is the natural objective when the decision focuses on tails rather than the center of the distribution.
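As a sketch (numpy assumed; seed, sample size, and grid are arbitrary), simulating draws and scanning constant forecasts confirms that the pinball minimizer tracks the empirical \(\tau\)-quantile rather than the center:

```python
import numpy as np

def pinball(y, yhat, tau):
    """Pinball (quantile) loss at quantile level tau."""
    diff = y - yhat
    return np.where(diff >= 0, tau * diff, (tau - 1) * diff)

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)  # stand-in return sample

tau = 0.9
candidates = np.linspace(-3.0, 3.0, 601)
losses = [pinball(y, c, tau).mean() for c in candidates]
best = candidates[np.argmin(losses)]
print(best, np.quantile(y, tau))  # both sit near the N(0,1) 90th percentile, ~1.28
```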

The Decision Table

| Training loss | Targets | Good when | Hides |
| --- | --- | --- | --- |
| MSE | conditional mean | symmetric cost of error, smooth optimization | extreme returns dominate the fit |
| MAE | conditional median | fat tails, stable center | the median can ignore economically important tails |
| Huber | robust center | mostly Gaussian core with occasional shocks | the threshold choice matters |
| Log-loss | calibrated probability | thresholding, probability sizing, risk-aware classification | low log-loss does not guarantee good ranking for long-short selection |
| Pinball loss | conditional quantile | tail forecasting, top/bottom-decile decisions | a single quantile ignores the full distribution |

The key point is that different losses estimate different decision-relevant objects. There is no universal champion.

Evaluation Metrics Are Not Training Losses

A model may be fit with one loss and evaluated with another. That is often correct.

RMSE and MAE

These summarize numeric forecast error:

$$ \operatorname{RMSE} = \sqrt{\frac{1}{n}\sum (y_i-\hat{y}_i)^2}, \qquad \operatorname{MAE} = \frac{1}{n}\sum |y_i-\hat{y}_i|. $$

RMSE is scale-sensitive and punishes large misses heavily. MAE is more stable but less sensitive to tail failures.

Neither has a universal threshold for "good." Both are meaningful only relative to a benchmark and a label definition.
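One illustration of that sensitivity difference, with hypothetical error vectors (numpy assumed): a single large miss moves RMSE far more than MAE.

```python
import numpy as np

# 100 forecast errors of 1% each, then the same vector with one 50% miss.
errors_clean = np.full(100, 0.01)
errors_tail = errors_clean.copy()
errors_tail[0] = 0.50

rmse = lambda e: np.sqrt((e ** 2).mean())
mae = lambda e: np.abs(e).mean()

print(rmse(errors_clean), rmse(errors_tail))  # 0.0100 -> ~0.0510, a fivefold jump
print(mae(errors_clean), mae(errors_tail))    # 0.0100 -> 0.0149, a modest rise
```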

\(R^2\)

$$ R^2 = 1 - \frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}. $$

It measures fit relative to the constant-mean predictor. In noisy financial prediction problems, \(R^2\) can be tiny even when the model is economically useful. A small positive \(R^2\) does not mean uselessness; a negative \(R^2\) means the benchmark beat you.

Accuracy and AUC

Accuracy ignores calibration and confidence. It also depends on the classification threshold. AUC checks pairwise ranking quality across thresholds, which is often better, but AUC can still be irrelevant for a strategy that trades only the most extreme predicted names.

IC

For cross-sectional trading, Spearman IC often matters more than RMSE because the portfolio cares about rank ordering, not absolute return scale. That is why Chapter 11 treats IC as the primary evaluation metric for ranked signals and keeps turnover alongside it.
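A rank IC for one cross-section can be computed with nothing more than numpy, assuming no ties in either vector (a production implementation should handle ties, e.g. via `scipy.stats.spearmanr`; the data here are hypothetical):

```python
import numpy as np

def spearman_ic(pred, realized):
    """Spearman IC: Pearson correlation of the two rank vectors (no-ties case)."""
    rank_p = pred.argsort().argsort()
    rank_r = realized.argsort().argsort()
    return np.corrcoef(rank_p, rank_r)[0, 1]

# Hypothetical one-date cross-section: model scores and realized next-period returns.
pred = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
realized = np.array([0.02, -0.01, 0.00, 0.04, 0.01])
print(spearman_ic(pred, realized))  # 0.9: the ordering is nearly, not perfectly, right
```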

So the metric layer already depends on the trading map:

  • absolute error metrics for forecasting scale
  • rank metrics for cross-sectional sorting
  • probability metrics for classification and calibration

A Worked Mismatch Example

Suppose you predict next-month stock returns for 1,000 names and trade only the top and bottom 50.

Two models:

  • Model A improves MSE by better fitting the middle 900 names
  • Model B leaves MSE unchanged but improves the ordering in the top and bottom tails

If you select with MSE alone, you may choose Model A even though its traded deciles are worse. The problem is not that MSE is "wrong." MSE is a sum over all observations, so improvements in the dense middle of the cross-section dominate the loss even when the traded tails deteriorate: it targets the conditional mean over the full cross-section, while the strategy only cares about tail ordering.
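The "dense middle dominates the sum" point survives back-of-envelope arithmetic. All error magnitudes below are hypothetical: shrink the per-name error on 900 middle names slightly while the 100 traded tail names get 25% worse, and total MSE still falls.

```python
# Hypothetical per-name absolute errors, before and after switching to Model A.
n_mid, n_tail = 900, 100
mid_before, mid_after = 0.010, 0.008    # middle errors shrink a little
tail_before, tail_after = 0.020, 0.025  # traded-tail errors grow 25%

delta_mse = (n_mid * (mid_after**2 - mid_before**2)
             + n_tail * (tail_after**2 - tail_before**2)) / (n_mid + n_tail)
print(delta_mse)  # negative: aggregate MSE improves although the traded tails got worse
```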

The same mismatch appears in classification:

  • a model with better accuracy near the 50% threshold may be worse at ranking the highest-confidence longs and shorts
  • a model with lower log-loss may still create more turnover and worse net PnL

This is why the model-selection layer and the trading-evaluation layer must remain separate.

Why Proper Scoring Rules Matter

A loss is proper if truthful reporting of the target object minimizes expected loss. MSE is proper for the conditional mean functional, while log-loss is a strictly proper scoring rule for the full Bernoulli probability forecast. Accuracy is not proper, because the best way to maximize accuracy is often to report hard labels rather than honest probabilities.

That distinction matters whenever outputs are used for sizing, thresholding, or uncertainty monitoring. If you want probabilities you can trust, optimize and evaluate with probability-aware metrics such as log-loss and calibration diagnostics, not accuracy alone.
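A small expected-value calculation makes the contrast concrete (numpy assumed; the true probability is hypothetical): expected accuracy is flat across every report above 0.5, while expected log-loss is minimized only by the honest probability.

```python
import numpy as np

p_true = 0.7  # hypothetical true event probability
reports = np.array([0.55, 0.70, 0.95])  # candidate reported probabilities

# Thresholding at 0.5: every report above 0.5 predicts class 1, right 70% of the time.
expected_accuracy = np.where(reports > 0.5, p_true, 1 - p_true)

expected_logloss = -(p_true * np.log(reports) + (1 - p_true) * np.log(1 - reports))

print(expected_accuracy)                    # [0.7 0.7 0.7]: accuracy cannot separate them
print(reports[expected_logloss.argmin()])   # 0.7: log-loss rewards the honest report
```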

What the Metrics Hide

Every metric suppresses some dimension of the problem:

  • MSE/RMSE hide whether the errors occur in tradable tails or irrelevant middle names
  • MAE hides whether rare large misses are catastrophic
  • Accuracy hides confidence and class imbalance
  • AUC hides calibration and threshold-specific business value
  • IC hides turnover, capacity, and spread compression
  • Sharpe hides tail asymmetry and path risk

That is why one-number model ranking is dangerous in finance. The right evaluation block is a small set of complementary diagnostics aligned with the decision rule.

In Practice

Use this sequence:

  1. Choose the estimand first: mean, median, quantile, or probability.
  2. Pick a training loss that targets that object.
  3. Pick held-out metrics that reflect the signal-to-trade mapping.
  4. Report turnover and cost sensitivity separately rather than pretending the statistical metric absorbed them.

For Chapter 11-style baselines:

  • use MSE or Huber when return magnitudes matter
  • use MAE when fat tails distort the mean estimate
  • use log-loss when the output is a probability
  • use IC when the strategy trades ranks
  • never confuse a lower training loss with better net trading performance

Common Mistakes

  • Treating training loss, validation metric, and trading objective as the same thing.
  • Using accuracy to judge probability forecasts.
  • Using RMSE as if it directly measured tradability.
  • Declaring MAE "more robust" without acknowledging that it changes the estimand from mean to median.
  • Comparing metrics across different label definitions or sample universes.

Connections

This primer supports Chapter 11's sections on regression, logistic regression, and model evaluation. It connects directly to quantile regression, IC/ICIR, probability calibration, conformal prediction, and the later chapters on transaction costs and portfolio construction where statistical fit finally meets implementation.
