Chapter 11: The ML Pipeline

Loss Functions, Error Metrics, and What They Hide

A model is trained to optimize one quantity, selected on another, and traded on a third. Most confusion in predictive modeling starts when those three layers are blurred together.

The Intuition

Chapter 11 introduces MSE, MAE, Huber loss, cross-entropy, IC, RMSE, and turnover. Those names can look like a grab bag. They are not. They belong to three different layers:

| Layer | Question | Typical objects |
| --- | --- | --- |
| Training loss | What quantity is the model fit to? | MSE, MAE, Huber, log-loss |
| Evaluation metric | How do we compare models on held-out data? | RMSE, MAE, log-loss, AUC, IC |
| Trading objective | How do predictions become PnL after frictions? | Turnover, net spread, Sharpe |

Serious mistakes happen when a metric from one layer is treated as if it answered a question from another.

A model can reduce MSE while making a worse trading strategy. A classifier can improve accuracy while becoming less useful for ranking. A forecast can have attractive RMSE but terrible tail control.

The right question is not "what metric is best?" It is:

what statistical functional does this loss target, and does that functional match the decision?

Losses Target Different Objects

Suppose \(Y\) is the future return and \(\hat{y}(x)\) is a point forecast conditional on features \(x\).

Squared Error Targets the Conditional Mean

$$ L_{\text{MSE}}(y,\hat{y}) = (y-\hat{y})^2. $$

The Bayes-optimal forecast under squared loss is

$$ \hat{y}^*(x) = \mathbb{E}[Y \mid X=x]. $$

So MSE is not "the default regression loss." It is the loss for the conditional mean.
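A minimal numerical check of that claim, assuming only numpy: scan constant forecasts over a grid and confirm that the squared-error minimizer sits at the sample mean, dragged upward by the one large return.

```python
import numpy as np

# Toy return sample; the 0.15 is one large move that pulls the mean upward.
y = np.array([-0.02, 0.01, 0.03, 0.15, 0.00])

# Evaluate every constant forecast on a fine grid.
candidates = np.linspace(-0.05, 0.20, 2501)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)

best = candidates[mse.argmin()]
print(best, y.mean())  # the MSE minimizer coincides with the sample mean, 0.034
```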

Absolute Error Targets the Conditional Median

$$ L_{\text{MAE}}(y,\hat{y}) = |y-\hat{y}|. $$

Its Bayes-optimal target is the conditional median:

$$ \hat{y}^*(x) = \operatorname{Median}(Y \mid X=x). $$

That is why MAE is more robust to a few extreme moves: medians move less than means.
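The same grid-search sketch (numpy assumed, same toy sample) shows the absolute-error minimizer landing on the median, barely moved by the extreme observation:

```python
import numpy as np

y = np.array([-0.02, 0.01, 0.03, 0.15, 0.00])  # same toy sample; median is 0.01

candidates = np.linspace(-0.05, 0.20, 2501)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best = candidates[mae.argmin()]
print(best, np.median(y))  # the MAE minimizer sits at the median, not the mean
```

The 0.15 outlier moved the mean to 0.034 but left the median at 0.01, which is the robustness claim in miniature.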

Huber Clips Tail Influence

Huber loss is quadratic near zero and linear in the tails:

$$ L_\delta(y,\hat{y}) = \begin{cases} \frac{1}{2}(y-\hat{y})^2, & |y-\hat{y}| \le \delta \\ \delta |y-\hat{y}| - \frac{1}{2}\delta^2, & |y-\hat{y}| > \delta. \end{cases} $$

It is a compromise: keep MSE's smoothness for typical residuals, cap the influence of outliers. Huber loss is best understood as a robust M-estimation objective with clipped tail influence, not as cleanly targeting the conditional mean or conditional median in the same way MSE and MAE do.
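The piecewise formula above translates directly into code (a sketch, numpy assumed). Evaluating it on one small and one large residual shows the tail being charged linearly rather than quadratically:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * r - 0.5 * delta**2)

r = np.array([0.5, 5.0])
print(huber(r))     # [0.125, 4.5]: the large residual costs 4.5 under Huber
print(0.5 * r**2)   # [0.125, 12.5]: the same residual under (half) squared error
```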

Log-Loss Targets Calibrated Probabilities

For binary labels \(y \in \{0,1\}\) and predicted probability \(p\),

$$ L_{\log}(y,p) = -y \log p - (1-y)\log(1-p). $$

The optimum is the true conditional event probability:

$$ p^*(x) = \mathbb{P}(Y=1 \mid X=x). $$

That is why log-loss is a proper scoring rule: it rewards honest probabilities, not just correct class labels.
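A quick numerical check of properness (numpy assumed, true probability hypothetical): fix a true event probability and scan reported probabilities; expected log-loss bottoms out exactly at the truth.

```python
import numpy as np

p_true = 0.3  # hypothetical true conditional event probability

grid = np.linspace(0.01, 0.99, 981)  # candidate reported probabilities
expected_loss = -(p_true * np.log(grid) + (1 - p_true) * np.log(1 - grid))

best = grid[expected_loss.argmin()]
print(best)  # the expected-log-loss minimizer is the true probability, 0.30
```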

Pinball Loss Targets a Conditional Quantile

For quantile level \(\tau \in (0,1)\),

$$ L_\tau(y,\hat{y}) = \begin{cases} \tau (y-\hat{y}), & y \ge \hat{y} \\ (1-\tau)(\hat{y}-y), & y < \hat{y}. \end{cases} $$

Its Bayes-optimal target is the \(\tau\)-th conditional quantile. That is why pinball loss is the natural objective when the decision focuses on tails rather than the center of the distribution.
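As a sketch (numpy assumed; seed, sample size, and grid are arbitrary), simulating draws and scanning constant forecasts confirms that the pinball minimizer tracks the empirical \(\tau\)-quantile rather than the center:

```python
import numpy as np

def pinball(y, yhat, tau):
    """Pinball (quantile) loss at quantile level tau."""
    diff = y - yhat
    return np.where(diff >= 0, tau * diff, (tau - 1) * diff)

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)  # stand-in return sample

tau = 0.9
candidates = np.linspace(-3.0, 3.0, 601)
losses = [pinball(y, c, tau).mean() for c in candidates]
best = candidates[np.argmin(losses)]
print(best, np.quantile(y, tau))  # both sit near the N(0,1) 90th percentile, ~1.28
```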

The Decision Table

| Training loss | Targets | Good when | Hides |
| --- | --- | --- | --- |
| MSE | conditional mean | symmetric cost of error, smooth optimization | extreme returns dominate the fit |
| MAE | conditional median | fat tails, stable center | the median can ignore economically important tails |
| Huber | robust center | mostly Gaussian core with occasional shocks | the threshold choice matters |
| Log-loss | calibrated probability | thresholding, probability sizing, risk-aware classification | low log-loss does not guarantee good ranking for long-short selection |
| Pinball loss | conditional quantile | tail forecasting, top/bottom-decile decisions | a single quantile ignores the full distribution |

The key point is that different losses estimate different decision-relevant objects. There is no universal champion.

Evaluation Metrics Are Not Training Losses

A model may be fit with one loss and evaluated with another. That is often correct.

RMSE and MAE

These summarize numeric forecast error:

$$ \operatorname{RMSE} = \sqrt{\frac{1}{n}\sum (y_i-\hat{y}_i)^2}, \qquad \operatorname{MAE} = \frac{1}{n}\sum |y_i-\hat{y}_i|. $$

RMSE is scale-sensitive and punishes large misses heavily. MAE is more stable but less sensitive to tail failures.

Neither has a universal threshold for "good." Both are meaningful only relative to a benchmark and a label definition.
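One illustration of that sensitivity difference, with hypothetical error vectors (numpy assumed): a single large miss moves RMSE far more than MAE.

```python
import numpy as np

# 100 forecast errors of 1% each, then the same vector with one 50% miss.
errors_clean = np.full(100, 0.01)
errors_tail = errors_clean.copy()
errors_tail[0] = 0.50

rmse = lambda e: np.sqrt((e ** 2).mean())
mae = lambda e: np.abs(e).mean()

print(rmse(errors_clean), rmse(errors_tail))  # 0.0100 -> ~0.0510, a fivefold jump
print(mae(errors_clean), mae(errors_tail))    # 0.0100 -> 0.0149, a modest rise
```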

\(R^2\)

$$ R^2 = 1 - \frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}. $$

It measures fit relative to the constant-mean predictor. In noisy financial prediction problems, \(R^2\) can be tiny even when the model is economically useful. A small positive \(R^2\) does not mean uselessness; a negative \(R^2\) means the benchmark beat you.

Accuracy and AUC

Accuracy ignores calibration and confidence. It also depends on the classification threshold. AUC checks pairwise ranking quality across thresholds, which is often better, but AUC can still be irrelevant for a strategy that trades only the most extreme predicted names.

IC

For cross-sectional trading, Spearman IC often matters more than RMSE because the portfolio cares about rank ordering, not absolute return scale. That is why Chapter 11 treats IC as the primary evaluation metric for ranked signals and keeps turnover alongside it.
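A rank IC for one cross-section can be computed with nothing more than numpy, assuming no ties in either vector (a production implementation should handle ties, e.g. via `scipy.stats.spearmanr`; the data here are hypothetical):

```python
import numpy as np

def spearman_ic(pred, realized):
    """Spearman IC: Pearson correlation of the two rank vectors (no-ties case)."""
    rank_p = pred.argsort().argsort()
    rank_r = realized.argsort().argsort()
    return np.corrcoef(rank_p, rank_r)[0, 1]

# Hypothetical one-date cross-section: model scores and realized next-period returns.
pred = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
realized = np.array([0.02, -0.01, 0.00, 0.04, 0.01])
print(spearman_ic(pred, realized))  # 0.9: the ordering is nearly, not perfectly, right
```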

So the metric layer already depends on the trading map:

  • absolute error metrics for forecasting scale
  • rank metrics for cross-sectional sorting
  • probability metrics for classification and calibration

A Worked Mismatch Example

Suppose you predict next-month stock returns for 1,000 names and trade only the top and bottom 50.

Two models:

  • Model A improves MSE by better fitting the middle 900 names
  • Model B leaves MSE unchanged but improves the ordering in the top and bottom tails

If you select with MSE alone, you may choose Model A even though its traded deciles are worse. The problem is not that MSE is "wrong." MSE is a sum over all observations, so improvements in the dense middle of the cross-section dominate the loss even when the traded tails deteriorate: it targets the conditional mean over the full cross-section, while the strategy only cares about tail ordering.
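The "dense middle dominates the sum" point survives back-of-envelope arithmetic. All error magnitudes below are hypothetical: shrink the per-name error on 900 middle names slightly while the 100 traded tail names get 25% worse, and total MSE still falls.

```python
# Hypothetical per-name absolute errors, before and after switching to Model A.
n_mid, n_tail = 900, 100
mid_before, mid_after = 0.010, 0.008    # middle errors shrink a little
tail_before, tail_after = 0.020, 0.025  # traded-tail errors grow 25%

delta_mse = (n_mid * (mid_after**2 - mid_before**2)
             + n_tail * (tail_after**2 - tail_before**2)) / (n_mid + n_tail)
print(delta_mse)  # negative: aggregate MSE improves although the traded tails got worse
```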

The same mismatch appears in classification:

  • a model with better accuracy near the 50% threshold may be worse at ranking the highest-confidence longs and shorts
  • a model with lower log-loss may still create more turnover and worse net PnL

This is why the model-selection layer and the trading-evaluation layer must remain separate.

Why Proper Scoring Rules Matter

A loss is proper if truthful reporting of the target object minimizes expected loss. MSE is proper for the conditional mean functional, while log-loss is a strictly proper scoring rule for the full Bernoulli probability forecast. Accuracy is not proper, because the best way to maximize accuracy is often to report hard labels rather than honest probabilities.

That distinction matters whenever outputs are used for sizing, thresholding, or uncertainty monitoring. If you want probabilities you can trust, optimize and evaluate with probability-aware metrics such as log-loss and calibration diagnostics, not accuracy alone.
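A small expected-value calculation makes the contrast concrete (numpy assumed; the true probability is hypothetical): expected accuracy is flat across every report above 0.5, while expected log-loss is minimized only by the honest probability.

```python
import numpy as np

p_true = 0.7  # hypothetical true event probability
reports = np.array([0.55, 0.70, 0.95])  # candidate reported probabilities

# Thresholding at 0.5: every report above 0.5 predicts class 1, right 70% of the time.
expected_accuracy = np.where(reports > 0.5, p_true, 1 - p_true)

expected_logloss = -(p_true * np.log(reports) + (1 - p_true) * np.log(1 - reports))

print(expected_accuracy)                    # [0.7 0.7 0.7]: accuracy cannot separate them
print(reports[expected_logloss.argmin()])   # 0.7: log-loss rewards the honest report
```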

What the Metrics Hide

Every metric suppresses some dimension of the problem:

  • MSE/RMSE hide whether the errors occur in tradable tails or irrelevant middle names
  • MAE hides whether rare large misses are catastrophic
  • Accuracy hides confidence and class imbalance
  • AUC hides calibration and threshold-specific business value
  • IC hides turnover, capacity, and spread compression
  • Sharpe hides tail asymmetry and path risk

That is why one-number model ranking is dangerous in finance. The right evaluation block is a small set of complementary diagnostics aligned with the decision rule.

In Practice

Use this sequence:

  1. Choose the estimand first: mean, median, quantile, or probability.
  2. Pick a training loss that targets that object.
  3. Pick held-out metrics that reflect the signal-to-trade mapping.
  4. Report turnover and cost sensitivity separately rather than pretending the statistical metric absorbed them.

For Chapter 11-style baselines:

  • use MSE or Huber when return magnitudes matter
  • use MAE when fat tails distort the mean estimate
  • use log-loss when the output is a probability
  • use IC when the strategy trades ranks
  • never confuse a lower training loss with better net trading performance

Common Mistakes

  • Treating training loss, validation metric, and trading objective as the same thing.
  • Using accuracy to judge probability forecasts.
  • Using RMSE as if it directly measured tradability.
  • Declaring MAE "more robust" without acknowledging that it changes the estimand from mean to median.
  • Comparing metrics across different label definitions or sample universes.

Connections

This primer supports Chapter 11's sections on regression, logistic regression, and model evaluation. It connects directly to quantile regression, IC/ICIR, probability calibration, conformal prediction, and the later chapters on transaction costs and portfolio construction where statistical fit finally meets implementation.
