Proper Scoring Rules for Financial Event Forecasts
A scoring rule is proper when it rewards honest probability assessments and penalizes hedging -- the mathematical foundation for evaluating any agent or model that outputs probabilities over financial events.
Why This Matters
When an agent or model produces a probability -- "there is a 70% chance this stock reports earnings above consensus" -- you need a principled way to evaluate that number. The obvious question is whether 70% was right, but the deeper question is whether the forecaster was incentivized to say 70% honestly in the first place.
A scoring rule assigns a numerical penalty (or reward) based on the predicted probability and the realized outcome. A scoring rule is proper when the forecaster's expected score is optimized by reporting their true belief distribution [ref:UBIF9SJA]. If the rule is not proper, rational forecasters will game it by misreporting, and your evaluation becomes meaningless.
Multi-agent forecasting systems -- where specialist agents produce event probabilities and a supervisor aggregates them -- depend critically on proper scoring. The AIA Forecaster uses Brier score as its primary performance measure [ref:WK2649QZ], and ForecastBench evaluates LLM forecasters against human superforecasters using the same metric [ref:JTPNE6VU]. Without properness, the entire calibration and aggregation pipeline collapses. This primer covers the theory of proper scoring; calibration techniques and aggregation methods are covered in primers 02 and 03.
Intuition
Imagine you believe an event has probability 0.7. Under a proper scoring rule, stating 0.7 minimizes your expected loss. Stating 0.8 to appear more decisive, or 0.5 to hedge, both make your expected score worse. The rule mechanically enforces honesty by aligning the forecaster's self-interest with truthful reporting.
This is the same logic behind incentive-compatible mechanisms in economics: design the rules so that the optimal strategy for the participant is to reveal their private information honestly.
Formal Core
A scoring rule $S(p, y)$ takes a predicted probability $p$ for a binary event and the realized outcome $y \in \{0,1\}$, returning a score where lower values are better (loss convention).
Properness. A scoring rule is proper if, for any true probability $q$,
$$ \mathbb{E}_{y \sim q}[S(q, y)] \leq \mathbb{E}_{y \sim q}[S(p, y)] \quad \text{for all } p. $$
It is strictly proper if equality holds only when $p = q$. Strict properness means there is a unique optimum at truthful reporting -- no ties, no room for gaming.
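A quick numerical sketch makes strict properness concrete. Assuming an illustrative true belief of $q = 0.7$, sweeping over all possible reports shows the expected Brier loss bottoming out exactly at the truthful report:

```python
# Minimal sketch: expected Brier loss is minimized at the truthful report.
# The true probability q = 0.7 is an illustrative value.
import numpy as np

q = 0.7  # forecaster's true belief
reports = np.linspace(0.01, 0.99, 99)  # candidate reports on a 0.01 grid

# Expected loss of reporting p when the event occurs with probability q:
# E[S(p, y)] = q*(p - 1)^2 + (1 - q)*p^2
expected_loss = q * (reports - 1) ** 2 + (1 - q) * reports ** 2

best = reports[np.argmin(expected_loss)]
print(f"loss-minimizing report: {best:.2f}")  # the true belief, 0.70
```

Repeating the sweep with an improper rule such as absolute error $|p - y|$ would instead push the optimal report to 0 or 1, which is exactly the gaming behavior properness rules out.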
Three canonical rules
Brier score (quadratic):
$$ \text{BS}(p, y) = (p - y)^2. $$
Simple, bounded on $[0,1]$, and decomposes cleanly. This is the workhorse for binary event forecasting in both the AIA Forecaster [ref:WK2649QZ] and ForecastBench [ref:JTPNE6VU].
Logarithmic score:
$$ \text{LS}(p, y) = -[y \ln p + (1-y) \ln(1-p)]. $$
Penalizes confident wrong forecasts much more harshly than the Brier score. A prediction of $p = 0.01$ for an event that occurs receives a log score of $-\ln(0.01) \approx 4.6$, while the Brier score is only $(0.01-1)^2 = 0.98$. This makes the log score more sensitive to overconfidence in the tails.
Continuous ranked probability score (CRPS): for continuous outcomes, CRPS generalizes the Brier score to full predictive distributions:
$$ \text{CRPS}(F, y) = \int_{-\infty}^{\infty} [F(z) - \mathbf{1}(z \geq y)]^2 \, dz, $$
where $F$ is the predicted CDF. CRPS reduces to the Brier score for binary events and is the natural choice when agents produce distributional forecasts over returns or volatility [ref:UBIF9SJA].
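The three rules can be sketched as loss functions in a few lines. This is an illustrative sketch, not a reference implementation: the CRPS uses the sample-based identity $\text{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ for draws $X, X' \sim F$ rather than the integral form, and the normal predictive distribution is a stand-in for an agent's return forecast:

```python
# Sketch of the three canonical rules as losses (lower is better).
import numpy as np

def brier(p, y):
    return (p - y) ** 2

def log_score(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def crps_samples(samples, y):
    # Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

# Confident wrong forecast: log score explodes, Brier stays bounded.
print(brier(0.01, 1))      # 0.9801
print(log_score(0.01, 1))  # about 4.6

rng = np.random.default_rng(0)
draws = rng.normal(loc=0.0, scale=1.0, size=2000)  # stand-in predictive dist
print(crps_samples(draws, 0.5))
```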
Brier decomposition
The Brier score decomposes into calibration (reliability) and resolution (sharpness):
$$ \text{BS} = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k(\bar{p}_k - \bar{y}_k)^2}_{\text{calibration}} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k(\bar{y}_k - \bar{y})^2}_{\text{resolution}} + \bar{y}(1-\bar{y}), $$
where forecasts are grouped into $K$ bins, $\bar{p}_k$ is the mean predicted probability in bin $k$, $\bar{y}_k$ is the observed frequency, and $n_k$ is the bin count.
- Calibration measures how well predicted probabilities match observed frequencies. Lower is better.
- Resolution measures how much the forecaster separates events from non-events. Higher is better; it enters with a negative sign, so greater resolution lowers the total score.
- The third term is the uncertainty of the outcome itself -- not under the forecaster's control.
A forecast can be perfectly calibrated but uninformative (always predicting the base rate), or sharp but poorly calibrated (confident but systematically wrong). Good forecasting requires both.
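The decomposition above can be sketched directly from forecast-outcome pairs. This is a minimal illustration (the bin count and the synthetic, perfectly calibrated forecaster are assumptions for the demo); note that with continuous forecasts the binned identity holds only up to a small within-bin variance term:

```python
# Sketch of the Brier (Murphy) decomposition over fixed probability bins.
import numpy as np

def brier_decomposition(p, y, n_bins=5):
    p, y = np.asarray(p, float), np.asarray(y, float)
    N = len(p)
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)  # bin index per forecast
    y_bar = y.mean()
    cal = res = 0.0
    for k in range(n_bins):
        mask = idx == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        p_k, y_k = p[mask].mean(), y[mask].mean()
        cal += n_k * (p_k - y_k) ** 2   # calibration (reliability) term
        res += n_k * (y_k - y_bar) ** 2  # resolution term
    unc = y_bar * (1 - y_bar)            # outcome uncertainty
    return cal / N, res / N, unc

# Synthetic calibrated forecaster: outcomes drawn at the stated probabilities.
rng = np.random.default_rng(1)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
cal, res, unc = brier_decomposition(p, y)
print(cal, res, unc)
# Sanity check: cal - res + unc approximates the raw Brier score
# (exact only when forecasts within a bin are identical).
print(np.mean((p - y) ** 2))
```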
Elicitability and Why It Matters for Risk
Strict properness connects to a deeper property called elicitability: a statistical functional is elicitable if there exists a strictly consistent scoring function for it. The mean, median, and quantiles (including VaR) are all elicitable. But Expected Shortfall (CVaR) is not -- there is no scoring function that makes truthful CVaR reporting uniquely optimal [ref:UBIF9SJA].
The practical consequence for risk management is significant: you can run model-comparison tournaments for VaR forecasters using strictly proper scores, but you cannot do the same for CVaR forecasters without additional structure. This is one reason regulatory backtesting frameworks still center on VaR exceedance tests.
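For concreteness, the strictly consistent scoring function that makes quantiles (and hence VaR) elicitable is the pinball loss. The sketch below is illustrative (the standard-normal P&L distribution and the candidate grid are assumptions): minimizing the pinball loss at $\tau = 0.05$ recovers the 5% quantile.

```python
# Sketch: pinball (quantile) loss, the consistent score for VaR-as-quantile.
import numpy as np

def pinball_loss(var_forecast, realized, tau):
    """Strictly consistent score for the tau-quantile; lower is better."""
    u = realized - var_forecast
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

rng = np.random.default_rng(2)
pnl = rng.normal(size=50_000)  # illustrative stand-in P&L distribution
tau = 0.05

candidates = np.linspace(-3, 0, 301)
scores = [pinball_loss(c, pnl, tau) for c in candidates]
best = candidates[np.argmin(scores)]
print(best)  # close to the true 5% quantile of N(0,1), about -1.645
```

No analogous scoring function exists for CVaR alone, which is the elicitability gap described above; known workarounds score (VaR, CVaR) jointly.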
Worked Example
An agent produces daily up/down probability forecasts for 100 trading days. Group the predictions into five bins:
| Bin (predicted $p$) | Count $n_k$ | Mean predicted $\bar{p}_k$ | Observed freq $\bar{y}_k$ | Calibration term |
|---|---|---|---|---|
| 0.0 -- 0.2 | 15 | 0.12 | 0.13 | $15 \times (0.12-0.13)^2 = 0.0015$ |
| 0.2 -- 0.4 | 20 | 0.30 | 0.25 | $20 \times (0.30-0.25)^2 = 0.050$ |
| 0.4 -- 0.6 | 30 | 0.50 | 0.53 | $30 \times (0.50-0.53)^2 = 0.027$ |
| 0.6 -- 0.8 | 25 | 0.70 | 0.72 | $25 \times (0.70-0.72)^2 = 0.010$ |
| 0.8 -- 1.0 | 10 | 0.88 | 0.80 | $10 \times (0.88-0.80)^2 = 0.064$ |
The calibration component is $(0.0015+0.050+0.027+0.010+0.064)/100 = 0.1525/100 \approx 0.0015$. The agent is reasonably well calibrated but imperfect, with the largest error in the high-confidence bin, where it overstates the probability of the event occurring. A reliability diagram plotting $\bar{p}_k$ against $\bar{y}_k$ would show most points close to the 45-degree diagonal, with the top-right point below the line.
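The table's calibration arithmetic is small enough to verify directly:

```python
# Reproduce the worked example's per-bin calibration terms and total.
import numpy as np

n_k   = np.array([15, 20, 30, 25, 10])          # bin counts
p_bar = np.array([0.12, 0.30, 0.50, 0.70, 0.88])  # mean predicted probability
y_bar = np.array([0.13, 0.25, 0.53, 0.72, 0.80])  # observed frequency

terms = n_k * (p_bar - y_bar) ** 2
calibration = terms.sum() / n_k.sum()
print(np.round(terms, 4))   # per-bin contributions, matching the table
print(calibration)          # about 0.0015
```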
Why Improper Metrics Cause Real Damage
Armstrong and Collopy (1992) show empirically that popular error metrics like RMSE and MAPE can systematically select biased forecasters when the decision objective does not align with the metric [ref:9I5QM7G2]. MAPE, for instance, penalizes over-forecasting more than under-forecasting for positive quantities, so minimizing MAPE can produce chronic under-forecasters that "win" the competition while being useless for decisions that depend on the upper tail. In an agent system, using an improper metric to evaluate and weight forecasters silently corrupts the aggregation step that follows.
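A quick simulation makes the MAPE bias visible. Assuming an illustrative lognormal positive quantity and a constant forecast, the MAPE-minimizing constant lands far below both the mean and the median of the actuals:

```python
# Sketch: MAPE's asymmetry. Under-forecasting a positive quantity can never
# cost more than 100% error, while over-forecast error is unbounded, so
# MAPE-optimal forecasts are biased low. Data are illustrative.
import numpy as np

rng = np.random.default_rng(3)
actuals = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def mape(forecast):
    return np.mean(np.abs(forecast - actuals) / actuals)

candidates = np.linspace(0.2, 3.0, 281)
best = candidates[np.argmin([mape(c) for c in candidates])]
print(best, np.median(actuals), actuals.mean())
# The MAPE-optimal constant sits well below both the median and the mean.
```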
Practical Guidance
- Use the Brier score for binary event forecasts. It is strictly proper, bounded, and decomposable -- which is why both the AIA Forecaster and ForecastBench adopt it [ref:WK2649QZ] [ref:JTPNE6VU].
- Use the log score when you want to penalize tail overconfidence heavily, but be aware that it is unbounded and a single confident-wrong prediction can dominate.
- Use CRPS for distributional forecasts over continuous outcomes such as returns or volatility.
- Never rank probabilistic forecasters using accuracy, RMSE, or MAPE -- these are improper and reward strategic misreporting.
- Decompose the Brier score into calibration and resolution before drawing conclusions. An agent with low Brier score could be well-calibrated but uninformative, or sharp but miscalibrated.
Where It Fits in ML4T
This primer provides the evaluation foundation for Chapter 24's multi-agent forecasting system. Scoring rules determine how individual agent forecasts are assessed (this primer), calibration techniques correct systematic biases (primer 02), and aggregation methods combine multiple forecasts into a consensus (primer 03). The Brier decomposition also connects to the reliability diagrams used in Chapter 11's model evaluation pipeline and the backtesting integrity concerns in Chapter 16, where improper metrics can mask overfitting.