Coverage-Aware Evaluation and Event-Time Alignment for Text Signals
A text model is not useful merely because it predicts labels accurately. It is useful only if its signal is available when you trade, on enough names, at the horizon that matters.
The Intuition
Text signals fail in finance for reasons that ordinary NLP benchmarks do not see.
A sentiment classifier may achieve high accuracy on a labeled sentence dataset and still produce a bad trading feature because:
- the coverage is sparse and concentrated in a few names
- the text arrives after the trade decision you are trying to simulate
- the signal decays in hours while you evaluate it on a 20-day horizon
- the model abstains exactly when the market states you care about most are present
That is why Chapter 10 needs a different evaluation frame. The question is not only "does the model classify well?" It is:
when is the signal available, on which observations, and does it add information at the event horizon where the feature is supposed to matter?
For financial text, evaluation is therefore a joint problem of model quality, timing discipline, and coverage geometry.
Why Accuracy Is Not Enough
Suppose a model labels earnings-call sentences as positive or negative. Standard NLP evaluation asks about precision, recall, and F1 on a held-out labeled set. Those numbers are useful, but they are upstream diagnostics.
The downstream feature lives at the asset-date level, not the sentence level. It is created after:
- document arrival
- timestamp validation
- sentence or chunk scoring
- pooling or aggregation
- mapping to the trade horizon
At that point the relevant object is a tradable feature series $x_{i,t}$, not a document classifier. Two models with similar sentence-level F1 can produce very different cross-sectional information coefficients, turnover, and universe coverage.
The evaluation stack should therefore separate four layers:
| Layer | Main question | Typical metric |
|---|---|---|
| model layer | does the text model classify or extract correctly? | F1, AUROC, calibration |
| feature layer | does the aggregated text signal carry incremental information? | IC, spread, conditional IC |
| availability layer | is the signal present on enough names, and when does it first become tradable? | coverage, event-time lag |
| persistence layer | how long does the signal survive, and what turnover does that imply? | decay, horizon IC, turnover |
The chapter already covers the workflow. This primer adds the metric logic that prevents readers from stopping at the first layer.
Coverage Is a First-Class Metric
Coverage is not a nuisance statistic. It determines what kind of strategy a text signal can support.
Let
$$ \text{Coverage}_t = \frac{\#\{i : x_{i,t}\ \text{is available}\}}{\#\{i : i\ \text{is in the investable universe at } t\}}. $$
This can vary for structural reasons:
- only large firms issue enough transcripts or guidance text
- certain sectors disclose differently
- some event types cluster in crisis periods
- extraction pipelines abstain when language is noisy or ambiguous
High average coverage can still be misleading. A signal with 70% average coverage may cover nearly all large-cap tech names and almost none of the rest. For cross-sectional trading, that changes the effective universe and can create hidden style or liquidity tilts.
That is why a proper coverage report should show:
- average and median coverage
- coverage by sector, market cap, and liquidity bucket
- coverage through time
- coverage conditional on the event type that generates the feature
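A coverage report along these lines is easy to compute once the feature and the universe share an asset-date key. The sketch below is illustrative, not a standard API: the column names (`date`, `asset`, `x`, `bucket`) and the long-format layout are assumptions, and abstention is represented as a missing `x` value.

```python
import pandas as pd

def coverage_by_date(signal: pd.DataFrame, universe: pd.DataFrame) -> pd.Series:
    """Per-date coverage: share of investable names with an available signal.

    signal   -- columns ['date', 'asset', 'x']; a missing row or NaN x = abstention
    universe -- columns ['date', 'asset'] listing investable names per date
    """
    merged = universe.merge(signal, on=["date", "asset"], how="left")
    return merged.groupby("date")["x"].apply(lambda s: s.notna().mean())

def coverage_by_group(signal: pd.DataFrame, universe: pd.DataFrame,
                      groups: pd.DataFrame) -> pd.Series:
    """Coverage split by a static grouping (e.g. market-cap or liquidity bucket).

    groups -- columns ['asset', 'bucket']
    """
    merged = universe.merge(signal, on=["date", "asset"], how="left")
    merged = merged.merge(groups, on="asset", how="left")
    return merged.groupby("bucket")["x"].apply(lambda s: s.notna().mean())
```

Run on a toy panel, `coverage_by_date` reproduces the $\text{Coverage}_t$ ratio above, and `coverage_by_group` exposes exactly the large-cap concentration the 70%-average example warns about.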
Coverage-aware evaluation is especially important when comparing simple baselines with richer models. A larger language model may improve document-level quality but reduce usable coverage because it fails, times out, or abstains more often on messy documents.
Event-Time Alignment
Text signals are usually event-driven. The event clock is therefore often more informative than calendar time.
Take an earnings call. The sequence that matters is:
- earnings release timestamp
- call start time
- transcript availability
- feature computation time
- next executable trade decision
If you stamp the feature at the quarter end, at the earnings date, or at the transcript publication time interchangeably, you are changing the experiment. The right timestamp is the first time the feature could actually be known by the strategy.
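That "first time the feature could actually be known" rule can be made concrete in a few lines. This is a deliberately minimal sketch: it assumes a single exchange with a fixed 9:30-16:00 session, ignores holidays and timezones, and treats an in-session arrival as immediately executable. A production version would use a real trading calendar.

```python
from datetime import datetime, time, timedelta

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)

def first_executable_time(available_at: datetime) -> datetime:
    """Map a text-availability timestamp to the first executable trade time.

    Simplified: weekday sessions only, no holiday calendar, one timezone.
    """
    d = available_at
    if d.time() >= MARKET_CLOSE:
        # At or after the close: roll to the next day's open.
        d = datetime.combine(d.date() + timedelta(days=1), MARKET_OPEN)
    elif d.time() < MARKET_OPEN:
        # Before the open: wait for today's open.
        d = datetime.combine(d.date(), MARKET_OPEN)
    # Skip weekends.
    while d.weekday() >= 5:
        d = datetime.combine(d.date() + timedelta(days=1), MARKET_OPEN)
    return d
```

The key design choice is that the mapping is applied to the availability timestamp, not the event timestamp: a transcript published at 17:05 on a Friday stamps to Monday's open, however early the call itself took place.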
For event-time analysis, define the post-event return at horizon h as
$$ r_{i,t \to t+h} = \frac{P_{i,t+h} - P_{i,t}}{P_{i,t}}, $$
where t is the first executable timestamp after the text becomes available, not the nominal event
date. For short horizons, simple returns are usually fine; at longer horizons, some researchers
prefer log returns for cleaner aggregation.
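In code, the event-clock indexing is the whole trick: the price series is indexed by executable sessions, and $t$ is the first index after the text became available. A toy sketch of both return conventions:

```python
import math

def post_event_return(prices: list[float], t: int, h: int) -> float:
    """Simple post-event return r_{t -> t+h} = (P_{t+h} - P_t) / P_t.

    `prices` is indexed on the event clock: index t is the first
    executable session after the text became available.
    """
    return (prices[t + h] - prices[t]) / prices[t]

def post_event_log_return(prices: list[float], t: int, h: int) -> float:
    """Log-return variant, which aggregates additively across sub-horizons."""
    return math.log(prices[t + h] / prices[t])
```

The log variant is "cleaner" in exactly the sense mentioned above: the horizon-2 log return equals the sum of the two horizon-1 log returns, which simple returns only approximate.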
This matters because:
- after-close filings should usually map to next-session execution
- transcripts often arrive with a lag after the live call
- revised articles or vendor backfills can make the text appear earlier than it really was
- event windows that start too early create false foresight
That backfill problem is not a minor bookkeeping detail. If a vendor later replaces partial text with a cleaned full version or corrects timestamps, the historical feature can quietly become better timed and broader than the live signal ever was.
Event-time alignment is therefore not bookkeeping. It is the difference between a tradable signal and a contaminated backtest.
Horizon Matching and Signal Decay
A text signal should be judged at the horizon implied by its mechanism.
Examples:
- guidance tone may matter over days to weeks
- a surprise bankruptcy headline may matter intraday
- a slow-moving narrative drift signal may matter over months
If you evaluate a fast event signal only at a 20-day horizon, you may wash out the effect. If you evaluate a slow disclosure signal only on next-day returns, you may conclude there is no edge when the horizon is simply wrong.
One compact way to summarize decay is to compare information coefficients across horizons:
$$ D(h) = \frac{\operatorname{IC}(h)}{\operatorname{IC}(1)}, $$
where D(h) measures how much of the horizon-1 signal survives out to horizon h. The exact
summary can vary, but the point is always the same: a signal with strong day-1 IC and near-zero
day-10 IC is a different object from one whose information decays slowly over weeks. This ratio is
only sensible when $\operatorname{IC}(1)$ is meaningfully different from zero; otherwise the raw
horizon-by-horizon IC curve is the safer summary.
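A decay profile can be sketched in a few lines. The chapter leaves the exact IC definition open (Pearson vs. rank); this sketch uses a Pearson cross-sectional IC on standardized values, with forward returns supplied per horizon as plain arrays.

```python
import numpy as np

def ic(signal: np.ndarray, fwd_returns: np.ndarray) -> float:
    """Cross-sectional Pearson IC between a signal and forward returns."""
    s = (signal - signal.mean()) / signal.std()
    r = (fwd_returns - fwd_returns.mean()) / fwd_returns.std()
    return float((s * r).mean())

def decay_profile(signal: np.ndarray,
                  returns_by_horizon: dict[int, np.ndarray]) -> dict[int, float]:
    """D(h) = IC(h) / IC(1).

    Only meaningful when IC(1) is clearly nonzero; otherwise report the
    raw IC(h) curve instead of the ratio.
    """
    ic1 = ic(signal, returns_by_horizon[1])
    return {h: ic(signal, r) / ic1 for h, r in returns_by_horizon.items()}
```

In the toy test below, horizon-1 returns track the signal perfectly while horizon-5 returns mix in an orthogonal component, so $D(5)$ drops to $1/\sqrt{2}$: a deterministic stand-in for the "fast day-1, weak day-10" pattern described above.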
A practical evaluation table should therefore report:
- horizon-specific IC or rank correlation
- decile or quantile spreads by horizon
- cumulative event-time response for event-driven signals
- signal half-life or decay pattern
The point is not to search every horizon until something looks good. The point is to test a small set of horizons that match the mechanism and then document which one the design actually supports.
A Worked Example
Suppose you build a feature from earnings-call transcripts that measures management confidence.
The pipeline is:
- use the transcript timestamp, not the fiscal quarter end
- score each chunk with a fine-tuned financial language model
- pool chunk scores into a document-level confidence feature
- trade at the next market open if the transcript arrived after hours
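The pooling step in that pipeline can be as simple as a length-weighted mean of chunk scores. This is one reasonable choice, not the chapter's prescription; the weighting scheme (and the idea that longer chunks should count more) is an assumption of the sketch.

```python
def document_feature(chunk_scores: list[float], chunk_lengths: list[int]) -> float:
    """Pool chunk-level confidence scores into one document-level feature.

    Length-weighted mean: longer, more substantive passages dominate the
    pooled score. Equal weighting or max-pooling are common alternatives.
    """
    total = sum(chunk_lengths)
    if total == 0:
        raise ValueError("empty document")
    return sum(score * n for score, n in zip(chunk_scores, chunk_lengths)) / total
```

For example, a confident 3-unit-long chunk scored 1.0 pooled with a 1-unit chunk scored 0.0 yields a document score of 0.75, whereas an unweighted mean would give 0.5.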
Now compare two evaluation reports.
Minimal report
- sentence classification F1: 0.84
- one average coverage number: 62%
- timestamps anchored to the earnings date rather than transcript availability
- one return horizon: next 20 trading days
This is not a useless report. But it still hides the operational failure modes that determine whether the resulting feature is tradable.
Useful report
- document coverage: 62% overall, 84% in large caps, 29% in small caps
- timestamp audit: transcripts arriving after 4 PM ET mapped to next-session execution
- horizon IC: strongest at 5-day and 10-day windows, weak by 20-day
- event-time response: most of the move occurs between next open and day 3
- conditional coverage: abstention rises sharply in crisis quarters when transcripts are longest and language is most ambiguous
The second report changes the conclusion. The feature may still be useful, but it is not a universal earnings signal. It is a medium-horizon signal concentrated in larger names, with weaker reliability in stressed disclosure regimes.
That is the sort of statement a researcher can actually use.
What Good Evaluation Looks Like
For text signals, a good review combines four questions:
- Was the text available when the strategy claims it was?
- How much of the universe does the signal cover, and where are the holes?
- At which horizons does the signal add information?
- Does the signal survive when tested in event time rather than only in calendar aggregates?
The best evaluation reports usually contain one table and one figure:
- a coverage table by subgroup and period
- an event-time or horizon-response figure showing when the signal actually matters
Those two artifacts do different jobs. The table tells you where the signal exists. The figure tells you when the signal matters. Without both, a text feature can look stronger and broader than it really is.
If either is missing, the research is usually under-specified.
In Practice
Use these rules:
- report model metrics and feature metrics separately
- stamp the feature at first availability, then map to the first executable trade time
- show coverage through time and by subgroup, not just one average number
- test horizons that match the mechanism rather than recycling a default window
- treat abstention and missingness as signal-shaping behavior, not as harmless preprocessing detail
Common Mistakes
- Treating sentence-level F1 as if it were a trading metric.
- Ignoring sparse or uneven coverage across the investable universe.
- Timestamping text at the event date rather than the first usable availability time.
- Evaluating all text factors on the same default return horizon.
- Letting revised text or vendor backfills leak into historical feature construction.
Connections
This primer supports Chapter 10's workflow for turning text into features. It connects directly to point-in-time-safe text pipelines, long-document encoding, domain adaptation, and later chapters on backtesting and model governance where missingness, coverage, and execution timing become operational constraints.