Coverage-Aware Evaluation and Event-Time Alignment for Text Signals
A text model is not useful merely because it predicts labels accurately. It is useful only if its signal is available when you trade, on enough names, at the horizon that matters.
The Intuition
Text signals fail in finance for reasons that ordinary NLP benchmarks do not see.
A sentiment classifier may achieve high accuracy on a labeled sentence dataset and still produce a bad trading feature because:
- the coverage is sparse and concentrated in a few names
- the text arrives after the trade decision you are trying to simulate
- the signal decays in hours while you evaluate it on a 20-day horizon
- the model abstains exactly when the market states you care about most are present
That is why Chapter 10 needs a different evaluation frame. The question is not only "does the model classify well?" It is:
when is the signal available, on which observations, and does it add information at the event horizon where the feature is supposed to matter?
For financial text, evaluation is therefore a joint problem of model quality, timing discipline, and coverage geometry.
Why Accuracy Is Not Enough
Suppose a model labels earnings-call sentences as positive or negative. Standard NLP evaluation asks about precision, recall, and F1 on a held-out labeled set. Those numbers are useful, but they are upstream diagnostics.
The downstream feature lives at the asset-date level, not the sentence level. It is created after:
- document arrival
- timestamp validation
- sentence or chunk scoring
- pooling or aggregation
- mapping to the trade horizon
At that point the relevant object is a tradable feature series $x_{i,t}$, not a document classifier. Two models with similar sentence-level F1 can produce very different cross-sectional information coefficients, turnover, and universe coverage.
The evaluation stack should therefore separate four layers:
| Layer | Main question | Typical metric |
|---|---|---|
| model layer | does the text model classify or extract correctly? | F1, AUROC, calibration |
| feature layer | does the aggregated text signal carry incremental information? | IC, spread, conditional IC |
| availability layer | is the signal present on enough names, and when does it first become tradable? | coverage, event-time lag |
| persistence layer | how long does the signal survive, and what turnover does that imply? | decay, horizon IC, turnover |
The chapter already covers the workflow. This primer adds the metric logic that prevents readers from stopping at the first layer.
Coverage Is a First-Class Metric
Coverage is not a nuisance statistic. It determines what kind of strategy a text signal can support.
Let
$$ \text{Coverage}_t = \frac{\#\{i : x_{i,t}\ \text{is available}\}}{\#\{i : i\ \text{is in the investable universe at } t\}}. $$
This can vary for structural reasons:
- only large firms issue enough transcripts or guidance text
- certain sectors disclose differently
- some event types cluster in crisis periods
- extraction pipelines abstain when language is noisy or ambiguous
High average coverage can still be misleading. A signal with 70% average coverage may cover nearly all large-cap tech names and almost none of the rest. For cross-sectional trading, that changes the effective universe and can create hidden style or liquidity tilts.
That is why a proper coverage report should show:
- average and median coverage
- coverage by sector, market cap, and liquidity bucket
- coverage through time
- coverage conditional on the event type that generates the feature
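A coverage report along these lines is easy to compute once the feature and the universe share an asset-date key. The sketch below is illustrative, not a standard API: the column names (`date`, `asset`, `x`, `bucket`) and the long-format layout are assumptions, and abstention is represented as a missing `x` value.

```python
import pandas as pd

def coverage_by_date(signal: pd.DataFrame, universe: pd.DataFrame) -> pd.Series:
    """Per-date coverage: share of investable names with an available signal.

    signal   -- columns ['date', 'asset', 'x']; a missing row or NaN x = abstention
    universe -- columns ['date', 'asset'] listing investable names per date
    """
    merged = universe.merge(signal, on=["date", "asset"], how="left")
    return merged.groupby("date")["x"].apply(lambda s: s.notna().mean())

def coverage_by_group(signal: pd.DataFrame, universe: pd.DataFrame,
                      groups: pd.DataFrame) -> pd.Series:
    """Coverage split by a static grouping (e.g. market-cap or liquidity bucket).

    groups -- columns ['asset', 'bucket']
    """
    merged = universe.merge(signal, on=["date", "asset"], how="left")
    merged = merged.merge(groups, on="asset", how="left")
    return merged.groupby("bucket")["x"].apply(lambda s: s.notna().mean())
```

Run on a toy panel, `coverage_by_date` reproduces the $\text{Coverage}_t$ ratio above, and `coverage_by_group` exposes exactly the large-cap concentration the 70%-average example warns about.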
Coverage-aware evaluation is especially important when comparing simple baselines with richer models. A larger language model may improve document-level quality but reduce usable coverage because it fails, times out, or abstains more often on messy documents.
Event-Time Alignment
Text signals are usually event-driven. The event clock is therefore often more informative than calendar time.
Take an earnings call. The sequence that matters is:
- earnings release timestamp
- call start time
- transcript availability
- feature computation time
- next executable trade decision
If you stamp the feature at the quarter end, at the earnings date, or at the transcript publication time interchangeably, you are changing the experiment. The right timestamp is the first time the feature could actually be known by the strategy.
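That "first time the feature could actually be known" rule can be made concrete in a few lines. This is a deliberately minimal sketch: it assumes a single exchange with a fixed 9:30-16:00 session, ignores holidays and timezones, and treats an in-session arrival as immediately executable. A production version would use a real trading calendar.

```python
from datetime import datetime, time, timedelta

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)

def first_executable_time(available_at: datetime) -> datetime:
    """Map a text-availability timestamp to the first executable trade time.

    Simplified: weekday sessions only, no holiday calendar, one timezone.
    """
    d = available_at
    if d.time() >= MARKET_CLOSE:
        # At or after the close: roll to the next day's open.
        d = datetime.combine(d.date() + timedelta(days=1), MARKET_OPEN)
    elif d.time() < MARKET_OPEN:
        # Before the open: wait for today's open.
        d = datetime.combine(d.date(), MARKET_OPEN)
    # Skip weekends.
    while d.weekday() >= 5:
        d = datetime.combine(d.date() + timedelta(days=1), MARKET_OPEN)
    return d
```

The key design choice is that the mapping is applied to the availability timestamp, not the event timestamp: a transcript published at 17:05 on a Friday stamps to Monday's open, however early the call itself took place.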
For event-time analysis, define the post-event return at horizon h as
$$ r_{i,t \to t+h} = \frac{P_{i,t+h} - P_{i,t}}{P_{i,t}}, $$
where t is the first executable timestamp after the text becomes available, not the nominal event
date. For short horizons, simple returns are usually fine; at longer horizons, some researchers
prefer log returns for cleaner aggregation.
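In code, the event-clock indexing is the whole trick: the price series is indexed by executable sessions, and $t$ is the first index after the text became available. A toy sketch of both return conventions:

```python
import math

def post_event_return(prices: list[float], t: int, h: int) -> float:
    """Simple post-event return r_{t -> t+h} = (P_{t+h} - P_t) / P_t.

    `prices` is indexed on the event clock: index t is the first
    executable session after the text became available.
    """
    return (prices[t + h] - prices[t]) / prices[t]

def post_event_log_return(prices: list[float], t: int, h: int) -> float:
    """Log-return variant, which aggregates additively across sub-horizons."""
    return math.log(prices[t + h] / prices[t])
```

The log variant is "cleaner" in exactly the sense mentioned above: the horizon-2 log return equals the sum of the two horizon-1 log returns, which simple returns only approximate.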
This matters because:
- after-close filings should usually map to next-session execution
- transcripts often arrive with a lag after the live call
- revised articles or vendor backfills can make the text appear earlier than it really was
- event windows that start too early create false foresight
That backfill problem is not a minor bookkeeping detail. If a vendor later replaces partial text with a cleaned full version or corrects timestamps, the historical feature can quietly become better timed and broader than the live signal ever was.
Event-time alignment is therefore not bookkeeping. It is the difference between a tradable signal and a contaminated backtest.
Horizon Matching and Signal Decay
A text signal should be judged at the horizon implied by its mechanism.
Examples:
- guidance tone may matter over days to weeks
- a surprise bankruptcy headline may matter intraday
- a slow-moving narrative drift signal may matter over months
If you evaluate a fast event signal only at a 20-day horizon, you may wash out the effect. If you evaluate a slow disclosure signal only on next-day returns, you may conclude there is no edge when the horizon is simply wrong.
One compact way to summarize decay is to compare information coefficients across horizons:
$$ D(h) = \frac{\operatorname{IC}(h)}{\operatorname{IC}(1)}, $$
where D(h) measures how much of the horizon-1 signal survives out to horizon h. The exact
summary can vary, but the point is always the same: a signal with strong day-1 IC and near-zero
day-10 IC is a different object from one whose information decays slowly over weeks. This ratio is
only sensible when $\operatorname{IC}(1)$ is meaningfully different from zero; otherwise the raw
horizon-by-horizon IC curve is the safer summary.
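A decay profile can be sketched in a few lines. The chapter leaves the exact IC definition open (Pearson vs. rank); this sketch uses a Pearson cross-sectional IC on standardized values, with forward returns supplied per horizon as plain arrays.

```python
import numpy as np

def ic(signal: np.ndarray, fwd_returns: np.ndarray) -> float:
    """Cross-sectional Pearson IC between a signal and forward returns."""
    s = (signal - signal.mean()) / signal.std()
    r = (fwd_returns - fwd_returns.mean()) / fwd_returns.std()
    return float((s * r).mean())

def decay_profile(signal: np.ndarray,
                  returns_by_horizon: dict[int, np.ndarray]) -> dict[int, float]:
    """D(h) = IC(h) / IC(1).

    Only meaningful when IC(1) is clearly nonzero; otherwise report the
    raw IC(h) curve instead of the ratio.
    """
    ic1 = ic(signal, returns_by_horizon[1])
    return {h: ic(signal, r) / ic1 for h, r in returns_by_horizon.items()}
```

In the toy test below, horizon-1 returns track the signal perfectly while horizon-5 returns mix in an orthogonal component, so $D(5)$ drops to $1/\sqrt{2}$: a deterministic stand-in for the "fast day-1, weak day-10" pattern described above.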
A practical evaluation table should therefore report:
- horizon-specific IC or rank correlation
- decile or quantile spreads by horizon
- cumulative event-time response for event-driven signals
- signal half-life or decay pattern
The point is not to search every horizon until something looks good. The point is to test a small set of horizons that match the mechanism and then document which one the design actually supports.
A Worked Example
Suppose you build a feature from earnings-call transcripts that measures management confidence.
The pipeline is:
- use the transcript timestamp, not the fiscal quarter end
- score each chunk with a fine-tuned financial language model
- pool chunk scores into a document-level confidence feature
- trade at the next market open if the transcript arrived after hours
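The pooling step in that pipeline can be as simple as a length-weighted mean of chunk scores. This is one reasonable choice, not the chapter's prescription; the weighting scheme (and the idea that longer chunks should count more) is an assumption of the sketch.

```python
def document_feature(chunk_scores: list[float], chunk_lengths: list[int]) -> float:
    """Pool chunk-level confidence scores into one document-level feature.

    Length-weighted mean: longer, more substantive passages dominate the
    pooled score. Equal weighting or max-pooling are common alternatives.
    """
    total = sum(chunk_lengths)
    if total == 0:
        raise ValueError("empty document")
    return sum(score * n for score, n in zip(chunk_scores, chunk_lengths)) / total
```

For example, a confident 3-unit-long chunk scored 1.0 pooled with a 1-unit chunk scored 0.0 yields a document score of 0.75, whereas an unweighted mean would give 0.5.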
Now compare two evaluation reports.
Minimal report
- sentence classification F1: 0.84
- one average coverage number: 62%
- timestamps anchored to the earnings date rather than transcript availability
- one return horizon: next 20 trading days
This is not a useless report. But it still hides the operational failure modes that determine whether the resulting feature is tradable.
Useful report
- document coverage: 62% overall, 84% in large caps, 29% in small caps
- timestamp audit: transcripts arriving after 4 PM ET mapped to next-session execution
- horizon IC: strongest at 5-day and 10-day windows, weak by 20-day
- event-time response: most of the move occurs between next open and day 3
- conditional coverage: abstention rises sharply in crisis quarters when transcripts are longest and language is most ambiguous
The second report changes the conclusion. The feature may still be useful, but it is not a universal earnings signal. It is a medium-horizon signal concentrated in larger names, with weaker reliability in stressed disclosure regimes.
That is the sort of statement a researcher can actually use.
What Good Evaluation Looks Like
For text signals, a good review combines four questions:
- Was the text available when the strategy claims it was?
- How much of the universe does the signal cover, and where are the holes?
- At which horizons does the signal add information?
- Does the signal survive when tested in event time rather than only in calendar aggregates?
The best evaluation reports usually contain one table and one figure:
- a coverage table by subgroup and period
- an event-time or horizon-response figure showing when the signal actually matters
Those two artifacts do different jobs. The table tells you where the signal exists. The figure tells you when the signal matters. Without both, a text feature can look stronger and broader than it really is.
If either is missing, the research is usually under-specified.
In Practice
Use these rules:
- report model metrics and feature metrics separately
- stamp the feature at first availability, then map to the first executable trade time
- show coverage through time and by subgroup, not just one average number
- test horizons that match the mechanism rather than recycling a default window
- treat abstention and missingness as signal-shaping behavior, not as harmless preprocessing detail
Common Mistakes
- Treating sentence-level F1 as if it were a trading metric.
- Ignoring sparse or uneven coverage across the investable universe.
- Timestamping text at the event date rather than the first usable availability time.
- Evaluating all text factors on the same default return horizon.
- Letting revised text or vendor backfills leak into historical feature construction.
Connections
This primer supports Chapter 10's workflow for turning text into features. It connects directly to point-in-time-safe text pipelines, long-document encoding, domain adaptation, and later chapters on backtesting and model governance where missingness, coverage, and execution timing become operational constraints.