Chapter 10: Text Feature Engineering

Long-Document Encoding for Filings and Transcripts

For long financial documents, the first design decision is not the model. It is how much context you can afford to preserve without mixing together information that arrives or matters at different times.

The Intuition

Financial text is often too long for the simple "one document in, one embedding out" workflow.

A 10-K, earnings-call transcript, or central-bank minutes can exceed the context window of standard encoders by a large margin. Once that happens, every pipeline must answer the same question:

do you truncate, chunk, pool hierarchically, or pay for a long-context model?

That choice changes the feature's meaning.

  • truncation assumes the front of the document carries most of the information
  • chunking assumes local passages can be scored separately and aggregated later
  • hierarchical pooling assumes chunk-level signals matter more than raw token-level continuity
  • long-context models assume preserving global context is worth extra cost and latency

Chapter 10 mentions these strategies. This primer makes the trade-off explicit enough that a reader can choose one on purpose rather than by library default.

The practical destination is a simple decision table: when to truncate, when to chunk, when to pool hierarchically, and when long context is actually worth paying for.

A Decision Table

  Constraint                                      Better default
  tight latency or broad daily coverage           chunking with simple pooling
  signal lives in a few local passages            chunking with max or attention pooling
  signal requires document-wide interaction       long-context encoder
  most useful content is reliably front-loaded    truncation may be acceptable
  high leakage risk from mixed timestamps         section-aware chunking with explicit availability rules

The point of this table is not to claim one winner. It is to force the design to match the signal hypothesis and the operational budget.

The Failure Mode of Truncation

The simplest long-document strategy is to keep the first L tokens and drop the rest.

This can be acceptable when:

  • the document front-loads the key information
  • the task is coarse classification
  • compute is tight and coverage matters more than completeness

But it is dangerous for many financial texts:

  • risk-factor sections often appear late in filings
  • Q&A in earnings calls can carry the most revealing language
  • footnotes may contain the accounting nuance the headline misses
  • boilerplate at the front can crowd out the distinctive content later

Truncation is therefore not just a compute shortcut. It encodes a hypothesis about where the useful information lives.

Chunking and Overlap

A more flexible strategy is to split the document into chunks, encode each chunk, and then combine the results.

The main knobs are:

  • chunk length
  • overlap
  • chunk-level labeling or scoring rule
  • aggregation rule

Short chunks improve local focus but lose continuity. Long chunks preserve context but are more expensive and blur multiple topics together.

Overlap helps when important content straddles a boundary. If chunk k ends with "we expect gross margin pressure to..." and chunk k+1 begins "...ease only in the second half," no-overlap chunking may split the sentence-level meaning. Overlap reduces boundary loss at the cost of repeated tokens and potential double counting.

In financial NLP, overlap is often most useful for:

  • transcript passages with long spoken sentences
  • filings with section boundaries that do not align with token windows
  • extraction tasks where a key phrase may begin near a boundary
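The boundary-straddling example above can be made concrete with a minimal sliding-window chunker. This is an illustrative sketch, not a library API: the function name `chunk_tokens` and the toy token list are invented for this example, and a real pipeline would operate on tokenizer output rather than whitespace words.

```python
def chunk_tokens(tokens, chunk_len, overlap):
    """Split a token list into fixed-length windows with the given overlap."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    stride = chunk_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks

# The straddling sentence from the text: with overlap, the phrase around the
# boundary appears in both windows instead of being split across them.
tokens = ["we", "expect", "gross", "margin", "pressure", "to",
          "ease", "only", "in", "the", "second", "half"]
chunk_tokens(tokens, chunk_len=8, overlap=4)
```

With overlap=0 the sentence is cut mid-thought at the window boundary; with overlap=4 the shared tokens appear in both windows, which is exactly the double-counting trade-off described above.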

Hierarchical Pooling

Once chunks are encoded, you still need a document-level representation.

Common pooling rules include:

  • mean pooling across chunk embeddings
  • max pooling to surface the strongest chunk signal
  • attention pooling that learns which chunks matter most
  • task-specific pooling, such as taking the last management-guidance chunk or the most extreme risk disclosure score

The first three pool in representation space: they combine chunk embeddings before the downstream prediction head. The last pattern pools in score space: it combines chunk-level outputs after inference. Both are useful, but they answer different implementation questions.

This is where many pipelines quietly become hypothesis-laden. Mean pooling assumes the document's signal is diffuse. Max pooling assumes a few extreme passages matter disproportionately. Attention pooling assumes a learned weighting rule can separate salient chunks from boilerplate.

For filings and transcripts, the pooling rule often matters as much as the base encoder. A model may be perfectly good at chunk-level sentiment but fail at document-level signal construction because the aggregation step averages away the interesting part.

When Long-Context Models Earn Their Cost

Long-context encoders preserve more of the original document structure. That can be valuable when:

  • cross-section relationships depend on widely separated passages
  • the task requires linking an early disclosure with a later qualifier
  • you need document-level reasoning rather than passage scoring

But long context is not automatically better.

The costs are real:

  • slower inference
  • higher memory use
  • lower throughput for large universes
  • more complicated deployment if you need consistent daily coverage

In many trading workflows, chunk-plus-pool beats long context because the target is not "deep document understanding." It is a stable, timely feature. If a pooled chunk model captures the same cross-sectional ordering at one-fifth the cost, that is usually the better engineering choice.

So the decision rule is simple:

  • choose long context when document-wide interactions are part of the hypothesis
  • choose chunking and pooling when the signal is mostly local and broad coverage matters

A Worked Scenario

Suppose you want a "management confidence" feature from earnings-call transcripts.

The decision table earlier in the primer says this task should push you toward chunking with learned pooling: the signal lives in a few local passages, coverage matters, and full long-context reasoning is rarely necessary. There are three plausible pipelines.

1. Truncate to the first 512 tokens

This is cheap and, in many transcript formats, captures at least the start of the prepared remarks. It misses the Q&A, where analyst challenge and management hesitation become most visible.

2. Chunk with overlap and mean-pool

Split the transcript into overlapping windows, score each chunk, and average the chunk embeddings or confidence scores. If you average embeddings, you are pooling in representation space before the prediction head. If you average calibrated chunk scores, you are pooling in score space after inference. This captures more of the document, but it can wash out a few highly informative moments.

3. Chunk with overlap and attention-pool

Encode all chunks, then learn a weighting scheme over chunk embeddings. This preserves broad coverage while allowing the model to emphasize the sections that actually move the downstream label. This is representation-space pooling. If your chunk outputs are already calibrated event scores, a score-space rule such as max or weighted averaging can be simpler and more interpretable.
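The score-space alternative mentioned above can be sketched in a few lines. The function name `score_space_pool` and the toy confidence scores are invented for illustration; the point is the wash-out effect of averaging versus max.

```python
def score_space_pool(chunk_scores, rule="max"):
    """Aggregate calibrated chunk-level scores after inference."""
    if rule == "max":
        return max(chunk_scores)
    if rule == "mean":
        return sum(chunk_scores) / len(chunk_scores)
    raise ValueError(f"unknown rule: {rule}")

# Two salient Q&A passages (0.9) surrounded by neutral boilerplate (0.5):
scores = [0.5, 0.5, 0.9, 0.5, 0.9, 0.5]
score_space_pool(scores, "mean")   # ~0.63: the salient moments are diluted
score_space_pool(scores, "max")    # 0.9: the strongest passage survives
```

This is the "averages away the interesting part" failure in miniature: if the hypothesis is that a few passages carry the signal, mean pooling actively works against it.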

For this task, the third option is often the practical sweet spot. It is cheaper than a long-context model, more faithful than truncation, and better aligned with the idea that only some parts of the call carry the real signal.

Time and Leakage Considerations

Long-document pipelines create their own leakage risks.

The main ones are:

  • mixing text that was not available at the same decision time
  • using revised transcripts in historical backtests
  • pooling across sections whose timestamps differ materially
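The first two risks can be enforced mechanically with an availability filter before pooling. The schema here is hypothetical: each chunk is assumed to carry the timestamp at which its text became public, and `available_chunks` simply drops anything not yet known at the decision time.

```python
from datetime import datetime

def available_chunks(chunks, decision_time):
    """Keep only chunks whose text was public at the decision time."""
    return [text for text, available_at in chunks if available_at <= decision_time]

chunks = [
    ("prepared remarks", datetime(2024, 1, 25, 21, 0)),
    ("Q&A session", datetime(2024, 1, 25, 21, 45)),
    ("corrected transcript", datetime(2024, 1, 27, 9, 0)),  # revision, days later
]
available_chunks(chunks, datetime(2024, 1, 25, 22, 0))
# keeps the live remarks and Q&A, drops the later revision
```

Pooling over the filtered list, rather than the full document, is what makes the resulting feature point-in-time safe.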

That is why long-document evaluation should always check:

  • whether the chosen strategy preserves the document regions the hypothesis depends on
  • whether chunk boundaries or truncation alter the downstream rank ordering materially
  • whether the additional compute buys incremental information rather than only prettier embeddings

In Practice

Use these rules:

  • start with chunking and explicit pooling before paying for long context
  • choose overlap only when boundary loss is a real risk, not as a reflex
  • treat pooling as part of the hypothesis, not as a generic plumbing step
  • audit where the retained information comes from: front matter, Q&A, footnotes, or repeated boilerplate
  • measure whether the more expensive document strategy actually improves the downstream feature, not just the document embedding

Common Mistakes

  • Truncating by default and pretending the missing tail does not matter.
  • Using overlap so large that repeated text dominates the aggregate.
  • Averaging chunk signals when the task is driven by a few salient passages.
  • Buying a long-context model for a task that only needs passage-level scoring.
  • Mixing sections with different availability times into one supposedly point-in-time feature.

Connections

This primer supports Chapter 10's practical NLP workflow. It connects directly to self-attention, domain adaptation, point-in-time-safe text pipelines, and coverage-aware evaluation, because long documents become useful features only when the encoding strategy, timestamp logic, and downstream aggregation all fit together.
