Chapter 10: Text Feature Engineering

Long-Document Encoding for Filings and Transcripts

For long financial documents, the first design decision is not the model. It is how much context you can afford to preserve without mixing together information that arrives or matters at different times.

The Intuition

Financial text is often too long for the simple "one document in, one embedding out" workflow.

A 10-K, earnings-call transcript, or central-bank minutes can exceed the context window of standard encoders by a large margin. Once that happens, every pipeline must answer the same question:

do you truncate, chunk, pool hierarchically, or pay for a long-context model?

That choice changes the feature's meaning.

  • truncation assumes the front of the document carries most of the information
  • chunking assumes local passages can be scored separately and aggregated later
  • hierarchical pooling assumes chunk-level signals matter more than raw token-level continuity
  • long-context models assume preserving global context is worth extra cost and latency

Chapter 10 mentions these strategies. This primer makes the trade-off explicit enough that a reader can choose one on purpose rather than by library default.

The practical destination is a simple decision table: when to truncate, when to chunk, when to pool hierarchically, and when long context is actually worth paying for.

A Decision Table

  Constraint                                      Better default
  tight latency or broad daily coverage           chunking with simple pooling
  signal lives in a few local passages            chunking with max or attention pooling
  signal requires document-wide interaction       long-context encoder
  most useful content is reliably front-loaded    truncation may be acceptable
  high leakage risk from mixed timestamps         section-aware chunking with explicit availability rules

The point of this table is not to claim one winner. It is to force the design to match the signal hypothesis and the operational budget.

The Failure Mode of Truncation

The simplest long-document strategy is to keep the first L tokens and drop the rest.

This can be acceptable when:

  • the document front-loads the key information
  • the task is coarse classification
  • compute is tight and coverage matters more than completeness

But it is dangerous for many financial texts:

  • risk-factor sections often appear late in filings
  • Q&A in earnings calls can carry the most revealing language
  • footnotes may contain the accounting nuance the headline misses
  • boilerplate at the front can crowd out the distinctive content later

Truncation is therefore not just a compute shortcut. It encodes a hypothesis about where the useful information lives.

Chunking and Overlap

A more flexible strategy is to split the document into chunks, encode each chunk, and then combine the results.

The main knobs are:

  • chunk length
  • overlap
  • chunk-level labeling or scoring rule
  • aggregation rule

Short chunks improve local focus but lose continuity. Long chunks preserve context but are more expensive and blur multiple topics together.

Overlap helps when important content straddles a boundary. If chunk k ends with "we expect gross margin pressure to..." and chunk k+1 begins "...ease only in the second half," no-overlap chunking may split the sentence-level meaning. Overlap reduces boundary loss at the cost of repeated tokens and potential double counting.

In financial NLP, overlap is often most useful for:

  • transcript passages with long spoken sentences
  • filings with section boundaries that do not align with token windows
  • extraction tasks where a key phrase may begin near a boundary
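The boundary-straddling example above can be made concrete with a minimal sliding-window chunker. This is an illustrative sketch, not a library API: the function name `chunk_tokens` and the toy token list are invented for this example, and a real pipeline would operate on tokenizer output rather than whitespace words.

```python
def chunk_tokens(tokens, chunk_len, overlap):
    """Split a token list into fixed-length windows with the given overlap."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    stride = chunk_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks

# The straddling sentence from the text: with overlap, the phrase around the
# boundary appears in both windows instead of being split across them.
tokens = ["we", "expect", "gross", "margin", "pressure", "to",
          "ease", "only", "in", "the", "second", "half"]
chunk_tokens(tokens, chunk_len=8, overlap=4)
```

With overlap=0 the sentence is cut mid-thought at the window boundary; with overlap=4 the shared tokens appear in both windows, which is exactly the double-counting trade-off described above.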

Hierarchical Pooling

Once chunks are encoded, you still need a document-level representation.

Common pooling rules include:

  • mean pooling across chunk embeddings
  • max pooling to surface the strongest chunk signal
  • attention pooling that learns which chunks matter most
  • task-specific pooling, such as taking the last management-guidance chunk or the most extreme risk disclosure score

The first three pool in representation space: they combine chunk embeddings before the downstream prediction head. The last pattern pools in score space: it combines chunk-level outputs after inference. Both are useful, but they answer different implementation questions.

This is where many pipelines quietly become hypothesis-laden. Mean pooling assumes the document's signal is diffuse. Max pooling assumes a few extreme passages matter disproportionately. Attention pooling assumes a learned weighting rule can separate salient chunks from boilerplate.

For filings and transcripts, the pooling rule often matters as much as the base encoder. A model may be perfectly good at chunk-level sentiment but fail at document-level signal construction because the aggregation step averages away the interesting part.

When Long-Context Models Earn Their Cost

Long-context encoders preserve more of the original document structure. That can be valuable when:

  • cross-section relationships depend on widely separated passages
  • the task requires linking an early disclosure with a later qualifier
  • you need document-level reasoning rather than passage scoring

But long context is not automatically better.

The costs are real:

  • slower inference
  • higher memory use
  • lower throughput for large universes
  • more complicated deployment if you need consistent daily coverage

In many trading workflows, chunk-plus-pool beats long context because the target is not "deep document understanding." It is a stable, timely feature. If a pooled chunk model captures the same cross-sectional ordering at one-fifth the cost, that is usually the better engineering choice.

So the decision rule is simple:

  • choose long context when document-wide interactions are part of the hypothesis
  • choose chunking and pooling when the signal is mostly local and broad coverage matters

A Worked Scenario

Suppose you want a "management confidence" feature from earnings-call transcripts.

The decision table earlier in the primer says this task should push you toward chunking with learned pooling: the signal lives in a few local passages, coverage matters, and full long-context reasoning is rarely necessary. There are three plausible pipelines.

1. Truncate to the first 512 tokens

This is cheap and, in many transcript formats, captures at least the start of the prepared remarks. It misses the Q&A, where analyst challenge and management hesitation become most visible.

2. Chunk with overlap and mean-pool

Split the transcript into overlapping windows, score each chunk, and average the chunk embeddings or confidence scores. If you average embeddings, you are pooling in representation space before the prediction head. If you average calibrated chunk scores, you are pooling in score space after inference. This captures more of the document, but it can wash out a few highly informative moments.

3. Chunk with overlap and attention-pool

Encode all chunks, then learn a weighting scheme over chunk embeddings. This preserves broad coverage while allowing the model to emphasize the sections that actually move the downstream label. This is representation-space pooling. If your chunk outputs are already calibrated event scores, a score-space rule such as max or weighted averaging can be simpler and more interpretable.
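The score-space alternative mentioned above can be sketched in a few lines. The function name `score_space_pool` and the toy confidence scores are invented for illustration; the point is the wash-out effect of averaging versus max.

```python
def score_space_pool(chunk_scores, rule="max"):
    """Aggregate calibrated chunk-level scores after inference."""
    if rule == "max":
        return max(chunk_scores)
    if rule == "mean":
        return sum(chunk_scores) / len(chunk_scores)
    raise ValueError(f"unknown rule: {rule}")

# Two salient Q&A passages (0.9) surrounded by neutral boilerplate (0.5):
scores = [0.5, 0.5, 0.9, 0.5, 0.9, 0.5]
score_space_pool(scores, "mean")   # ~0.63: the salient moments are diluted
score_space_pool(scores, "max")    # 0.9: the strongest passage survives
```

This is the "averages away the interesting part" failure in miniature: if the hypothesis is that a few passages carry the signal, mean pooling actively works against it.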

For this task, the third option is often the practical sweet spot. It is cheaper than a long-context model, more faithful than truncation, and better aligned with the idea that only some parts of the call carry the real signal.

Time and Leakage Considerations

Long-document pipelines create their own leakage risks.

The main ones are:

  • mixing text that was not available at the same decision time
  • using revised transcripts in historical backtests
  • pooling across sections whose timestamps differ materially
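The first two risks can be enforced mechanically with an availability filter before pooling. The schema here is hypothetical: each chunk is assumed to carry the timestamp at which its text became public, and `available_chunks` simply drops anything not yet known at the decision time.

```python
from datetime import datetime

def available_chunks(chunks, decision_time):
    """Keep only chunks whose text was public at the decision time."""
    return [text for text, available_at in chunks if available_at <= decision_time]

chunks = [
    ("prepared remarks", datetime(2024, 1, 25, 21, 0)),
    ("Q&A session", datetime(2024, 1, 25, 21, 45)),
    ("corrected transcript", datetime(2024, 1, 27, 9, 0)),  # revision, days later
]
available_chunks(chunks, datetime(2024, 1, 25, 22, 0))
# keeps the live remarks and Q&A, drops the later revision
```

Pooling over the filtered list, rather than the full document, is what makes the resulting feature point-in-time safe.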

That is why long-document evaluation should always check:

  • whether the chosen strategy preserves the document regions the hypothesis depends on
  • whether chunk boundaries or truncation alter the downstream rank ordering materially
  • whether the additional compute buys incremental information rather than only prettier embeddings

In Practice

Use these rules:

  • start with chunking and explicit pooling before paying for long context
  • choose overlap only when boundary loss is a real risk, not as a reflex
  • treat pooling as part of the hypothesis, not as a generic plumbing step
  • audit where the retained information comes from: front matter, Q&A, footnotes, or repeated boilerplate
  • measure whether the more expensive document strategy actually improves the downstream feature, not just the document embedding

Common Mistakes

  • Truncating by default and pretending the missing tail does not matter.
  • Using overlap so large that repeated text dominates the aggregate.
  • Averaging chunk signals when the task is driven by a few salient passages.
  • Buying a long-context model for a task that only needs passage-level scoring.
  • Mixing sections with different availability times into one supposedly point-in-time feature.

Connections

This primer supports Chapter 10's practical NLP workflow. It connects directly to self-attention, domain adaptation, point-in-time-safe text pipelines, and coverage-aware evaluation, because long documents become useful features only when the encoding strategy, timestamp logic, and downstream aggregation all fit together.
