Chapter 10: Text Feature Engineering

When Long-Context Encoders Are Worth the Cost

The decision between chunking and full-context encoding is a cost-accuracy tradeoff governed by document structure and task type -- principles that outlast any specific architecture generation.

Why This Matters

Financial NLP pipelines face a recurring infrastructure question: should you chunk a 40,000-token 10-K filing into pieces, encode each chunk independently, and pool the results (Primer 04), or should you use a long-context encoder that processes the entire document in a single pass? The first approach is cheaper but loses cross-section dependencies. The second preserves global coherence but costs substantially more per document.

Most treatments mention long-document encoding options but do not provide a decision framework for when the additional cost is justified. This primer fills that gap with a cost-modeling approach that helps practitioners decide before committing to infrastructure. The framework is deliberately architecture-agnostic: the quadratic attention bottleneck, sparse-attention principle, and cost-accuracy tradeoff are durable concepts that apply regardless of which specific models are available.

Intuition

Standard self-attention lets every token attend to every other token, which is what gives transformers their power over sequential models. But this "everyone talks to everyone" pattern has a cost: for a sequence of $n$ tokens, the model computes $n^2$ attention scores. Double the sequence length and you quadruple the compute.

A 512-token earnings transcript excerpt requires roughly 262,000 attention computations. A 40,000-token 10-K filing requires 1.6 billion -- a 6,000-fold increase. This is why vanilla transformers truncate long documents rather than processing them whole.
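Those counts are easy to verify directly; a quick arithmetic check of the quadratic scaling:

```python
# Quadratic attention scaling: n tokens -> n^2 pairwise attention scores
def attention_scores(n_tokens: int) -> int:
    return n_tokens ** 2

excerpt = attention_scores(512)     # 512-token transcript excerpt: 262,144
filing = attention_scores(40_000)   # 40,000-token 10-K filing: 1.6 billion
print(excerpt, filing, filing / excerpt)
```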

Long-context architectures break this barrier by restricting which tokens can attend to which. Instead of every token attending to every other, each token attends to its local neighborhood (a sliding window) plus a handful of designated global tokens that aggregate information across the full document. The intuition is that most meaning is local -- a sentence about revenue growth primarily needs its surrounding paragraph -- but some meaning is global, like detecting contradictions between the risk factors section and the MD&A.

Formal Core

Standard self-attention computes, for a sequence of $n$ tokens with embedding dimension $d$, the attention matrix:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

where $Q, K, V \in \mathbb{R}^{n \times d}$. The $QK^\top$ product requires $O(n^2 d)$ computation and $O(n^2)$ memory for the attention weights, as noted by Vaswani et al. (2017) in the original transformer paper [ref:HV78MUEQ].
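The formula maps directly to a few lines of NumPy; a minimal single-head sketch with random $Q$, $K$, $V$ for illustration (the $(n, n)$ score matrix is the quadratic bottleneck):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V for Q, K, V of shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) -- O(n^2) cost
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (8, 16)
```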

Sparse attention patterns reduce this cost by restricting the set of key-value pairs each query attends to. Let $\mathcal{S}(i) \subseteq \{1, \ldots, n\}$ denote the attention neighborhood for token $i$. If $|\mathcal{S}(i)| = w$ for all $i$ (a fixed-width local window), the cost drops to $O(nwd)$ -- linear in sequence length. The three common sparse patterns are:

  • Local sliding window: $\mathcal{S}(i) = \{j : |i - j| \leq w/2\}$, capturing nearby context.
  • Global tokens: a small set of $g$ designated positions that attend to all tokens and are attended to by all tokens, providing document-level aggregation at cost $O(gn)$.
  • Dilated attention: windows with gaps, extending the effective receptive field without increasing window size.
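The three neighborhoods above can be combined into a single boolean attention mask; a minimal sketch with hypothetical parameters (`window`, `n_global`, `dilation` are illustrative names, not any specific model's API):

```python
import numpy as np

def sparse_attention_mask(n, window=4, n_global=1, dilation=1):
    """Boolean mask: mask[i, j] is True if token i may attend to token j.

    Combines a (possibly dilated) sliding window with n_global global
    tokens placed at the start of the sequence.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    offset = j - i
    local = (np.abs(offset) <= window // 2) & (offset % dilation == 0)
    is_global = (i < n_global) | (j < n_global)  # global tokens see and are seen by all
    return local | is_global

mask = sparse_attention_mask(n=16, window=4, n_global=2)
print(mask.sum(), "allowed pairs out of", 16 * 16)
```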

The practical cost per document is approximately:

$$\text{Cost}(n) \approx \begin{cases} \alpha \cdot n^2 \cdot d \cdot \ell & \text{full attention} \\ \alpha \cdot n \cdot (w + g) \cdot d \cdot \ell & \text{sparse attention} \end{cases}$$

where $\ell$ is the number of layers and $\alpha$ captures hardware-specific throughput. The ratio of sparse to full cost is $(w + g) / n$, which for a 40,000-token document with $w = 512$ and $g = 64$ is roughly 1.4% of the full-attention cost.
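The 1.4% figure follows directly from that ratio; a quick check under the stated parameters:

```python
# Sparse-to-full cost ratio (w + g) / n for the 40,000-token 10-K example
n, w, g = 40_000, 512, 64
ratio = (w + g) / n
print(f"sparse attention costs {ratio:.1%} of full attention")  # 1.4%
```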

How It Works in Practice

When chunk-and-pool wins

For tasks where document-level coherence matters less than section-level signals, hierarchical chunk-then-pool (Primer 04) often matches or beats a single long-context pass at a fraction of the cost. Concrete examples:

  • Item-level sentiment in 10-K filings: the sentiment of Item 1A (Risk Factors) is largely self-contained. Encoding it as a chunk captures the relevant signal without needing to attend to Item 7 (MD&A).
  • Speaker-turn classification in transcripts: whether a CEO response is positive or evasive can usually be assessed from the response and its preceding question, not the entire call.
  • Batch scoring at scale: Baltussen et al. (2025) describe pipelines processing 230,000 filings; at this scale, the cost difference between chunk-and-pool and full-context encoding is the difference between a workstation and a GPU cluster [ref:AT9Y26G9].
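The pooling step in chunk-and-pool is a simple reduction over per-chunk embeddings; a minimal mean-pooling sketch (the chunk count and embedding dimension are illustrative, not tied to any particular encoder):

```python
import numpy as np

def mean_pool(chunk_embeddings):
    """Collapse per-chunk embeddings (n_chunks, dim) into one document vector (dim,)."""
    return np.asarray(chunk_embeddings).mean(axis=0)

# e.g. 88 chunks of a 45,000-token filing, 384-dim encoder output
rng = np.random.default_rng(0)
chunks = rng.standard_normal((88, 384))
doc_vec = mean_pool(chunks)
print(doc_vec.shape)  # (384,)
```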

A distilled model such as DistilBERT, which retains 97% of BERT's language understanding while being 60% faster and 40% smaller [ref:86DRWBDS], can make the chunked approach substantially cheaper per document than a single long-context encoder pass.

When full context wins

Tasks requiring cross-section reasoning benefit from architectures that attend across the entire document:

  • Contradiction detection: identifying inconsistencies between the risk factors and management's discussion requires the model to compare statements that may be thousands of tokens apart.
  • Narrative threading: tracking how a specific topic (e.g., supply chain risk) develops across multiple sections of a filing requires global attention.
  • Whole-document classification: when the label applies to the entire document and depends on the interaction between sections, a single encoding pass avoids the information bottleneck of chunk-level pooling.

Cost estimation framework

Before choosing an architecture, estimate the per-document cost under each approach:

```python
from math import ceil

# Cost comparison: approximate per-document FLOPs under each approach
def cost_per_document(n_tokens, approach, model_params):
    layers, dim = model_params["layers"], model_params["dim"]
    if approach == "chunk_and_pool":
        chunk_size = model_params["chunk_size"]
        n_chunks = ceil(n_tokens / chunk_size)
        cost = n_chunks * chunk_size ** 2 * layers * dim
    elif approach == "sparse_long_context":
        window = model_params["window_size"]
        global_tokens = model_params["global_tokens"]
        cost = n_tokens * (window + global_tokens) * layers * dim
    return cost * model_params["cost_per_flop"]

# Example: 40,000-token 10-K (illustrative model parameters)
base_model = {"layers": 12, "dim": 768, "chunk_size": 512, "cost_per_flop": 1e-12}
long_model = {"layers": 12, "dim": 768, "window_size": 4096,
              "global_tokens": 64, "cost_per_flop": 1e-12}

chunk_cost = cost_per_document(40_000, "chunk_and_pool", base_model)
long_cost = cost_per_document(40_000, "sparse_long_context", long_model)
ratio = long_cost / chunk_cost  # typically 3-10x
```

The key question is whether the accuracy improvement from full-context encoding justifies this cost multiple across the entire document universe. FinMTEB results demonstrate that domain-adapted smaller models can outperform larger general-purpose ones on financial tasks, suggesting that model selection matters as much as context length [ref:FBHLRRYJ].

Worked Example

A pipeline scores 5,000 annual 10-K filings (average 45,000 tokens each) for narrative tone. Two approaches:

| Approach | Model | Tokens processed | Relative cost | Cross-section attention |
| --- | --- | --- | --- | --- |
| Chunk-and-pool | SBERT (512-token) | 5,000 x 88 chunks x 512 | 1x (baseline) | None |
| Sparse long-context | Long encoder (4,096 window) | 5,000 x 45,000 | ~5x | Within 4,096-token window + global |

For section-level sentiment (the most common use case), the chunk-and-pool approach produces comparable scores at one-fifth the cost. For a cross-section consistency check -- flagging filings where the risk-factor tone contradicts the MD&A tone -- the long-context approach catches contradictions that chunk-and-pool misses, because the relevant sections fall in different chunks with no attention path between them.

The practical recommendation: use chunk-and-pool as the default for section-level features, and reserve long-context encoding for specific tasks where cross-section reasoning has been shown to improve the downstream signal.

Practical Guidance

Start with chunk-and-pool. It is cheaper, simpler to debug, and produces strong baselines for most financial text tasks. Only upgrade to long-context encoding if you have evidence that cross-section attention improves the specific feature you are constructing.

Profile your cost curve before scaling. Run the cost estimation on a representative sample of 100 documents before committing to a full backtest. The cost ratio between approaches can vary from 3x to 20x depending on document length distribution.

Consider memory-efficient inference orthogonally. Techniques like gradient checkpointing, kernel-fused attention, and quantization reduce the cost of any architecture. Apply them regardless of whether you choose chunk-and-pool or full-context encoding. They are additive cost savings, not substitutes for the architectural decision.

Domain adaptation trumps context length. A finance-adapted 512-token model often outperforms a general-purpose 4,096-token model on financial tasks [ref:FBHLRRYJ]. Before investing in long-context infrastructure, verify that your base model is well-adapted to financial language.

Where It Fits in ML4T

Chapter 10 introduces the representation progression from lexical features through transformers and discusses model-selection tradeoffs including speed, context length, and bi-encoder versus cross-encoder architectures. This primer provides the cost-modeling framework that makes the context-length decision rigorous.

The decision made here depends on the chunking strategy from Primer 04 (which defines the fallback option) and feeds into Primer 06 (embedding-based surprise factors, where consistent document representations matter more than absolute accuracy on any single filing). Point-in-time discipline from Primer 01 applies to both approaches: the model checkpoint and the document version must respect availability timestamps regardless of encoding strategy.
