Distinguish lexical features, static embeddings, sequential models, and Transformers in terms of the information each representation preserves and loses
Explain how Transformer self-attention produces contextual embeddings and why this resolves key limitations of earlier NLP methods, including polysemy and long-range dependencies
Apply a practical financial NLP workflow that combines pre-trained checkpoints, domain adaptation when needed, and task fine-tuning for classification or extraction tasks
Design text-derived features such as sentiment, narrative surprise, or structured event signals using point-in-time-safe timestamps, model cutoffs, and aggregation rules
Evaluate text-derived signals using horizon-aware diagnostics, coverage-aware analysis, and event-time alignment rather than benchmark accuracy alone
Use token-level attribution and related diagnostics to audit, debug, and stress-test NLP features before deployment
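The point-in-time discipline in the objectives above can be made concrete in a few lines. The sketch below is illustrative (the function name, timestamps, and scores are invented, not from any specific library): each observation may only see features whose publication timestamp strictly precedes it, and a missing prior feature surfaces as a coverage gap rather than silent lookahead.

```python
import bisect
from datetime import datetime

def point_in_time_join(feature_times, feature_values, obs_time):
    """Return the latest feature value published strictly before obs_time.

    feature_times must be sorted ascending; returns None when no feature
    exists before obs_time (a coverage gap, not silent lookahead).
    """
    i = bisect.bisect_left(feature_times, obs_time)  # first index >= obs_time
    if i == 0:
        return None
    return feature_values[i - 1]

# Hypothetical sentiment scores stamped with their publication times.
times = [datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 16, 30)]
scores = [0.4, -0.2]

# A trade decided at the 2024-01-02 16:00 close may only see the 09:00 score.
print(point_in_time_join(times, scores, datetime(2024, 1, 2, 16, 0)))  # -> 0.4
```

Using `bisect_left` makes a feature stamped exactly at the observation time invisible to it, which is the conservative choice when publication latency is uncertain.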
10.1 Lexical and Statistical Models · 1 notebook
10.2 Static Embeddings · 2 notebooks
10.3 Sequential Models
10.4 Transformers · 2 notebooks
10.5 The Modern Feature Extraction Workflow · 4 notebooks
01 Word2Vec Training
02 Asset Embeddings
03 Sentiment Evolution
04 Bert Finetuning
05 Financial Ner Finetuning
06 Finbert Cross Dataset
07 News Return Signals
08 Text Feature Evaluation
09 Filing Text Signals
3 primers providing foundational concepts for this chapter.
Tim Loughran and Bill McDonald (2011) — The Journal of Finance · 711 citations
Generic sentiment dictionaries (e.g., Harvard-IV negative words) badly misclassify tone in 10-Ks, so the paper builds finance-specific dictionaries and shows they better link 10-K language to market reactions and firm outcomes.
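A dictionary-based tone score of the kind this line of work popularized can be sketched as follows. The word lists below are tiny stand-ins for illustration, not the actual Loughran-McDonald dictionaries, which contain thousands of entries:

```python
import re

# Tiny illustrative word lists -- NOT the real Loughran-McDonald dictionaries.
LM_NEGATIVE = {"impairment", "litigation", "restated", "weakness"}
LM_POSITIVE = {"achieve", "improved", "strong"}

def lm_tone(text):
    """Net tone = (positive hits - negative hits) / total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    neg = sum(t in LM_NEGATIVE for t in tokens)
    pos = sum(t in LM_POSITIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(lm_tone("The restated filings disclose litigation and material weakness."))
# -> -0.375  (3 negative hits out of 8 tokens)
```

The paper's core point survives even in this toy: a generic negative-word list would flag routine 10-K boilerplate ("tax", "liability"), while a finance-specific list does not.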
Tomas Mikolov et al. (2013) — arXiv preprint arXiv:1301.3781 · 33959 citations
The paper introduces CBOW and Skip-gram—two simple, fast neural architectures that learn high-quality word embeddings from billions of tokens and exhibit strong linear “analogy” structure (e.g., king − man + woman ≈ queen).
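The linear analogy structure can be illustrated with hand-placed toy vectors. Real Word2Vec embeddings are learned from corpora; the 2-D coordinates below are fabricated so that one axis reads as "gender" and the other as "royalty":

```python
import numpy as np

# Hand-placed toy embeddings: x-axis ~ "male", y-axis ~ "royal".
vecs = {
    "king":   np.array([1.0, 1.0]),
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([0.0, 0.0]),
    "queen":  np.array([0.0, 1.0]),
    "prince": np.array([0.9, 0.8]),
    "apple":  np.array([0.5, -0.3]),
}

def nearest(query, exclude):
    """Cosine-nearest vocabulary word to a query vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], query))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # -> queen
```

Excluding the query words themselves mirrors the standard analogy-evaluation protocol, since the nearest neighbour of `king - man + woman` is often `king` itself.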
Scott M. Lundberg et al. (2017) — Advances in Neural Information Processing Systems
This paper proposes SHAP, a unified framework showing that many popular explanation methods are all approximations to a single, uniquely justified set of feature attributions (Shapley values) with desirable properties like local accuracy and consistency.
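For toy feature counts, the Shapley values that SHAP approximates can be computed exactly by brute force over all coalitions. The model `f` and inputs below are hypothetical; the point is the local-accuracy property, where attributions sum to the gap between the prediction and the baseline prediction:

```python
from itertools import combinations
from math import factorial

def shapley(f, x, baseline):
    """Exact Shapley attributions for f at x against a baseline input.

    v(S) evaluates f with features in S taken from x and the rest from
    baseline -- brute force over all 2^n coalitions (toy sizes only).
    """
    n = len(x)
    def v(S):
        return f([x[j] if j in S else baseline[j] for j in range(n)])
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# Hypothetical model with an interaction term.
f = lambda z: 2 * z[0] + z[0] * z[1]
x, base = [1.0, 1.0], [0.0, 0.0]
phi = shapley(f, x, base)
print(phi, sum(phi), f(x) - f(base))  # attributions sum to the prediction gap
```

Here the interaction credit is split between the two features, which is exactly the symmetry property that distinguishes Shapley values from one-at-a-time perturbation.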
Ashish Vaswani et al. (2017) — Advances in Neural Information Processing Systems
This paper introduces the Transformer, a sequence-to-sequence model that replaces recurrence and convolutions with multi-head self-attention, achieving state-of-the-art translation quality with much faster, highly parallel training.
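A single head of scaled dot-product self-attention reduces to a few matrix operations. This is a minimal NumPy sketch with random weights, not a full multi-head Transformer layer (no masking, no output projection, no residuals):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    Each output row is a context-weighted mix of value vectors, which is
    what makes the resulting token representations contextual.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (seq, seq) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # 4 tokens, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextual vector per token
```

Because every token attends to every other token in one step, distance in the sequence costs nothing, which is how attention sidesteps the long-range-dependency problem of recurrent models.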
Nils Reimers and Iryna Gurevych (2019) — arXiv:1908.10084 [cs] · 16856 citations
SBERT modifies BERT into a siamese/triplet architecture to produce cosine-comparable sentence embeddings that make semantic search and clustering practical (seconds instead of hours) while retaining strong accuracy on STS-style tasks.
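The retrieval pattern SBERT makes practical, embed the corpus once, then answer queries with a single matrix product, can be sketched as follows. The corpus and query embeddings here are random stand-ins for real sentence vectors:

```python
import numpy as np

def cosine_top_k(query_vec, corpus_mat, k=2):
    """Rank corpus rows by cosine similarity to a query embedding.

    SBERT-style pipelines precompute corpus_mat once, so each search is
    one matrix product instead of n cross-encoder forward passes.
    """
    q = query_vec / np.linalg.norm(query_vec)
    M = corpus_mat / np.linalg.norm(corpus_mat, axis=1, keepdims=True)
    sims = M @ q
    top = np.argsort(-sims)[:k]
    return list(zip(top.tolist(), sims[top].tolist()))

# Hypothetical 4-document corpus of 5-dim sentence embeddings.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(4, 5))
query = corpus[2] + 0.01 * rng.normal(size=5)   # near-duplicate of doc 2
print(cosine_top_k(query, corpus)[0][0])  # -> 2
```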
This paper introduces FinBERT, a BERT-based language model further pre-trained on financial text, and demonstrates its state-of-the-art performance on financial sentiment analysis tasks, outperforming existing methods and highlighting the benefits of transfer learning in finance.
Jacob Devlin et al. (2019) — Association for Computational Linguistics · 112230 citations
BERT introduces a Transformer encoder pre-trained with masked-token prediction (and next-sentence prediction) to learn deep bidirectional language representations that can be fine-tuned with minimal task-specific changes to achieve state-of-the-art NLP results.
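The masked-token objective can be illustrated with a minimal masking routine. This sketch omits BERT's 80/10/10 replacement refinement and WordPiece tokenization, and the sentence is an invented example:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """BERT-style masking sketch: hide a fraction of tokens and record
    their positions as prediction targets for the model."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok        # ground truth the model must recover
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

toks = "net revenue increased due to higher loan demand".split()
m, t = mask_tokens(toks)
print(m, t)
```

Because the model must predict each hidden token from both its left and right neighbours, the learned representations are bidirectional, unlike a left-to-right language model.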
Tim Loughran and Bill McDonald (2020) — Annual Review of Financial Economics · 169 citations
A practitioner-oriented review of how finance uses text (social media, politics, fraud) that argues “readability” metrics like the Fog Index are mis-specified for 10-Ks and should be replaced by text-based measures of firm complexity.
Allen Huang et al. (2020) — SSRN Electronic Journal · 13 citations
The paper builds FinBERT, a finance-domain BERT model, and shows it measures sentiment in financial text far more accurately than common dictionary and bag-of-words methods—materially improving event-study and earnings-prediction inferences.
Rajeev Bhargava et al. (2023) — The Journal of Portfolio Management · 4 citations
The paper builds daily, media-based “narrative” indicators (coverage intensity + sentiment) and shows they explain and sometimes predict market moves, and can be used to improve asset allocation and to build portfolios with explicit exposure to a narrative (e.g., COVID-19 recovery).
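A minimal version of such daily narrative indicators, coverage intensity plus average sentiment per narrative, can be sketched as below; it assumes an upstream pipeline has already tagged each article with a narrative label and a sentiment score, and all values here are hypothetical:

```python
from collections import defaultdict

def narrative_indicators(articles):
    """Daily coverage intensity (article count) and mean sentiment per
    (day, narrative) pair -- a toy version of media-based indicators."""
    counts = defaultdict(int)
    sent_sum = defaultdict(float)
    for day, narrative, sentiment in articles:
        key = (day, narrative)
        counts[key] += 1
        sent_sum[key] += sentiment
    return {k: (counts[k], sent_sum[k] / counts[k]) for k in counts}

# Hypothetical tagged article stream: (date, narrative, sentiment score).
stream = [
    ("2024-03-01", "recovery", 0.75),
    ("2024-03-01", "recovery", 0.25),
    ("2024-03-01", "inflation", -0.5),
]
print(narrative_indicators(stream)[("2024-03-01", "recovery")])  # -> (2, 0.5)
```

Separating intensity from tone matters: a narrative can be heavily covered but neutral, or thinly covered but extreme, and the two carry different information for allocation.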
Qianqian Xie et al. (2024) — Advances in Neural Information Processing Systems · 123 citations
FinBen is a large open-source benchmark (42 datasets, 24 tasks) designed to measure what today’s LLMs can and cannot do in real financial NLP, forecasting, risk, and trading/agent settings.
Leland Bybee et al. (2024) — The Journal of Finance · 84 citations
This paper measures the state of the economy by applying topic modeling to the full text of *Wall Street Journal* business news, showing that the resulting topic-attention series track economic activity and forecast stock market returns and macroeconomic dynamics.
This paper introduces "asset embeddings" derived from institutional portfolio holdings using AI/ML techniques, demonstrating their superior ability to predict relative valuations, explain return comovement, and forecast portfolio decisions compared to traditional firm characteristics or generic text-based embeddings.
LSTM can solve hard long time lag problems
Sepp Hochreiter and Jürgen Schmidhuber (1996) — Advances in Neural Information Processing Systems · 1063 citations
This paper introduces Long Short-Term Memory (LSTM) networks and demonstrates their ability to solve complex sequence learning tasks with long time lags, outperforming traditional recurrent neural networks and other methods.