Distinguish lexical features, static embeddings, sequential models, and Transformers in terms of the information each representation preserves and loses
Explain how Transformer self-attention produces contextual embeddings and why this resolves key limitations of earlier NLP methods, including polysemy and long-range dependencies
Apply a practical financial NLP workflow that combines pre-trained checkpoints, domain adaptation when needed, and task fine-tuning for classification or extraction tasks
Design text-derived features such as sentiment, narrative surprise, or structured event signals using point-in-time-safe timestamps, model cutoffs, and aggregation rules
Evaluate text-derived signals using horizon-aware diagnostics, coverage-aware analysis, and event-time alignment rather than benchmark accuracy alone
Use token-level attribution and related diagnostics to audit, debug, and stress-test NLP features before deployment
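The point-in-time discipline in the objectives above can be made concrete in a few lines. The sketch below is illustrative (the function name, timestamps, and scores are invented, not from any specific library): each observation may only see features whose publication timestamp strictly precedes it, and a missing prior feature surfaces as a coverage gap rather than silent lookahead.

```python
import bisect
from datetime import datetime

def point_in_time_join(feature_times, feature_values, obs_time):
    """Return the latest feature value published strictly before obs_time.

    feature_times must be sorted ascending; returns None when no feature
    exists before obs_time (a coverage gap, not silent lookahead).
    """
    i = bisect.bisect_left(feature_times, obs_time)  # first index >= obs_time
    if i == 0:
        return None
    return feature_values[i - 1]

# Hypothetical sentiment scores stamped with their publication times.
times = [datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 16, 30)]
scores = [0.4, -0.2]

# A trade decided at the 2024-01-02 16:00 close may only see the 09:00 score.
print(point_in_time_join(times, scores, datetime(2024, 1, 2, 16, 0)))  # -> 0.4
```

Using `bisect_left` makes a feature stamped exactly at the observation time invisible to it, which is the conservative choice when publication latency is uncertain.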
10.1 Lexical and Statistical Models · 1 notebook
10.2 Static Embeddings · 2 notebooks
10.3 Sequential Models
10.4 Transformers · 2 notebooks
10.5 The Modern Feature Extraction Workflow · 4 notebooks
01 Word2Vec Training
02 Asset Embeddings
03 Sentiment Evolution
04 Bert Finetuning
05 Financial Ner Finetuning
06 Finbert Cross Dataset
07 News Return Signals
08 Text Feature Evaluation
09 Filing Text Signals
3 primers providing foundational concepts for this chapter.
Tim Loughran and Bill McDonald (2011) — The Journal of Finance · 711 citations
Generic sentiment dictionaries (e.g., Harvard-IV negative words) badly misclassify tone in 10-Ks, so the paper builds finance-specific dictionaries and shows they better link 10-K language to market reactions and firm outcomes.
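A dictionary-based tone score of the kind this line of work popularized can be sketched as follows. The word lists below are tiny stand-ins for illustration, not the actual Loughran-McDonald dictionaries, which contain thousands of entries:

```python
import re

# Tiny illustrative word lists -- NOT the real Loughran-McDonald dictionaries.
LM_NEGATIVE = {"impairment", "litigation", "restated", "weakness"}
LM_POSITIVE = {"achieve", "improved", "strong"}

def lm_tone(text):
    """Net tone = (positive hits - negative hits) / total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    neg = sum(t in LM_NEGATIVE for t in tokens)
    pos = sum(t in LM_POSITIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(lm_tone("The restated filings disclose litigation and material weakness."))
# -> -0.375  (3 negative hits out of 8 tokens)
```

The paper's core point survives even in this toy: a generic negative-word list would flag routine 10-K boilerplate ("tax", "liability"), while a finance-specific list does not.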
Tomas Mikolov et al. (2013) — arXiv preprint arXiv:1301.3781 · 33959 citations
The paper introduces CBOW and Skip-gram—two simple, fast neural architectures that learn high-quality word embeddings from billions of tokens and exhibit strong linear “analogy” structure (e.g., king − man + woman ≈ queen).
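The linear analogy structure can be illustrated with hand-placed toy vectors. Real Word2Vec embeddings are learned from corpora; the 2-D coordinates below are fabricated so that one axis reads as "gender" and the other as "royalty":

```python
import numpy as np

# Hand-placed toy embeddings: x-axis ~ "male", y-axis ~ "royal".
vecs = {
    "king":   np.array([1.0, 1.0]),
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([0.0, 0.0]),
    "queen":  np.array([0.0, 1.0]),
    "prince": np.array([0.9, 0.8]),
    "apple":  np.array([0.5, -0.3]),
}

def nearest(query, exclude):
    """Cosine-nearest vocabulary word to a query vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], query))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # -> queen
```

Excluding the query words themselves mirrors the standard analogy-evaluation protocol, since the nearest neighbour of `king - man + woman` is often `king` itself.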
Scott M. Lundberg et al. (2017) — Advances in Neural Information Processing Systems
This paper proposes SHAP, a unified framework showing that many popular explanation methods are all approximations to a single, uniquely justified set of feature attributions (Shapley values) with desirable properties like local accuracy and consistency.
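For toy feature counts, the Shapley values that SHAP approximates can be computed exactly by brute force over all coalitions. The model `f` and inputs below are hypothetical; the point is the local-accuracy property, where attributions sum to the gap between the prediction and the baseline prediction:

```python
from itertools import combinations
from math import factorial

def shapley(f, x, baseline):
    """Exact Shapley attributions for f at x against a baseline input.

    v(S) evaluates f with features in S taken from x and the rest from
    baseline -- brute force over all 2^n coalitions (toy sizes only).
    """
    n = len(x)
    def v(S):
        return f([x[j] if j in S else baseline[j] for j in range(n)])
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# Hypothetical model with an interaction term.
f = lambda z: 2 * z[0] + z[0] * z[1]
x, base = [1.0, 1.0], [0.0, 0.0]
phi = shapley(f, x, base)
print(phi, sum(phi), f(x) - f(base))  # attributions sum to the prediction gap
```

Here the interaction credit is split between the two features, which is exactly the symmetry property that distinguishes Shapley values from one-at-a-time perturbation.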
Ashish Vaswani et al. (2017) — Advances in Neural Information Processing Systems
This paper introduces the Transformer, a sequence-to-sequence model that replaces recurrence and convolutions with multi-head self-attention, achieving state-of-the-art translation quality with much faster, highly parallel training.
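A single head of scaled dot-product self-attention reduces to a few matrix operations. This is a minimal NumPy sketch with random weights, not a full multi-head Transformer layer (no masking, no output projection, no residuals):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    Each output row is a context-weighted mix of value vectors, which is
    what makes the resulting token representations contextual.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (seq, seq) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # 4 tokens, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextual vector per token
```

Because every token attends to every other token in one step, distance in the sequence costs nothing, which is how attention sidesteps the long-range-dependency problem of recurrent models.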
Nils Reimers and Iryna Gurevych (2019) — arXiv:1908.10084 [cs] · 16856 citations
SBERT modifies BERT into a siamese/triplet architecture to produce cosine-comparable sentence embeddings that make semantic search and clustering practical (seconds instead of hours) while retaining strong accuracy on STS-style tasks.
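The retrieval pattern SBERT makes practical, embed the corpus once, then answer queries with a single matrix product, can be sketched as follows. The corpus and query embeddings here are random stand-ins for real sentence vectors:

```python
import numpy as np

def cosine_top_k(query_vec, corpus_mat, k=2):
    """Rank corpus rows by cosine similarity to a query embedding.

    SBERT-style pipelines precompute corpus_mat once, so each search is
    one matrix product instead of n cross-encoder forward passes.
    """
    q = query_vec / np.linalg.norm(query_vec)
    M = corpus_mat / np.linalg.norm(corpus_mat, axis=1, keepdims=True)
    sims = M @ q
    top = np.argsort(-sims)[:k]
    return list(zip(top.tolist(), sims[top].tolist()))

# Hypothetical 4-document corpus of 5-dim sentence embeddings.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(4, 5))
query = corpus[2] + 0.01 * rng.normal(size=5)   # near-duplicate of doc 2
print(cosine_top_k(query, corpus)[0][0])  # -> 2
```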
This paper introduces FinBERT, a BERT-based language model further pre-trained on financial text, and demonstrates its state-of-the-art performance on financial sentiment analysis tasks, outperforming existing methods and highlighting the benefits of transfer learning in finance.
Jacob Devlin et al. (2019) — Association for Computational Linguistics · 112230 citations
BERT introduces a Transformer encoder pre-trained with masked-token prediction (and next-sentence prediction) to learn deep bidirectional language representations that can be fine-tuned with minimal task-specific changes to achieve state-of-the-art NLP results.
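The masked-token objective can be illustrated with a minimal masking routine. This sketch omits BERT's 80/10/10 replacement refinement and WordPiece tokenization, and the sentence is an invented example:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """BERT-style masking sketch: hide a fraction of tokens and record
    their positions as prediction targets for the model."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok        # ground truth the model must recover
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

toks = "net revenue increased due to higher loan demand".split()
m, t = mask_tokens(toks)
print(m, t)
```

Because the model must predict each hidden token from both its left and right neighbours, the learned representations are bidirectional, unlike a left-to-right language model.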
Tim Loughran and Bill McDonald (2020) — Annual Review of Financial Economics · 169 citations
A practitioner-oriented review of how finance uses text (social media, politics, fraud) that argues “readability” metrics like the Fog Index are mis-specified for 10-Ks and should be replaced by text-based measures of firm complexity.
Allen Huang et al. (2020) — SSRN Electronic Journal · 13 citations
The paper builds FinBERT, a finance-domain BERT model, and shows it measures sentiment in financial text far more accurately than common dictionary and bag-of-words methods—materially improving event-study and earnings-prediction inferences.
Rajeev Bhargava et al. (2023) — The Journal of Portfolio Management · 4 citations
The paper builds daily, media-based “narrative” indicators (coverage intensity + sentiment) and shows they explain and sometimes predict market moves, and can be used to improve asset allocation and to build portfolios with explicit exposure to a narrative (e.g., COVID-19 recovery).
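A minimal version of such daily narrative indicators, coverage intensity plus average sentiment per narrative, can be sketched as below; it assumes an upstream pipeline has already tagged each article with a narrative label and a sentiment score, and all values here are hypothetical:

```python
from collections import defaultdict

def narrative_indicators(articles):
    """Daily coverage intensity (article count) and mean sentiment per
    (day, narrative) pair -- a toy version of media-based indicators."""
    counts = defaultdict(int)
    sent_sum = defaultdict(float)
    for day, narrative, sentiment in articles:
        key = (day, narrative)
        counts[key] += 1
        sent_sum[key] += sentiment
    return {k: (counts[k], sent_sum[k] / counts[k]) for k in counts}

# Hypothetical tagged article stream: (date, narrative, sentiment score).
stream = [
    ("2024-03-01", "recovery", 0.75),
    ("2024-03-01", "recovery", 0.25),
    ("2024-03-01", "inflation", -0.5),
]
print(narrative_indicators(stream)[("2024-03-01", "recovery")])  # -> (2, 0.5)
```

Separating intensity from tone matters: a narrative can be heavily covered but neutral, or thinly covered but extreme, and the two carry different information for allocation.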
Qianqian Xie et al. (2024) — Advances in Neural Information Processing Systems · 123 citations
FinBen is a large open-source benchmark (42 datasets, 24 tasks) designed to measure what today’s LLMs can and cannot do in real financial NLP, forecasting, risk, and trading/agent settings.
Leland Bybee et al. (2024) — The Journal of Finance · 84 citations
This paper measures the state of the economy by applying topic modeling to the full text of *Wall Street Journal* business news, showing that the resulting topic-attention series track economic activity and forecast stock market returns and macroeconomic dynamics.
This paper introduces "asset embeddings" derived from institutional portfolio holdings using AI/ML techniques, demonstrating their superior ability to predict relative valuations, explain return comovement, and forecast portfolio decisions compared to traditional firm characteristics or generic text-based embeddings.
LSTM can solve hard long time lag problems
Sepp Hochreiter and Jürgen Schmidhuber (1996) — Advances in Neural Information Processing Systems · 1063 citations
This paper introduces Long Short-Term Memory (LSTM) networks and demonstrates their ability to solve complex sequence learning tasks with long time lags, outperforming traditional recurrent neural networks and other methods.