Chapter 10

Text Feature Engineering

5 sections · 16 notebooks · 20 references

Learning Objectives

  • Distinguish lexical features, static embeddings, sequential models, and Transformers in terms of the information each representation preserves and loses
  • Explain how Transformer self-attention produces contextual embeddings and why this resolves key limitations of earlier NLP methods, including polysemy and long-range dependencies
  • Apply a practical financial NLP workflow that combines pre-trained checkpoints, domain adaptation when needed, and task fine-tuning for classification or extraction tasks
  • Design text-derived features such as sentiment, narrative surprise, or structured event signals using point-in-time-safe timestamps, model cutoffs, and aggregation rules
  • Evaluate text-derived signals using horizon-aware diagnostics, coverage-aware analysis, and event-time alignment rather than benchmark accuracy alone
  • Use token-level attribution and related diagnostics to audit, debug, and stress-test NLP features before deployment
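The objective on point-in-time-safe feature design can be sketched concretely. The following is a minimal illustration only, not the chapter's implementation: it assumes a pandas DataFrame of article timestamps and sentiment scores and a hypothetical 09:30 market-open cutoff, and it ignores trading-calendar details such as weekends and holidays.

```python
# Hypothetical sketch: point-in-time-safe aggregation of per-article
# sentiment into a daily feature. All names (articles, CUTOFF) are
# illustrative assumptions, not from the chapter.
import pandas as pd

articles = pd.DataFrame({
    "published": pd.to_datetime([
        "2024-03-04 08:15",  # before the 09:30 cutoff -> usable on 2024-03-04
        "2024-03-04 14:40",  # after the cutoff -> first usable on 2024-03-05
        "2024-03-05 07:55",
    ]),
    "sentiment": [0.6, -0.2, 0.1],
})

CUTOFF = pd.Timestamp("09:30").time()  # assumed market-open cutoff

# Assign each article to the first date on which it was knowable:
# same day if published before the cutoff, otherwise the next day.
effective_date = articles["published"].dt.normalize() + pd.to_timedelta(
    (articles["published"].dt.time > CUTOFF).astype(int), unit="D"
)

# Aggregation rule: mean sentiment per effective date.
daily = articles.groupby(effective_date)["sentiment"].mean()
print(daily)
```

The key point the sketch illustrates is that aggregation is keyed on when the information became available, not on the publication date alone; any look-ahead in this step silently leaks future information into the feature.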
Figure 10.1
10.1 Lexical and Statistical Models (1 notebook)

10.2 Static Embeddings (2 notebooks)

10.3 Sequential Models

10.4 Transformers (2 notebooks)

10.5 The Modern Feature Extraction Workflow (4 notebooks)