Chapter 4

Fundamental and Alternative Data

5 sections · 13 notebooks · 17 references

Learning Objectives

  • Explain why point-in-time correctness and entity consistency are the core engineering constraints for fundamental and alternative data.
  • Implement bitemporal storage and as-of query patterns for revision-prone financial datasets.
  • Build a point-in-time corporate fundamentals pipeline from SEC EDGAR and XBRL filing histories.
  • Design time-valid entity, security, and contract mapping workflows using deterministic, probabilistic, and embedding-based resolution methods with appropriate QA gates.
  • Apply point-in-time alignment rules to macro, commodity, and on-chain datasets, including release timestamps, vintages, contract mapping, and finality policies.
  • Evaluate alternative datasets for incremental signal, data quality, legal and compliance risk, and commercial or engineering feasibility.
  • Extract, clean, and store SEC filing text as an auditable point-in-time corpus for downstream NLP feature engineering.
Figure 4.1

4.1

The Point-in-Time Pipeline

This section operationalizes point-in-time correctness for fundamental data by building a bitemporal pipeline from SEC EDGAR filings. The core challenge is that amended filings, taxonomy changes, and corporate actions create multiple versions of the same reporting period — a company restating Q3 EPS months later creates lookahead bias if the restated value is used at the original date. The reader learns as-of query logic for reconstructing what was known at any historical decision time, as well as authoritative timestamp conventions for SEC, macro, commodity, and crypto data.
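The restatement problem described above can be sketched with a minimal as-of query in plain Python. The `Observation` record and `as_of` function are illustrative names, not the chapter's actual implementation; the point is that each version of a reported number carries its own "known at" timestamp, and a backtest queries by decision time rather than by reporting period alone.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Observation:
    period: str     # reporting period, e.g. "2023Q3"
    value: float    # reported figure, e.g. EPS
    known_at: date  # when this version became publicly known (filing/acceptance date)

def as_of(records, decision_time):
    """Return the latest value known for each period as of decision_time."""
    latest = {}
    for r in sorted(records, key=lambda r: r.known_at):
        if r.known_at <= decision_time:
            latest[r.period] = r.value  # later versions overwrite earlier ones
    return latest

# Original Q3 filing, then a restatement months later (hypothetical numbers).
history = [
    Observation("2023Q3", 1.25, date(2023, 11, 1)),  # original 10-Q
    Observation("2023Q3", 0.95, date(2024, 2, 15)),  # restated in an amendment
]

as_of(history, date(2023, 12, 1))  # {"2023Q3": 1.25} — pre-restatement view
as_of(history, date(2024, 3, 1))   # {"2023Q3": 0.95} — post-restatement view
```

A backtest running at December 2023 sees the original 1.25; using 0.95 there would be lookahead bias.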

1 notebook

4.2

Entity Resolution and Mapping

A three-stage hierarchical approach to entity resolution: deterministic matching on standard identifiers (LEI, CIK, FIGI, CUSIP/ISIN), probabilistic matching using string-similarity algorithms for sources lacking identifiers, and embedding-based semantic matching for complex cases like subsidiary-to-parent linking. A false-positive match poisons every downstream join, making precision more important than recall. The reader learns to build a temporally aware master security database with time-valid identifier mappings and ongoing QA metrics.
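The first two stages of the cascade can be sketched with the standard library alone. This is a toy illustration, not the chapter's pipeline: the `resolve` function, the 0.90 threshold, and the tiny master table are all hypothetical, and `difflib.SequenceMatcher` stands in for the more robust similarity algorithms a production system would use.

```python
from difflib import SequenceMatcher

def resolve(record, master):
    """Hierarchical resolution: exact identifier match first, then name similarity."""
    # Stage 1: deterministic — exact match on a standard identifier (here, CIK).
    if record.get("cik"):
        for m in master:
            if m.get("cik") == record["cik"]:
                return m["entity_id"], "deterministic"
    # Stage 2: probabilistic — normalized-name similarity with a strict threshold,
    # since a false positive poisons downstream joins (precision over recall).
    def norm(s):
        return " ".join(s.lower().replace(",", " ").replace(".", " ").split())
    best, best_score = None, 0.0
    for m in master:
        score = SequenceMatcher(None, norm(record["name"]), norm(m["name"])).ratio()
        if score > best_score:
            best, best_score = m, score
    if best is not None and best_score >= 0.90:  # illustrative gate; below it, route to review
        return best["entity_id"], "probabilistic"
    return None, "unresolved"

master = [
    {"entity_id": "E1", "cik": "0000320193", "name": "Apple Inc."},
    {"entity_id": "E2", "cik": "0000789019", "name": "Microsoft Corporation"},
]
resolve({"cik": "0000320193", "name": "APPLE INC"}, master)        # deterministic hit
resolve({"cik": None, "name": "Microsoft Corporation Ltd."}, master)  # fuzzy-name hit
```

Records that fall below the threshold are returned as unresolved rather than force-matched — the precision-over-recall rule in action.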

5 notebooks

4.3

Fundamentals Across the Asset-Class Spectrum

Point-in-time engineering is generalized beyond equities to macro/sovereign data, commodities, and crypto. Each asset class has revision-prone fundamentals with distinct timestamp authorities: macro data uses FRED's ALFRED vintage system for revision histories, commodity releases (EIA, USDA WASDE, CFTC COT) must be mapped to the correct tradable contract using consistent roll logic, and crypto on-chain fundamentals (active addresses, hash rate, TVL) have unique PIT requirements around block finality.
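The contract-mapping idea for commodity releases can be illustrated with a toy roll rule. The expiry calendar, contract codes, and five-day roll window below are hypothetical, and `front_contract` is a simplified stand-in for whatever roll logic the chapter's notebooks actually use; the point is that a release date deterministically maps to one tradable contract.

```python
from datetime import date, timedelta

# Hypothetical expiry calendar for a futures chain (contract code -> last trade date).
expiries = {
    "CLZ23": date(2023, 11, 20),
    "CLF24": date(2023, 12, 19),
    "CLG24": date(2024, 1, 22),
}

def front_contract(release_date, expiries, roll_days=5):
    """Map a data release to the contract it should be aligned with.

    Rolls `roll_days` before expiry so the release is attributed to the
    contract a trader would actually hold — a simple, illustrative rule.
    """
    for code, expiry in sorted(expiries.items(), key=lambda kv: kv[1]):
        if release_date <= expiry - timedelta(days=roll_days):
            return code
    raise ValueError("release date beyond known contract chain")

front_contract(date(2023, 11, 8), expiries)   # 'CLZ23' — before the roll window
front_contract(date(2023, 11, 17), expiries)  # 'CLF24' — inside the Dec roll window
```

Applying the same rule to every release in a history guarantees consistent roll logic, so an EIA or WASDE print is never joined to a contract that was already being rolled out of.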

3 notebooks

4.4

Alternative Data: From Evaluation to Integration

A structured evaluation framework for alternative data acquisition, organized around signal content (uniqueness, decay, incrementality), data quality (versioning, coverage, latency), legal risks as a hard-fail gate (MNPI screening, privacy compliance), and commercial costs. Published return predictors lose ~58% of performance post-publication (McLean and Pontiff 2016), establishing a decay baseline. The key decision rule: privilege hard constraints over marginal score differences and only compare datasets that clear hard-fail gates.
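The decision rule — hard-fail gates dominate marginal score differences — can be sketched as a small scoring function. The criteria names, 1–5 scale, and weights below are illustrative placeholders, not the framework's actual rubric.

```python
def evaluate_dataset(scores, hard_fail_flags):
    """Score a candidate dataset only if it clears every hard-fail gate.

    `scores` holds 1-5 ratings on soft criteria; `hard_fail_flags` are booleans
    for legal/compliance gates (True = fail). Weights are illustrative.
    """
    if any(hard_fail_flags.values()):
        failed = [k for k, v in hard_fail_flags.items() if v]
        return {"eligible": False, "failed_gates": failed, "score": None}
    weights = {"signal": 0.4, "quality": 0.3, "feasibility": 0.3}
    total = sum(weights[k] * scores[k] for k in weights)
    return {"eligible": True, "failed_gates": [], "score": round(total, 2)}

candidate = {"signal": 4, "quality": 3, "feasibility": 5}
evaluate_dataset(candidate, {"mnpi_risk": False, "privacy_violation": False})
# eligible, score 4.0
evaluate_dataset(candidate, {"mnpi_risk": True, "privacy_violation": False})
# ineligible regardless of scores: gates dominate marginal score differences
```

Only datasets that clear every gate get a numeric score, so comparisons happen exclusively among eligible candidates.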

4 notebooks

4.5

Case Study: Text Data for NLP Features

The engineering pipeline for extracting model-ready text from SEC 10-K filings, focusing on MD&A (Item 7) and Risk Factors (Item 1A) as sections containing qualitative information beyond accounting line items. The four-step pipeline covers document selection, section extraction with quality checks, cleaning that preserves paragraph structure, and PIT-correct storage using accession numbers with both filing dates and SEC acceptance timestamps. This corpus serves as input for Chapter 10's NLP feature engineering.
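The section-extraction step can be sketched with a regex approach on plain text. This is a deliberately simplified stand-in for the chapter's pipeline: `extract_item` and its parameters are hypothetical, real filings are HTML with table-of-contents duplicates and heading variants, and the length check below is only the crudest of the quality gates mentioned above.

```python
import re

def extract_item(text, start_pat, end_pat, min_chars=200):
    """Pull one named section out of a 10-K, with a basic length sanity check.

    start_pat/end_pat are regexes for the section headers. Taking the LAST
    start-header occurrence skips the table-of-contents duplicate.
    """
    starts = [m.end() for m in re.finditer(start_pat, text, re.IGNORECASE)]
    if not starts:
        return None
    body = text[starts[-1]:]
    end = re.search(end_pat, body, re.IGNORECASE)
    section = body[:end.start()] if end else body
    # Collapse runs of spaces/tabs but keep newlines, preserving paragraph structure.
    section = re.sub(r"[ \t]+", " ", section).strip()
    return section if len(section) >= min_chars else None  # crude quality gate

filing = (
    "Item 7. Management's Discussion ... (table of contents)\n"
    "... front matter ...\n"
    "Item 7. Management's Discussion and Analysis\n"
    "Revenue grew due to strong demand across segments.\n"
    "Item 7A. Quantitative and Qualitative Disclosures\n"
)
extract_item(filing, r"Item\s+7\.", r"Item\s+7A\.", min_chars=10)
# returns the MD&A body, with the table-of-contents occurrence skipped
```

In the chapter's pipeline, the extracted text would then be stored keyed by accession number alongside the filing date and SEC acceptance timestamp, making the corpus PIT-correct.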

1 notebook