Chapter 2

The Financial Data Universe

4 sections · 16 notebooks · 25 references

Learning Objectives

  • Distinguish among market, fundamental, and alternative data, and explain how dataset definitions shape what each source means in research and trading applications
  • Compare the observability, conventions, and engineering constraints of major asset classes, and identify how market structure changes what can be measured and modeled
  • Apply a financial data quality framework to diagnose common failure modes, especially point-in-time violations, survivorship bias, corporate action errors, and identifier mismatches
  • Conduct vendor due diligence across data quality, legal and compliance, and technical and commercial dimensions
  • Choose storage and query architectures that fit research and production needs, including when to use partitioned files, embedded analytical databases, or server-based systems
Figure 2.1
2.1

A Modern Taxonomy of Financial Data

Financial data is classified into market data, fundamental data, and alternative data, with the emphasis that every dataset embeds hidden definitions (timestamp conventions, adjustment policies, identifier schemes, revision rules) that must be understood before modeling. Market data is presented as a hierarchy from raw event streams to aggregated bars; fundamental data is characterized by release lags and institutional rules that vary by asset class; and alternative data carries a high validation burden around coverage stability, reproducibility, and usage rights. The reader learns to lock down four items before any research: timestamps, corporate action adjustments, identifiers, and revision policy. **Asset-class EDA notebooks** (Sections 2.1-2.2).
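The four lock-down items lend themselves to a lightweight pre-research checklist. The sketch below is purely illustrative (the `DatasetSpec` name and its fields are assumptions, not from the chapter): it records the four definitions for a dataset and flags whether any are missing before modeling begins.

```python
from dataclasses import dataclass


@dataclass
class DatasetSpec:
    """Hypothetical pre-research checklist: the four definitions to
    lock down before any research, per Section 2.1."""
    name: str
    timestamp_convention: str   # e.g. "UTC-normalized event time"
    adjustment_policy: str      # e.g. "split-adjusted, dividends unadjusted"
    identifier_scheme: str      # e.g. "FIGI", "CUSIP", "ticker (unstable)"
    revision_policy: str        # e.g. "point-in-time, first print retained"

    def is_complete(self) -> bool:
        # Research-ready only when all four items are documented.
        return all([self.timestamp_convention, self.adjustment_policy,
                    self.identifier_scheme, self.revision_policy])


spec = DatasetSpec(
    name="us_equities_eod",                               # hypothetical dataset
    timestamp_convention="UTC, official close",
    adjustment_policy="split-adjusted, dividends unadjusted",
    identifier_scheme="FIGI",
    revision_policy="point-in-time, first print retained",
)
print(spec.is_complete())  # True
```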

2.2

The Asset-Class Market Data Landscape

Eight asset classes are surveyed (equities, ETPs, futures, options, digital assets, FX, fixed income, and commodities), documenting observability levels, key failure modes, and the engineering decisions each class requires. The section provides concrete data on market sizes, explains how market structure determines what data is available and how informative it is, and identifies class-specific pitfalls: corporate action discontinuities for equities, roll rules for futures, surface construction for options, venue integrity screening for crypto, and close-convention ambiguity for FX. **Asset-class engineering notebooks**.

12 notebooks
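As one concrete illustration of the equity corporate-action pitfall, a raw price series is discontinuous across a split unless historical bars are back-adjusted. A minimal sketch, under assumed names and toy data (the function and dates are illustrative, not from the notebooks):

```python
def back_adjust_for_splits(prices, splits):
    """Back-adjust a date-ordered price series for stock splits.

    prices: list of (date, close) in ascending date order (ISO dates).
    splits: dict {effective_date: ratio}, e.g. 2.0 for a 2-for-1 split.
            Bars strictly before the effective date are divided by the ratio.
    """
    adjusted = []
    for date, close in prices:
        # Accumulate the ratios of every split that takes effect after this bar.
        factor = 1.0
        for eff_date, ratio in splits.items():
            if date < eff_date:
                factor *= ratio
        adjusted.append((date, close / factor))
    return adjusted


# Toy example: a 10-for-1 split effective 2024-06-10.
raw = [("2024-06-07", 1200.0), ("2024-06-10", 120.0)]
adj = back_adjust_for_splits(raw, {"2024-06-10": 10.0})
# The pre-split close becomes 120.0, removing the artificial discontinuity.
```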

2.3

A Due Diligence Framework for Data Sourcing

A systematic framework for data sourcing addresses four finance-specific failure modes: point-in-time violations, survivorship bias, corporate action errors, and identifier integrity breakdowns. The section details bitemporal storage and as-of query patterns for PIT correctness, quantifies survivorship bias using Monte Carlo simulation on 3,199 US equities (showing 63-109 percentage point distortions from missing delisted stocks), and provides vendor due diligence checklists spanning data quality, legal compliance, and technical reliability. The takeaway: data sourcing is risk management.

4 notebooks
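The as-of query pattern behind PIT correctness can be sketched without any database: key each reported value by the time it first became publicly available, and have lookups return the latest value whose availability timestamp does not exceed the query time. A pure-Python sketch (names and figures are illustrative):

```python
import bisect


def asof_lookup(events, query_time):
    """Return the most recent value available at query_time, or None.

    events: list of (available_at, value), sorted by available_at.
    Keying on availability time (not fiscal period end) is what prevents
    point-in-time violations: a fiscal-year figure filed in February must
    not be visible to a January backtest.
    """
    times = [t for t, _ in events]
    # Index just past the last event with available_at <= query_time.
    i = bisect.bisect_right(times, query_time)
    return events[i - 1][1] if i > 0 else None


# Hypothetical annual EPS: fiscal year ends in December, filed mid-February.
eps_events = [("2023-02-15", 4.10), ("2024-02-14", 4.85)]
print(asof_lookup(eps_events, "2024-01-31"))  # 4.1 (2024 filing not yet public)
```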

2.4

Storing Data

File formats (Parquet, HDF5, CSV) and database engines (DuckDB, kdb+, ClickHouse, QuestDB, TimescaleDB, InfluxDB) are benchmarked for financial data workflows, measuring file size, write speed, read speed, and ASOF join performance. Parquet emerges as the recommended default for research (3.4x compression vs CSV with fast columnar reads), DuckDB provides SQL analytics over Parquet files without server overhead, and Polars delivers the fastest in-memory ASOF joins (3.8x faster than pandas). The reader gets a decision matrix mapping objectives (research velocity, production reliability, extreme throughput) to recommended storage stacks.

5 notebooks
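The decision matrix can be expressed as a simple lookup. The research-velocity row restates the section's own recommendations (Parquet, DuckDB, Polars); the other two rows are assumptions sketched from the engines the section benchmarks, and the dictionary structure itself is illustrative:

```python
# Objective -> storage stack, loosely following Section 2.4's benchmarks.
STORAGE_MATRIX = {
    "research velocity": {
        "file_format": "Parquet",   # ~3.4x compression vs CSV, fast columnar reads
        "query_engine": "DuckDB",   # SQL over Parquet files, no server overhead
        "in_memory": "Polars",      # fastest ASOF joins (~3.8x vs pandas)
    },
    "production reliability": {     # assumption: pairing not specified in the section
        "file_format": "Parquet",
        "query_engine": "server-based engine (e.g. ClickHouse, TimescaleDB)",
    },
    "extreme throughput": {         # assumption: pairing not specified in the section
        "query_engine": "specialized tick store (e.g. kdb+, QuestDB)",
    },
}


def recommend(objective: str) -> dict:
    """Look up the stack for an objective; raises KeyError if unknown."""
    return STORAGE_MATRIX[objective]


print(recommend("research velocity")["query_engine"])  # DuckDB
```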