Distinguish among market, fundamental, and alternative data, and explain how dataset definitions shape what each source means in research and trading applications
Compare the observability, conventions, and engineering constraints of major asset classes, and identify how market structure changes what can be measured and modeled
Apply a financial data quality framework to diagnose common failure modes, especially point-in-time violations, survivorship bias, corporate action errors, and identifier mismatches
Conduct vendor due diligence across data quality, legal and compliance, and technical and commercial dimensions
Choose storage and query architectures that fit research and production needs, including when to use partitioned files, embedded analytical databases, or server-based systems
2.1 A Modern Taxonomy of Financial Data
Financial data is classified into market data, fundamental data, and alternative data, with the emphasis that every dataset embeds hidden definitions that must be understood before modeling: timestamp conventions, adjustment policies, identifier schemes, and revision rules. Market data is presented as a hierarchy from raw event streams to aggregated bars; fundamental data is characterized by release lags and institutional rules that vary by asset class; and alternative data carries a high validation burden around coverage stability, reproducibility, and usage rights. The reader learns to lock down four items before any research: timestamps, corporate action adjustments, identifiers, and revision policy. Accompanied by **asset-class EDA notebooks** (Sections 2.1-2.2).
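The four definitions the section tells you to lock down can be recorded up front as a lightweight contract per dataset. A minimal sketch, assuming nothing about any library: the `DatasetContract` class and all field values below are illustrative only.

```python
from dataclasses import dataclass

# Hypothetical container for the four definitions to pin down before any
# research; field names and values are illustrative, not a library API.
@dataclass(frozen=True)
class DatasetContract:
    timestamp_convention: str    # e.g. "daily bars labeled at exchange close"
    adjustment_policy: str       # e.g. "backward-adjusted for splits/dividends"
    identifier_scheme: str       # e.g. "ticker as observed, no remapping"
    revision_policy: str         # e.g. "static snapshot, no restatements"

# Example entry for a daily US equities feed (values are assumptions).
contract = DatasetContract(
    timestamp_convention="daily bars, exchange close, US/Eastern",
    adjustment_policy="unadjusted prices plus split/dividend event table",
    identifier_scheme="ticker at time of observation",
    revision_policy="static snapshot, no restatements",
)
```

Writing these four answers down before modeling makes silent assumptions explicit and reviewable.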
2.2 The Asset-Class Market Data Landscape
Eight asset classes are surveyed (equities, ETPs, futures, options, digital assets, FX, fixed income, and commodities), documenting observability levels, key failure modes, and the engineering decisions required for each. The section provides concrete data on market sizes, explains how market structure determines what data is available and how informative it is, and identifies class-specific pitfalls: corporate action discontinuities for equities, roll rules for futures, surface construction for options, venue integrity screening for crypto, and close-convention ambiguity for FX. Accompanied by 12 **asset-class engineering notebooks**.
2.3 A Due Diligence Framework for Data Sourcing
A systematic framework for data sourcing addresses four finance-specific failure modes: point-in-time violations, survivorship bias, corporate action errors, and identifier integrity breakdowns. The section details bitemporal storage and as-of query patterns for PIT correctness, quantifies survivorship bias using Monte Carlo simulation on 3,199 US equities (showing 63-109 percentage point distortions from missing delisted stocks), and provides vendor due diligence checklists spanning data quality, legal compliance, and technical reliability. The takeaway: data sourcing is risk management.
4 notebooks
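The as-of query pattern at the heart of PIT correctness can be sketched in a few lines: store each revision of a fact with the date it became known, and answer every query against that knowledge timeline. A toy illustration (the `as_of` helper and the EPS figures are hypothetical, not the section's bitemporal implementation):

```python
from bisect import bisect_right
from datetime import date

# Bitemporal history for one fact (Q1 EPS): each revision records WHEN the
# vendor knew it (knowledge date), not just the period it describes.
history = [                      # (knowledge_date, value), sorted by date
    (date(2024, 4, 15), 1.20),   # first release
    (date(2024, 5, 10), 1.25),   # later restatement
]

def as_of(history, when):
    """Return the value as it was known on `when`, or None if unreleased."""
    i = bisect_right([k for k, _ in history], when)
    return history[i - 1][1] if i else None

# A backtest dated April 20 must see 1.20, never the later restatement.
april_view = as_of(history, date(2024, 4, 20))
june_view = as_of(history, date(2024, 6, 1))
```

Querying only by period-end date, without the knowledge dimension, silently leaks the restated 1.25 into April decisions.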
2.4 Storing Data
File formats (Parquet, HDF5, CSV) and database engines (DuckDB, kdb+, ClickHouse, QuestDB, TimescaleDB, InfluxDB) are benchmarked for financial data workflows, measuring file size, write speed, read speed, and ASOF join performance. Parquet emerges as the recommended default for research (3.4x compression vs CSV with fast columnar reads), DuckDB provides SQL analytics over Parquet files without server overhead, and Polars delivers the fastest in-memory ASOF joins (3.8x faster than pandas). The reader gets a decision matrix mapping objectives (research velocity, production reliability, extreme throughput) to recommended storage stacks.
5 notebooks
01 US Equities EDA
This notebook introduces the Wiki Prices dataset, a survivorship-bias-free collection of US equity prices. Understanding survivorship bias is critical for realistic backtesting.
02 Corporate Actions
This notebook demonstrates how stock splits and dividends break historical price series, and shows the industry-standard backward adjustment methodology used by major data vendors. Correctly adjusting for corporate actions is essential for any ML model using return-based features.
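The backward-adjustment idea can be sketched directly in pandas: scale every price before an event by the product of all later event ratios, leaving the most recent price untouched. A toy example with a single 2-for-1 split (prices, dates, and column names are invented, not the notebook's data):

```python
import pandas as pd

# Toy unadjusted closes; a 2-for-1 split takes effect 2020-06-16, so the
# close drops from ~104 to ~52 without any economic loss to holders.
close = pd.Series(
    [100.0, 104.0, 52.0, 52.5, 51.0],
    index=pd.to_datetime(
        ["2020-06-12", "2020-06-15", "2020-06-16", "2020-06-17", "2020-06-18"]
    ),
    name="close",
)
# Per-day price multiplier: 0.5 on the split's effective date, 1.0 elsewhere.
ratio = pd.Series(1.0, index=close.index)
ratio.loc["2020-06-16"] = 0.5

# Backward adjustment: multiply each historical price by the product of all
# LATER event ratios, so today's price stays as quoted.
adj_factor = ratio[::-1].cumprod()[::-1].shift(-1, fill_value=1.0)
adj_close = close * adj_factor
```

Unadjusted, the series shows a phantom -50% return on the split date; adjusted, the split-day return is flat, which is what a return-based feature should see.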
03 ETFs EDA
This notebook introduces the 50-ETF universe that serves as the foundation for the ETF Rotational Momentum case study throughout the book. We explore the schema, coverage, categories, and data quality characteristics.
04 CME Futures EDA
This notebook introduces the CME futures dataset shipped with the book. It demonstrates the data structure, coverage, and key concepts for working with futures data.
05 Futures Session Aggregation
This notebook converts hourly continuous futures data (stored in UTC) to session-aware daily bars. CME futures sessions end at 4:00 PM Central Time, so daily bars must respect this boundary—not midnight UTC.
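One way to implement the session boundary is to convert to Central Time and shift bar-end stamps by seven hours, so any bar ending after 16:00 CT rolls past midnight into the next session date. A sketch with synthetic hourly bars (assumes bars are labeled by their end time; the notebook's actual convention and data may differ):

```python
import pandas as pd

# Synthetic hourly bars stored in UTC; timestamps label the bar END.
# 2024-03-04 23:00 UTC is 17:00 Central (CST), i.e. the session open.
idx = pd.date_range("2024-03-04 23:00", periods=24, freq="h", tz="UTC")
bars = pd.DataFrame({"close": range(1, 25)}, index=idx)

# Convert to Central Time, the clock CME sessions are defined in.
ct = bars.tz_convert("America/Chicago")

# A session ends at 16:00 CT, so bars ending 17:00 CT or later belong to
# the NEXT session date; a +7h shift pushes them past midnight.
session_date = (ct.index + pd.Timedelta(hours=7)).normalize()
daily = ct.groupby(session_date)["close"].agg(["first", "last", "count"])
```

All 24 bars from 17:00 CT Monday through 16:00 CT Tuesday collapse into one Tuesday session bar; grouping on midnight UTC would have split them across two rows.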
06 Futures Continuous
This notebook tackles one of the most critical challenges in futures analysis: creating continuous price series from individual expiring contracts. We implement roll detection algorithms and adjustment methods (Panama, ratio) to eliminate artificial price gaps while preserving accurate return characteristics.
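The Panama (difference) method can be sketched in a few lines: measure the front/back price gap on the roll date and shift all pre-roll prices by it. A toy two-contract example (prices and dates are invented; the notebook's roll detection is more involved):

```python
import pandas as pd

# Toy roll: the front contract is held through 2024-01-03, when the back
# contract trades 0.75 higher.
front = pd.Series([100.0, 101.0],
                  index=pd.to_datetime(["2024-01-02", "2024-01-03"]))
back = pd.Series([101.75, 102.5, 103.0],
                 index=pd.to_datetime(["2024-01-03", "2024-01-04", "2024-01-05"]))

# Panama (difference) adjustment: add the roll-date gap to every pre-roll
# price so the spliced series shows no phantom jump at the roll.
roll_date = front.index[-1]
gap = back.loc[roll_date] - front.loc[roll_date]
continuous = pd.concat([front[:-1] + gap, back])
```

The splice preserves point-for-point price differences (and hence dollar P&L), at the cost of distorting percentage returns far back in history; the ratio method makes the opposite trade-off.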
07 S&P 500 Options EDA
This notebook provides a comprehensive exploration of the AlgoSeek S&P 500 Options Analytics dataset. Options data is fundamentally different from spot market data—it contains forward-looking information about expected volatility, directional sentiment, and tail risk that isn't directly observable in underlying prices.
08 Options Greeks Computation
This notebook derives and implements the Black-Scholes option pricing framework from first principles. We compute implied volatility and all Greeks, then validate our calculations against the pre-computed values in the AlgoSeek options data.
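A minimal version of the pricing step, using only the standard library's error function rather than SciPy. This is a sketch of the textbook formula for a non-dividend-paying underlying, not the notebook's implementation:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function (no SciPy needed)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S: float, K: float, T: float, r: float, sigma: float):
    """Black-Scholes European call price and delta, no dividends."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    price = S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
    delta = norm_cdf(d1)  # dPrice/dS, the first Greek
    return price, delta

# At-the-money one-year call: a standard textbook check case.
price, delta = bs_call(S=100, K=100, T=1.0, r=0.05, sigma=0.2)
```

Implied volatility is the inverse problem: hold `price` fixed and solve for `sigma`, typically with a root finder, which is the cross-check the notebook runs against the vendor's precomputed values.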
09 Options Continuous
Options are time-decaying instruments. Unlike equities or futures, an option's price reflects both the value of the underlying exposure and the remaining time to expiration.
10 Crypto Perps EDA
This notebook introduces the cryptocurrency dataset from Binance Futures. We explore hourly OHLCV data and the Premium Index (perpetual futures vs spot spread) that forms the basis for the Crypto Premium Arbitrage case study.
11 Crypto Premium Analysis
This notebook demonstrates how to work with Binance perpetual futures premium index data, the foundation for funding rate arbitrage strategies. We load, explore, and analyze premium dynamics across major cryptocurrencies to identify potential arbitrage opportunities.
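The core quantity, the relative spread of the perpetual over spot, can be computed in one line. This is a simplified stand-in for Binance's actual premium index, which is built from impact bid/ask prices rather than closes; the series below is invented:

```python
import pandas as pd

# Hypothetical hourly perp and spot closes for one symbol.
perp = pd.Series([50100.0, 50250.0, 49900.0])
spot = pd.Series([50000.0, 50000.0, 50000.0])

# Simplified premium index: relative spread of the perpetual over spot.
# Persistently positive values mean longs pay shorts via funding.
premium = (perp - spot) / spot
mean_premium = premium.mean()
```

Funding-rate arbitrage strategies watch for the premium to sit persistently on one side of zero, then take the spot-vs-perp position that collects the funding payments.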
12 FX Pairs EDA
This notebook introduces the FX dataset from OANDA. FX markets are OTC with no centralized exchange, so prices aggregate from multiple liquidity providers.
13 Data Quality Framework
The ml4t-data library provides purpose-built tools for financial data quality. This notebook demonstrates the complete data quality workflow, using the us_equities dataset.
14 Point-in-Time Validation
Point-in-time correctness is essential for valid backtesting. Using information that wasn't available at decision time creates lookahead bias, inflating backtest results relative to what the strategy would earn in live trading.
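A common way to enforce this in pandas is an as-of join: each decision timestamp sees only the latest value released at or before it. A toy illustration (the dates, column names, and EPS value are invented):

```python
import pandas as pd

# Trade decisions on the 10th and 20th; an earnings figure is RELEASED on
# the 15th even though it covers Q1. A naive period-date join leaks it back.
decisions = pd.DataFrame({"asof": pd.to_datetime(["2024-04-10", "2024-04-20"])})
fundamentals = pd.DataFrame({
    "released": pd.to_datetime(["2024-04-15"]),
    "eps": [1.25],
})

# As-of join (backward direction, the default): each decision row gets the
# most recent release at or before its timestamp, or NaN if none exists.
pit = pd.merge_asof(decisions, fundamentals,
                    left_on="asof", right_on="released")
```

The April 10 decision correctly sees no EPS at all, while the April 20 decision sees 1.25; a plain merge on period dates would have given both rows the number.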
15 Survivorship Bias Detection
Survivorship bias is arguably the most dangerous form of data contamination in quantitative finance. This notebook uses real historical data from the US equities dataset (originally Quandl WIKI) to demonstrate, detect, and quantify survivorship bias.
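The direction of the bias is easy to demonstrate with a tiny simulation: drop the worst performers from a simulated universe and the surviving sample's mean return rises by construction. The parameters below are arbitrary illustrations, not the chapter's 3,199-stock Monte Carlo study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate total returns for 1,000 stocks: most muddle along, a left tail
# fails badly. "Delisted" names are the worst performers, which a
# survivorship-biased dataset silently drops.
returns = rng.normal(loc=0.06, scale=0.30, size=1_000)
delisted = returns < np.quantile(returns, 0.15)   # bottom 15% disappear

full_universe_mean = returns.mean()
survivors_mean = returns[~delisted].mean()
bias = survivors_mean - full_universe_mean        # positive by construction
```

Any backtest run on the survivors-only sample inherits this upward shift in every cross-sectional statistic, which is why the chapter treats survivor-free data as non-negotiable.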
16 Provider Comparison
ML4T Third Edition - Chapter 2: The Financial Data...
17 Complete Pipeline
This notebook demonstrates end-to-end data pipelines, bringing together concepts from this chapter, using the crypto_perps and wiki_provider datasets.
18 Data Management
Previous notebooks fetched and validated data. This notebook shows how to manage it at scale using ml4t-data's production features, with the universe dataset.
19 Incremental Updates
The previous notebook introduced DataManager and HiveStorage. This notebook focuses on the update workflow, the core reason ml4t-data exists, using the all and treasury_yields datasets.
20 Storage Benchmark File
Focus: pure file-format comparison (no query engines). Technologies: CSV, Parquet, Feather (Arrow IPC), HDF5. Operations: write, read (with forced materialization), columnar...
21 Storage Benchmark Database
> Docker required: This notebook depends on the benchmark environment and database services.
Edwin J. Elton et al. (1996) — The Review of Financial Studies · 563 citations
By tracking every mutual fund from 1976 through 1993, this paper quantifies survivorship bias at approximately 90 basis points per year, revealing that ignoring failed funds significantly inflates historical performance estimates.
Tyler Shumway (1997) — The Journal of Finance · 1105 citations
This paper identifies a significant delisting bias in the CRSP database due to missing delisting returns for stocks delisted for negative reasons, leading to overstated portfolio returns.
Mark M. Carhart et al. (2002) — The Review of Financial Studies · 166 citations
Survivorship bias is not constant but increases with sample length, inflating annual returns by up to 1% in samples longer than 15 years and distorting factor analysis.
William Beaver et al. (2007) — Journal of Accounting and Economics · 233 citations
This paper examines how the inclusion or exclusion of delisting returns affects the performance of trading strategies based on accounting anomalies, finding that the impact varies depending on the specific anomaly.
Seven Sins of Quantitative Investing
Yin Luo et al. (2014)
A comprehensive empirical audit of seven common backtesting biases, demonstrating how errors in data handling (survivorship, look-ahead) and modeling (outliers, signal decay) can invert strategy performance from profitable to disastrous.
A seminal overview of how high-frequency trading has rendered traditional microstructure metrics (like trade direction and realized spreads) obsolete, requiring new approaches to data analysis and execution.
Validates that daily low-frequency data can accurately proxy FX transaction costs and demonstrates that liquidity evaporates globally when funding constraints (TED spread) and volatility (VIX) rise.
Kingsley Y L Fong et al. (2017) — Review of Finance · 434 citations
Using 19 years of global intraday trades/quotes as ground truth, the paper identifies which daily-data liquidity proxies best approximate spreads and price impact—finding Closing Percent Quoted Spread (CPQS) is best for percent-cost, and Amihud/impact-style proxies are best for Kyle’s lambda correlations but not its level.
Advances in Financial Machine Learning
Marcos Lopez de Prado (2018) — John Wiley & Sons · 106 citations
John Lehoczky and Mark Schervish (2018) — Annual Review of Statistics and Its Application · 13 citations
A historically organized survey of how equity-market data, statistical models, and trading strategies evolved from pre-CRSP fundamentals/technical analysis to statistical arbitrage, machine learning, and today’s electronic limit-order-book/HFT markets.
Tim Loughran and Bill McDonald (2020) — Annual Review of Financial Economics · 169 citations
A practitioner-oriented review of how finance uses text (social media, politics, fraud) that argues “readability” metrics like the Fog Index are mis-specified for 10-Ks and should be replaced by text-based measures of firm complexity.
Gene Ekster and Petter N. Kolm (2020) · 5 citations
A comprehensive guide to the alternative data ecosystem, detailing a specific preprocessing pipeline (entity tagging, stabilization, debiasing) that reduces revenue prediction error from 88% to 2.6%.
Alex Lipton and Marcos Lopez de Prado (2020) · 5 citations
This note argues that COVID-19 exposed structural weaknesses in many quant workflows—overreliance on long-horizon forecasting, backtest-driven “discoveries,” and belief in all-weather alphas—and proposes nowcasting, theory-first research, and regime-adaptive ensembles as remedies.
A comprehensive regulatory report detailing the dominance of algorithmic trading in U.S. markets, concluding that while it improves liquidity and efficiency in normal conditions, it introduces operational risks and potential instability during stress events.
Igor Makarov and Antoinette Schoar (2020) — Journal of Financial Economics · 597 citations
This paper documents massive, persistent arbitrage spreads in cryptocurrency markets driven by capital controls and identifies that a single common order flow factor explains 80% of Bitcoin returns.
David Easley et al. (2021) — The Review of Financial Studies · 68 citations
Machine learning demonstrates that traditional microstructure measures (specifically VPIN and Roll) retain significant out-of-sample predictive power for volatility and liquidity in modern high-frequency markets, especially when using dollar-volume bars and cross-asset features.
Florian Berg et al. (2022) — Review of Finance · 1489 citations
ESG ratings from major providers exhibit low correlation (average 0.54), driven primarily by disagreements on how to measure specific attributes (56% of divergence) rather than what attributes to include or how to weight them.
David Vidal-Tomás (2022) — International Review of Financial Analysis · 93 citations
This paper finds that major cryptocurrency coin-ranking sites like Coinmarketcap and Coingecko provide data with the same underlying statistical properties as direct exchange data for liquid cryptocurrencies, making them suitable for research despite concerns about data aggregation.
Lin William Cong et al. (2023) — Management Science · 124 citations
A systematic forensic analysis of cryptocurrency exchanges revealing that over 70% of reported volume on unregulated platforms is wash trading, detected via statistical anomalies in trade size distributions.
This paper derives the theoretical no-arbitrage price for cryptocurrency perpetual futures and demonstrates that a threshold-based trading strategy exploiting deviations from this price yields Sharpe ratios between 1.8 and 3.5.
A practitioner-focused guide to three backtest frameworks (walk-forward, resampling, Monte Carlo) and to avoiding common biases—especially selection bias from multiple testing that inflates Sharpe ratios.
Gueorgui S. Konstantinov (2025) — The Journal of Portfolio Management
A framework for transforming currency from a passive hedging byproduct into an active alpha source by integrating carry, value, trend, and volatility styles via risk budgeting.
B. Espen Eckbo and Markus Lithell (2025) — Journal of Financial and Quantitative Analysis · 3 citations
This paper argues that the decline in U.S. stock market listings after 1996 is overstated because it doesn't account for firms that remain under public ownership after being acquired by publicly listed companies, and that once merger activity is considered, the U.S. exhibits a listing advantage due to its efficient market for mergers.
This paper provides the first empirical evidence of cross-market price discovery in modern prediction markets, finding significant arbitrage opportunities and demonstrating that Polymarket leads Kalshi in information aggregation, driven by liquidity and informed 'whale' trades.
Understanding the Market for U.S. Equity Market Data
Charles M. Jones
An economic analysis arguing that U.S. equity market data fees are stable, competitively priced, and constitute a negligible fraction of total industry costs compared to commissions and third-party vendor fees.