ML4T Book Guide¶
This guide maps ml4t-data to the current
Machine Learning for Trading, Third Edition
codebase so you can move cleanly between the book's pedagogical examples and the
library's reusable APIs.
How to Read This Guide¶
The book uses ml4t-data in three distinct ways:
- provider-level exploration inside chapter notebooks and companion scripts
- library-level workflows built around
DataManager, storage, validation, and updates - dataset download scripts under
code/data/that package canonical book data into repeatable pipelines
If you are starting from the book, the fastest path is:
- follow the chapter script that introduces the concept
- identify the corresponding
ml4t-dataclass or provider - reuse that API in your own config-driven workflow
Book Entry Points¶
| Goal | Library surface | Book path |
|---|---|---|
| Download the canonical datasets used across the book | ETFDataManager, CryptoDataManager, MacroDataManager, FuturesDataManager |
code/data/download_all.py |
| Learn provider-first data exploration | provider classes such as YahooFinanceProvider, BinanceBulkProvider, FREDProvider |
code/data/etfs/notebook.py, code/data/crypto/notebook.py, code/data/macro/notebook.py |
| Build a reusable storage-backed data workflow | DataManager, HiveStorage, Universe |
code/02_financial_data_universe/18_data_management.py |
| Understand updates, validation, and health checks | DataManager.update(), GapDetector, OHLCVValidator |
code/02_financial_data_universe/19_incremental_updates.py |
| See the end-to-end Chapter 2 workflow | providers plus storage and validation APIs | code/02_financial_data_universe/17_complete_pipeline.py |
Chapter-to-Feature Mapping¶
Chapter 2: Financial Data Universe¶
This is the deepest ml4t-data integration point in the book.
| Book path | What it teaches | ml4t-data surface |
|---|---|---|
| 02_corporate_actions.py | price adjustment workflows | apply_corporate_actions, YahooFinanceProvider |
| 16_provider_comparison.py | source tradeoffs across free and paid providers | YahooFinanceProvider, WikiPricesProvider, EODHDProvider, FREDProvider |
| 17_complete_pipeline.py | complete data pipeline | DataManager, HiveStorage, Universe, GapDetector, OHLCVValidator |
| 18_data_management.py | storage-backed data management | DataManager, HiveStorage, Universe |
| 19_incremental_updates.py | refreshing existing datasets safely | DataManager.update(), GapDetector, OHLCVValidator |
| 13_data_quality_framework.py | structural validation and anomaly detection | OHLCVValidator, AnomalyManager |
Chapter 4: Fundamental and Alternative Data¶
This chapter leans on provider adapters more than on DataManager.
| Book path | Provider(s) | Notes |
|---|---|---|
| 08_fred_macro_eda.py | FREDProvider |
Macro and rate series |
| 09_macro_data_alignment.py | FREDProvider |
Vintage-aware macro discussion |
| 11_onchain_fundamentals.py | CoinGeckoProvider |
Crypto prices paired with on-chain metrics |
| 13_kalshi_prediction_markets.py | KalshiProvider |
Event-contract OHLCV |
| 14_polymarket_prediction_markets.py | PolymarketProvider |
Prediction market history and snapshots |
Chapters 17 and 19: Factors, Allocation, and Risk¶
These chapters use ml4t-data for factor datasets and risk-free inputs.
| Book path | Provider(s) | Notes |
|---|---|---|
| 17_portfolio_construction/02_mean_variance_optimization.py | FREDProvider |
Risk-free rate loading |
| 17_portfolio_construction/05_factor_allocation_evidence.py | FamaFrenchProvider, AQRFactorProvider |
Long-history factor evidence |
| 17_portfolio_construction/08_library_comparison.py | FREDProvider |
Comparison workflows with real rates |
| 19_risk_management/04_factor_exposure.py | FamaFrenchProvider |
Factor loading and exposure analysis |
Canonical Download Scripts¶
The code/data/ directory is where book examples become reusable download
pipelines. These are the most direct examples of how the library is intended to
be used outside a notebook.
| Script | ml4t-data surface |
Dataset role |
|---|---|---|
| code/data/etfs/download.py | ETFDataManager |
Core ETF and benchmark history |
| code/data/crypto/download.py | BinanceBulkProvider, CryptoDataManager |
Spot and futures crypto data |
| code/data/macro/download.py | FREDProvider, MacroDataManager |
Treasury and macro series |
| code/data/factors/ff_download.py | FamaFrenchProvider |
Factor benchmarks |
| code/data/factors/aqr_download.py | AQRFactorProvider |
Extended factor datasets |
| code/data/prediction_markets/download.py | KalshiProvider, PolymarketProvider |
Alternative market data |
From Notebook Code to Reusable Workflows¶
The book often introduces a concept manually before the library wraps it in a more reusable API. A few common transitions:
| In the book | In the library | Why it matters |
|---|---|---|
| provider calls inside one notebook | DataManager.fetch() |
consistent routing and output handling |
| ad hoc lists of symbols | Universe |
reusable universes and custom symbol sets |
| one-off Parquet writes | HiveStorage |
partition pruning, metadata, and update support |
| manual gap checks | GapDetector and DataManager.update() |
safer recurring refreshes |
| exploratory validation logic | OHLCVValidator and AnomalyManager |
repeatable data-quality gates |
| chapter download helpers | update-all plus manager classes |
automation outside the notebook |
Suggested Reader Journey¶
If you want the shortest path from the book to production-style code:
- Read 17_complete_pipeline.py.
- Move to 18_data_management.py for
DataManager,Universe, and storage. - Read 19_incremental_updates.py before setting up recurring refreshes.
- Use code/data/download_all.py as the reference shape for your own dataset automation.