ML4T Data¶
Fetch, validate, store, and refresh market datasets without rebuilding ingestion code for each provider.
ml4t-data is the data layer in the ML4T stack. It gives you provider adapters,
storage, validation, and update workflows for the datasets you use in research
and model development. Start with the Quickstart
for a first result, move to the User Guide for recurring
data workflows, use the API Reference for exact interfaces, and
follow the Book Guide if you are working from Machine
Learning for Trading, Third Edition.
Chapter 2 of the book builds data pipelines step by step in notebooks. This library packages those ideas as reusable functions: provider adapters, storage backends, validation checks, and incremental updates. See the Book Guide to map notebook code to library calls.
- Get Data in 3 Lines: No API keys for basic usage. Fetch OHLCV or factor data directly from Yahoo Finance, CoinGecko, Fama-French, FRED, and other free sources.
- 19 Provider Adapters: Equities, crypto, forex, futures, macro, prediction markets, and academic factors. Each adapter handles source-specific auth, pagination, and schema normalization.
- Validated and Resumable: Run OHLC validation, detect gaps, and update stored datasets without re-fetching completed history. Resume interrupted refresh jobs with metadata-backed incremental updates.
- Built for the Book: Reference implementation for the data workflows used in Chapters 2, 4, and 17-19. Use the same provider classes and managers that power the book code.
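To picture what schema normalization involves, here is a minimal sketch using only the standard library. It is not the library's adapter code: the source names, the `FIELD_MAPS` table, and the `normalize` helper are hypothetical. Only the target column order (timestamp, symbol, open, high, low, close, volume) is taken from the Quick Example output on this page.

```python
from datetime import datetime, timezone

# Target column order, as shown in the Quick Example output on this page.
CANONICAL = ("timestamp", "symbol", "open", "high", "low", "close", "volume")

# Hypothetical per-source field names; real providers differ.
FIELD_MAPS = {
    "source_a": {"Date": "timestamp", "Open": "open", "High": "high",
                 "Low": "low", "Close": "close", "Volume": "volume"},
    "source_b": {"t": "timestamp", "o": "open", "h": "high",
                 "l": "low", "c": "close", "v": "volume"},
}


def normalize(row: dict, source: str, symbol: str) -> dict:
    """Rename source fields and coerce types to one canonical record."""
    mapping = FIELD_MAPS[source]
    renamed = {mapping[k]: v for k, v in row.items() if k in mapping}

    ts = renamed["timestamp"]
    if isinstance(ts, (int, float)):
        # Assumption: numeric timestamps are Unix epoch seconds.
        ts = datetime.fromtimestamp(ts, tz=timezone.utc)
    else:
        # Assumption: string timestamps are ISO dates already in UTC.
        ts = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)

    out = {"timestamp": ts, "symbol": symbol}
    out.update({name: float(renamed[name]) for name in CANONICAL[2:]})
    return out


a = normalize({"Date": "2024-01-02", "Open": "100", "High": "102",
               "Low": "99", "Close": "101", "Volume": "1000"}, "source_a", "XYZ")
b = normalize({"t": 1704153600, "o": 100, "h": 102,
               "l": 99, "c": 101, "v": 1000}, "source_b", "XYZ")
print(a == b)  # both sources yield the same canonical record
```

Once every source maps to the same record shape, storage and validation code can be written once instead of once per provider.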
Quick Example¶
```python
from ml4t.data.providers import YahooFinanceProvider

provider = YahooFinanceProvider()
df = provider.fetch_ohlcv("AAPL", "2024-01-01", "2024-12-31")
print(df.head())
# shape: (5, 7)
# ┌─────────────────────┬────────┬────────┬────────┬────────┬────────┬──────────┐
# │ timestamp           ┆ symbol ┆ open   ┆ high   ┆ low    ┆ close  ┆ volume   │
# │ ---                 ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---      │
# │ datetime[μs, UTC]   ┆ str    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64      │
# ╞═════════════════════╪════════╪════════╪════════╪════════╪════════╪══════════╡
# │ 2024-01-02 00:00:00 ┆ AAPL   ┆ 187.15 ┆ 188.44 ┆ 183.89 ┆ 185.64 ┆ 82488700 │
# └─────────────────────┴────────┴────────┴────────┴────────┴────────┴──────────┘
```
Core Workflows¶
Fetch Multiple Symbols¶
```python
import asyncio

from ml4t.data.managers.async_batch import async_batch_load
from ml4t.data.providers import YahooFinanceProvider


async def fetch_universe():
    async with YahooFinanceProvider() as provider:
        return await async_batch_load(
            provider,
            symbols=["AAPL", "MSFT", "GOOGL", "AMZN", "META"],
            start="2024-01-01",
            end="2024-12-31",
            max_concurrent=10,
        )


df = asyncio.run(fetch_universe())
print(df["symbol"].n_unique())  # 5
```
Use this when you need to fetch many symbols through a single call and want failures, rate limits, and retries handled consistently by the provider layer.
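A cap such as `max_concurrent` is commonly implemented with a semaphore. The sketch below shows that general pattern with the standard library only. It is not the implementation of `async_batch_load`: `fetch_one` and `bounded_gather` are hypothetical stand-ins, and the simulated failure for the symbol "BAD" exists only to show how errors can be collected per symbol.

```python
import asyncio


async def fetch_one(symbol: str) -> dict:
    """Hypothetical stand-in for a single provider request."""
    await asyncio.sleep(0.01)  # simulate network latency
    if symbol == "BAD":
        raise ValueError(f"no data for {symbol}")
    return {"symbol": symbol}


async def bounded_gather(symbols, max_concurrent=10):
    """Fetch all symbols with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(symbol):
        async with semaphore:  # wait here while the cap is reached
            return await fetch_one(symbol)

    # return_exceptions=True returns failures as results instead of
    # raising the first one, so every symbol gets an outcome.
    results = await asyncio.gather(
        *(guarded(s) for s in symbols), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = {s: r for s, r in zip(symbols, results) if isinstance(r, Exception)}
    return ok, failed


ok, failed = asyncio.run(bounded_gather(["AAPL", "MSFT", "BAD"], max_concurrent=2))
print([r["symbol"] for r in ok])  # successful symbols, in input order
print(list(failed))               # symbols whose request raised
```

Collecting failures per symbol lets a batch job report or retry the missing symbols without discarding the data that did arrive.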
Keep Stored Data Current¶
```python
from pathlib import Path

from ml4t.data import DataManager
from ml4t.data.storage import HiveStorage
from ml4t.data.storage.backend import StorageConfig

manager = DataManager(storage=HiveStorage(StorageConfig(base_path=Path("./data"))))
manager.update("AAPL", provider="yahoo", asset_class="equities", frequency="daily")
```
This workflow re-fetches only the recent overlap window, merges new data into storage, and leaves update metadata you can inspect later.
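The overlap-and-merge idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the logic inside `DataManager.update`: the helpers `refetch_start`, `fake_fetch`, and `merge_bars` are hypothetical, the overlap length is arbitrary, the prices are made up, and the rule that freshly fetched bars replace stored ones inside the overlap is one plausible choice.

```python
from datetime import date, timedelta


def refetch_start(last_stored: date, overlap_days: int = 5) -> date:
    """Start of the window to re-request: a few days before the last stored bar."""
    return last_stored - timedelta(days=overlap_days)


def fake_fetch(start: date) -> dict:
    """Hypothetical provider call returning closes keyed by date (made-up values)."""
    source = {
        date(2024, 1, 2): 100.0,
        date(2024, 1, 3): 101.0,
        date(2024, 1, 4): 102.5,  # revised since it was first stored
        date(2024, 1, 5): 103.0,  # new bar
    }
    return {d: close for d, close in source.items() if d >= start}


def merge_bars(stored: dict, fetched: dict) -> dict:
    """Merge by date; fetched bars win inside the overlap window."""
    merged = {**stored, **fetched}
    return dict(sorted(merged.items()))


stored = {date(2024, 1, 2): 100.0, date(2024, 1, 3): 101.0, date(2024, 1, 4): 102.0}
start = refetch_start(max(stored), overlap_days=2)
merged = merge_bars(stored, fake_fetch(start))

print(start)                      # first date that is requested again
print(len(merged))                # stored bars plus the new one
print(merged[date(2024, 1, 4)])   # the revised value replaces the stored one
```

Re-requesting a short overlap instead of only the missing dates picks up late corrections to recent bars while still avoiding a full history download.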
Validate Data Before You Train¶
```python
from ml4t.data.validation import OHLCVValidator

# df is an OHLCV DataFrame, such as the one returned by the fetch examples above
validator = OHLCVValidator(max_return_threshold=0.5)
report = validator.validate(df)
print(report.passed)
print(report.errors[:3])
```
Run validation before feature engineering or backtesting so obviously bad bars, duplicates, or extreme-return anomalies do not leak into downstream workflows.
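The kinds of checks named above can be illustrated with a small standard-library sketch. It does not reproduce `OHLCVValidator`: the `check_bars` helper and the sample bars are hypothetical, and treating `max_return_threshold` as a limit on the absolute close-to-close return is an assumption based on the parameter name.

```python
def check_bars(bars, max_return_threshold=0.5):
    """Return a list of problems found in a time-ordered list of bar dicts."""
    errors = []
    seen = set()
    prev_close = None
    for bar in bars:
        ts = bar["timestamp"]
        if ts in seen:
            errors.append(f"{ts}: duplicate timestamp")
        seen.add(ts)
        # The high must be the largest price in the bar, the low the smallest.
        if bar["high"] < max(bar["open"], bar["close"], bar["low"]):
            errors.append(f"{ts}: high below open, close, or low")
        if bar["low"] > min(bar["open"], bar["close"]):
            errors.append(f"{ts}: low above open or close")
        # Flag bar-to-bar moves larger than the threshold (assumed meaning).
        if prev_close:
            ret = bar["close"] / prev_close - 1
            if abs(ret) > max_return_threshold:
                errors.append(f"{ts}: return {ret:+.0%} exceeds threshold")
        prev_close = bar["close"]
    return errors


# Made-up bars: one clean, one with a bad high, one duplicate, one price jump.
bars = [
    {"timestamp": "2024-01-02", "open": 100, "high": 102, "low": 99, "close": 101},
    {"timestamp": "2024-01-03", "open": 101, "high": 100, "low": 99, "close": 100.5},
    {"timestamp": "2024-01-03", "open": 101, "high": 103, "low": 100, "close": 102},
    {"timestamp": "2024-01-04", "open": 200, "high": 210, "low": 195, "close": 205},
]
errors = check_bars(bars)
for problem in errors:
    print(problem)
```

A large return is not always an error (unadjusted splits look the same as bad ticks), so flagged bars are best reviewed rather than dropped automatically.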
Provider Comparison¶
| Provider | Asset Class | Example Coverage | API Key | Best For |
|---|---|---|---|---|
| Yahoo Finance | Stocks, ETFs, Crypto | Broad US and global listed assets | No | First fetches, daily OHLCV |
| CoinGecko | Crypto | 10K+ coins and tokens | No | Free crypto historical data |
| Binance Bulk | Crypto | Spot and USD-M futures archive files | No | Bulk historical crypto downloads |
| EODHD | Global Stocks | 60+ exchanges | Yes | International equity coverage |
| FRED | Macro | Rates, inflation, labor, GDP series | No | Macro research inputs |
| DataBento | Futures, Options | Institutional derivatives feeds | Yes | Exchange-grade futures data |
| Fama-French | Factors | Core academic factor datasets | No | Asset pricing research |
Installation¶
Part of the ML4T Library Suite¶
ml4t-data prepares the datasets used by the other ML4T libraries:
- ml4t-data fetches, validates, stores, and refreshes market data
- ml4t-engineer builds labels and features from those datasets
- ml4t-diagnostic validates signals, models, and portfolio behavior
- ml4t-backtest simulates execution on the resulting price and signal data
- ml4t-live reuses the same strategy workflows in paper and live trading
If you are coming from the book, use the docs in this order:
- Start here for the practical workflows the library automates
- Use the Book Guide to map chapter code to library surfaces
- Move into the User Guide for storage, validation, and updates
- Finish in the API Reference when you need exact call signatures
Next Steps¶
- Run your first provider fetch and inspect the returned Polars DataFrame.
- Choose the right provider for your asset class, budget, and workflow.
- Learn storage, validation, incremental updates, and recurring dataset workflows.
- Match ML4T chapter code to reusable managers, providers, and scripts.