ML4T Data

Fetch, validate, store, and refresh market datasets without rebuilding ingestion code for each provider.

ml4t-data is the data layer in the ML4T stack. It gives you provider adapters, storage, validation, and update workflows for the datasets you use in research and model development. Start with the Quickstart for a first result, move to the User Guide for recurring data workflows, use the API Reference for exact interfaces, and follow the Book Guide if you are working from Machine Learning for Trading, Third Edition.

Chapter 2 of the book builds data pipelines step by step in notebooks. This library packages those ideas as reusable functions: provider adapters, storage backends, validation checks, and incremental updates. See the Book Guide to map notebook code to library calls.

Get Data in 3 Lines

No API keys for basic usage. Fetch OHLCV or factor data directly from Yahoo Finance, CoinGecko, Fama-French, FRED, and other free sources.

Quickstart

19 Provider Adapters

Equities, crypto, forex, futures, macro, prediction markets, and academic factors. Each adapter handles source-specific auth, pagination, and schema normalization.

Provider Guide

Validated and Resumable

Run OHLC validation, detect gaps, and update stored datasets without re-fetching completed history. Resume interrupted refresh jobs with metadata-backed incremental updates.

Data Quality

Built for the Book

Reference implementation for the data workflows used in Chapters 2, 4, and 17-19. Use the same provider classes and managers that power the book code.

Book Guide

Quick Example

from ml4t.data.providers import YahooFinanceProvider

provider = YahooFinanceProvider()
df = provider.fetch_ohlcv("AAPL", "2024-01-01", "2024-12-31")

print(df.head())
# shape: (5, 7)  (output truncated to the first row)
# ┌─────────────────────┬────────┬────────┬────────┬────────┬────────┬──────────┐
# │ timestamp           ┆ symbol ┆ open   ┆ high   ┆ low    ┆ close  ┆ volume   │
# │ ---                 ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---      │
# │ datetime[μs, UTC]   ┆ str    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64      │
# ╞═════════════════════╪════════╪════════╪════════╪════════╪════════╪══════════╡
# │ 2024-01-02 00:00:00 ┆ AAPL   ┆ 187.15 ┆ 188.44 ┆ 183.89 ┆ 185.64 ┆ 82488700 │
# └─────────────────────┴────────┴────────┴────────┴────────┴────────┴──────────┘

Core Workflows

Fetch Multiple Symbols

import asyncio
from ml4t.data.managers.async_batch import async_batch_load
from ml4t.data.providers import YahooFinanceProvider

async def fetch_universe():
    async with YahooFinanceProvider() as provider:
        return await async_batch_load(
            provider,
            symbols=["AAPL", "MSFT", "GOOGL", "AMZN", "META"],
            start="2024-01-01",
            end="2024-12-31",
            max_concurrent=10,
        )

df = asyncio.run(fetch_universe())
print(df["symbol"].n_unique())  # 5

Use this when you need to load many symbols through a single call and want failures, rate limits, and retries handled consistently by the provider layer.
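To make the role of a concurrency cap such as max_concurrent concrete, here is a minimal standard-library sketch of bounded concurrent fetching. It is not the implementation of async_batch_load, which this page does not show. The names fetch_one and bounded_gather are hypothetical, and the simulated request stands in for a real provider call.

```python
import asyncio


async def fetch_one(symbol: str) -> dict:
    """Hypothetical stand-in for a single provider request."""
    await asyncio.sleep(0.01)  # simulate network latency
    if symbol == "BAD":
        raise ValueError(f"unknown symbol: {symbol}")
    return {"symbol": symbol}


async def bounded_gather(symbols, max_concurrent=10):
    """Fetch every symbol with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(symbol):
        async with semaphore:  # blocks once the cap is reached
            return await fetch_one(symbol)

    # return_exceptions=True keeps one failure from cancelling the batch
    outcomes = await asyncio.gather(
        *(guarded(s) for s in symbols), return_exceptions=True
    )

    results, failures = [], {}
    for symbol, outcome in zip(symbols, outcomes):
        if isinstance(outcome, Exception):
            failures[symbol] = outcome
        else:
            results.append(outcome)
    return results, failures


results, failures = asyncio.run(
    bounded_gather(["AAPL", "MSFT", "BAD"], max_concurrent=2)
)
print([r["symbol"] for r in results], list(failures))
```

Collecting failures per symbol instead of raising on the first error lets a long batch finish, so you can retry only the symbols that failed.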

Keep Stored Data Current

from pathlib import Path
from ml4t.data import DataManager
from ml4t.data.storage import HiveStorage
from ml4t.data.storage.backend import StorageConfig

manager = DataManager(storage=HiveStorage(StorageConfig(base_path=Path("./data"))))
manager.update("AAPL", provider="yahoo", asset_class="equities", frequency="daily")

This workflow re-fetches only the recent overlap window, merges new data into storage, and leaves update metadata you can inspect later.
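The overlap-and-merge idea can be sketched in plain Python. This is an illustration under stated assumptions, not the logic inside DataManager.update. The helper names refetch_start and merge_bars, the overlap_days parameter, and the rule that fetched bars win on overlap are all hypothetical, and the prices are made up.

```python
from datetime import date, timedelta


def refetch_start(last_stored: date, overlap_days: int = 5) -> date:
    """Start of the re-fetch window: a few days before the last stored bar."""
    return last_stored - timedelta(days=overlap_days)


def merge_bars(stored: list[dict], fetched: list[dict]) -> list[dict]:
    """Merge by timestamp; freshly fetched bars replace stored ones on overlap."""
    by_day = {bar["timestamp"]: bar for bar in stored}
    by_day.update({bar["timestamp"]: bar for bar in fetched})
    return [by_day[day] for day in sorted(by_day)]


# Made-up values for illustration only
stored = [
    {"timestamp": date(2024, 1, 2), "close": 100.0},
    {"timestamp": date(2024, 1, 3), "close": 101.0},  # later revised upstream
]
fetched = [
    {"timestamp": date(2024, 1, 3), "close": 101.5},  # overlaps stored data
    {"timestamp": date(2024, 1, 4), "close": 102.0},  # new bar
]

merged = merge_bars(stored, fetched)
print(refetch_start(date(2024, 1, 3), overlap_days=1))
print([(bar["timestamp"].isoformat(), bar["close"]) for bar in merged])
```

Re-fetching a short overlap instead of starting exactly at the last stored bar lets late corrections from the source replace stale rows without reloading the full history.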

Validate Data Before You Train

from ml4t.data.validation import OHLCVValidator

validator = OHLCVValidator(max_return_threshold=0.5)
report = validator.validate(df)

print(report.passed)
print(report.errors[:3])

Run validation before feature engineering or backtesting so obviously bad bars, duplicates, or extreme-return anomalies do not leak into downstream workflows.
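As a rough picture of what checks like these involve, the sketch below flags duplicate timestamps, bars whose high or low is inconsistent with the other prices, and close-to-close returns above a threshold. It is a hand-rolled illustration, not the rule set of OHLCVValidator, which this page does not list. The function check_bars is hypothetical, and treating max_return_threshold as an absolute close-to-close return is an assumption.

```python
def check_bars(bars, max_return_threshold=0.5):
    """Return human-readable problems found in a list of OHLC bars."""
    errors = []
    seen = set()
    prev_close = None
    for i, bar in enumerate(bars):
        if bar["timestamp"] in seen:
            errors.append(f"row {i}: duplicate timestamp {bar['timestamp']}")
        seen.add(bar["timestamp"])

        # The high must be the largest price in the bar, the low the smallest
        if bar["high"] < max(bar["open"], bar["close"], bar["low"]):
            errors.append(f"row {i}: high below open/close/low")
        if bar["low"] > min(bar["open"], bar["close"], bar["high"]):
            errors.append(f"row {i}: low above open/close/high")

        # Flag implausibly large close-to-close moves
        if prev_close is not None and prev_close > 0:
            ret = bar["close"] / prev_close - 1
            if abs(ret) > max_return_threshold:
                errors.append(f"row {i}: return {ret:.0%} exceeds threshold")
        prev_close = bar["close"]
    return errors


# Made-up bars for illustration only
bars = [
    {"timestamp": "2024-01-02", "open": 100.0, "high": 102.0, "low": 99.0, "close": 101.0},
    {"timestamp": "2024-01-03", "open": 101.0, "high": 100.0, "low": 99.0, "close": 100.5},
    {"timestamp": "2024-01-03", "open": 100.0, "high": 205.0, "low": 100.0, "close": 201.0},
]

errors = check_bars(bars)
for message in errors:
    print(message)
```

Returning a list of messages rather than raising on the first problem mirrors the report-style usage above, where you inspect every finding before deciding whether to drop or repair rows.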

Provider Comparison

Provider	Asset Class	Example Coverage	API Key	Best For
Yahoo Finance	Stocks, ETFs, Crypto	Broad US and global listed assets	No	First fetches, daily OHLCV
CoinGecko	Crypto	10K+ coins and tokens	No	Free crypto historical data
Binance Bulk	Crypto	Spot and USD-M futures archive files	No	Bulk historical crypto downloads
EODHD	Global Stocks	60+ exchanges	Yes	International equity coverage
FRED	Macro	Rates, inflation, labor, GDP series	No	Macro research inputs
DataBento	Futures, Options	Institutional derivatives feeds	Yes	Exchange-grade futures data
Fama-French	Factors	Core academic factor datasets	No	Asset pricing research

Full provider comparison

Installation

pip

pip install ml4t-data

uv

uv add ml4t-data

With Optional Providers

pip install "ml4t-data[yahoo,databento]"
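To confirm that the install landed in the environment you expect, you can query the package metadata with the standard library. This is a small sketch. The helper installed_version is hypothetical, and it assumes the distribution name matches the one used in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str = "ml4t-data"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version())
```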

Part of the ML4T Library Suite

ml4t-data prepares the datasets used by the other ML4T libraries:

ml4t-data fetches, validates, stores, and refreshes market data
ml4t-engineer builds labels and features from those datasets
ml4t-diagnostic validates signals, models, and portfolio behavior
ml4t-backtest simulates execution on the resulting price and signal data
ml4t-live reuses the same strategy workflows in paper and live trading

If you are coming from the book, use the docs in this order:

Start here for the practical workflows the library automates
Use the Book Guide to map chapter code to library surfaces
Move into the User Guide for storage, validation, and updates
Finish in the API Reference when you need exact call signatures

Next Steps

Quickstart

Run your first provider fetch and inspect the returned Polars DataFrame.

Provider Selection

Choose the right provider for your asset class, budget, and workflow.

User Guide

Learn storage, validation, incremental updates, and recurring dataset workflows.

Book Guide

Match ML4T chapter code to reusable managers, providers, and scripts.