ML4T Data

Fetch, validate, store, and refresh market datasets without rebuilding ingestion code for each provider.

ml4t-data is the data layer in the ML4T stack. It gives you provider adapters, storage, validation, and update workflows for the datasets you use in research and model development. Start with the Quickstart for a first result, move to the User Guide for recurring data workflows, use the API Reference for exact interfaces, and follow the Book Guide if you are working from Machine Learning for Trading, Third Edition.

Chapter 2 of the book builds data pipelines step by step in notebooks. This library packages those ideas as reusable functions: provider adapters, storage backends, validation checks, and incremental updates. See the Book Guide to map notebook code to library calls.

  • Get Data in 3 Lines

    No API keys for basic usage. Fetch OHLCV or factor data directly from Yahoo Finance, CoinGecko, Fama-French, FRED, and other free sources.

    Quickstart

  • 19 Provider Adapters

    Equities, crypto, forex, futures, macro, prediction markets, and academic factors. Each adapter handles source-specific auth, pagination, and schema normalization.

    Provider Guide

  • Validated and Resumable

    Run OHLC validation, detect gaps, and update stored datasets without re-fetching completed history. Resume interrupted refresh jobs with metadata-backed incremental updates.

    Data Quality

  • Built for the Book

    Reference implementation for the data workflows used in Chapters 2, 4, and 17-19. Use the same provider classes and managers that power the book code.

    Book Guide

Quick Example

from ml4t.data.providers import YahooFinanceProvider

provider = YahooFinanceProvider()
df = provider.fetch_ohlcv("AAPL", "2024-01-01", "2024-12-31")

print(df.head())
# shape: (5, 7)
# ┌─────────────────────┬────────┬────────┬────────┬────────┬────────┬──────────┐
# │ timestamp           ┆ symbol ┆ open   ┆ high   ┆ low    ┆ close  ┆ volume   │
# │ ---                 ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---      │
# │ datetime[μs, UTC]   ┆ str    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64      │
# ╞═════════════════════╪════════╪════════╪════════╪════════╪════════╪══════════╡
# │ 2024-01-02 00:00:00 ┆ AAPL   ┆ 187.15 ┆ 188.44 ┆ 183.89 ┆ 185.64 ┆ 82488700 │
# │ …                   ┆ …      ┆ …      ┆ …      ┆ …      ┆ …      ┆ …        │
# └─────────────────────┴────────┴────────┴────────┴────────┴────────┴──────────┘

Core Workflows

Fetch Multiple Symbols

import asyncio
from ml4t.data.managers.async_batch import async_batch_load
from ml4t.data.providers import YahooFinanceProvider

async def fetch_universe():
    async with YahooFinanceProvider() as provider:
        return await async_batch_load(
            provider,
            symbols=["AAPL", "MSFT", "GOOGL", "AMZN", "META"],
            start="2024-01-01",
            end="2024-12-31",
            max_concurrent=10,
        )

df = asyncio.run(fetch_universe())
print(df["symbol"].n_unique())  # 5

Use this when you need to load many symbols through a single call and want failures, rate limits, and retries handled consistently by the provider layer.
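To make the `max_concurrent` argument more concrete, the sketch below shows one common way to cap in-flight requests with `asyncio.Semaphore`. It uses only the standard library and is not the implementation of `async_batch_load`. `ConcurrencyTracker`, `fake_fetch`, and `bounded_fetch` are hypothetical names, and the sketch assumes that `max_concurrent` limits how many requests run at the same time.

```python
import asyncio


class ConcurrencyTracker:
    """Records how many fake requests are in flight at once."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0


async def fake_fetch(symbol: str, tracker: ConcurrencyTracker) -> dict:
    # Hypothetical stand-in for a network call; it only sleeps briefly.
    tracker.active += 1
    tracker.peak = max(tracker.peak, tracker.active)
    try:
        await asyncio.sleep(0.01)
        return {"symbol": symbol}
    finally:
        tracker.active -= 1


async def bounded_fetch(symbols, max_concurrent, tracker):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(symbol):
        # At most max_concurrent coroutines hold the semaphore at once.
        async with semaphore:
            return await fake_fetch(symbol, tracker)

    # gather returns results in the same order as the input symbols.
    return await asyncio.gather(*(guarded(s) for s in symbols))


tracker = ConcurrencyTracker()
symbols = ["AAPL", "MSFT", "GOOGL", "AMZN", "META"]
results = asyncio.run(bounded_fetch(symbols, max_concurrent=2, tracker=tracker))
print(tracker.peak, [r["symbol"] for r in results])
```

A lower cap is gentler on provider rate limits, while a higher cap finishes sooner when the provider allows it.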

Keep Stored Data Current

from pathlib import Path
from ml4t.data import DataManager
from ml4t.data.storage import HiveStorage
from ml4t.data.storage.backend import StorageConfig

manager = DataManager(storage=HiveStorage(StorageConfig(base_path=Path("./data"))))
manager.update("AAPL", provider="yahoo", asset_class="equities", frequency="daily")

This workflow re-fetches only the recent overlap window, merges new data into storage, and leaves update metadata you can inspect later.
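The overlap-and-merge idea can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the code behind `DataManager.update`. `overlap_start`, `merge_bars`, and the `overlap_days` parameter are hypothetical, the prices are made-up values, and the rule that freshly fetched rows replace stored rows on the same date is an assumption.

```python
from datetime import date, timedelta


def overlap_start(last_stored: date, overlap_days: int) -> date:
    """First date to re-fetch: a few days before the newest stored bar."""
    return last_stored - timedelta(days=overlap_days)


def merge_bars(stored: dict, fetched: dict) -> dict:
    """Merge close prices keyed by date; fetched values win on overlap."""
    merged = dict(stored)
    merged.update(fetched)
    # Return the bars in chronological order.
    return dict(sorted(merged.items()))


# Made-up closes already in storage.
stored = {
    date(2024, 1, 2): 100.0,
    date(2024, 1, 3): 101.0,
    date(2024, 1, 4): 102.0,
}

start = overlap_start(max(stored), overlap_days=1)

# Made-up result of re-fetching from `start`: one revised bar, one new bar.
fetched = {
    date(2024, 1, 3): 101.5,
    date(2024, 1, 4): 102.0,
    date(2024, 1, 5): 103.0,
}

merged = merge_bars(stored, fetched)
print(start, len(merged))
```

Re-fetching a short overlap lets late corrections from the provider replace stale rows without downloading the full history again.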

Validate Data Before You Train

from ml4t.data.validation import OHLCVValidator

validator = OHLCVValidator(max_return_threshold=0.5)
report = validator.validate(df)

print(report.passed)
print(report.errors[:3])

Run validation before feature engineering or backtesting so obviously bad bars, duplicates, or extreme-return anomalies do not leak into downstream workflows.
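As a rough picture of what such checks look for, here is a minimal standard-library sketch covering the three problems named above: inconsistent bars, duplicates, and extreme returns. It is not the logic of `OHLCVValidator`. `validate_bars` is a hypothetical helper, the bars are made-up values for a single symbol, and reading `max_return_threshold` as a cap on the absolute close-to-close return is an assumption.

```python
def validate_bars(bars, max_return_threshold=0.5):
    """Return a list of human-readable problems found in a list of bar dicts."""
    errors = []
    seen = set()
    prev_close = None
    for i, bar in enumerate(bars):
        # Duplicate check: the same timestamp must not appear twice.
        if bar["timestamp"] in seen:
            errors.append(f"row {i}: duplicate timestamp {bar['timestamp']}")
        seen.add(bar["timestamp"])
        # Consistency check: high and low must bracket open and close.
        if bar["high"] < max(bar["open"], bar["close"]):
            errors.append(f"row {i}: high below open or close")
        if bar["low"] > min(bar["open"], bar["close"]):
            errors.append(f"row {i}: low above open or close")
        # Extreme-return check against the previous close.
        if prev_close is not None:
            ret = bar["close"] / prev_close - 1
            if abs(ret) > max_return_threshold:
                errors.append(f"row {i}: return {ret:.2f} exceeds threshold")
        prev_close = bar["close"]
    return errors


# Made-up bars: row 1 has a bad high, row 2 is a duplicate, row 3 jumps too far.
bars = [
    {"timestamp": "2024-01-02", "open": 100, "high": 102, "low": 99, "close": 101},
    {"timestamp": "2024-01-03", "open": 101, "high": 100, "low": 98, "close": 99},
    {"timestamp": "2024-01-03", "open": 101, "high": 103, "low": 98, "close": 99},
    {"timestamp": "2024-01-04", "open": 99, "high": 181, "low": 99, "close": 180},
]

errors = validate_bars(bars)
for message in errors:
    print(message)
```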

Provider Comparison

| Provider | Asset Class | Example Coverage | API Key | Best For |
|---|---|---|---|---|
| Yahoo Finance | Stocks, ETFs, Crypto | Broad US and global listed assets | No | First fetches, daily OHLCV |
| CoinGecko | Crypto | 10K+ coins and tokens | No | Free crypto historical data |
| Binance Bulk | Crypto | Spot and USD-M futures archive files | No | Bulk historical crypto downloads |
| EODHD | Global Stocks | 60+ exchanges | Yes | International equity coverage |
| FRED | Macro | Rates, inflation, labor, GDP series | No | Macro research inputs |
| DataBento | Futures, Options | Institutional derivatives feeds | Yes | Exchange-grade futures data |
| Fama-French | Factors | Core academic factor datasets | No | Asset pricing research |

Full provider comparison

Installation

# With pip
pip install ml4t-data

# With uv
uv add ml4t-data

# With optional provider extras
pip install "ml4t-data[yahoo,databento]"

Part of the ML4T Library Suite

ml4t-data prepares the datasets used by the other ML4T libraries:

  1. ml4t-data fetches, validates, stores, and refreshes market data
  2. ml4t-engineer builds labels and features from those datasets
  3. ml4t-diagnostic validates signals, models, and portfolio behavior
  4. ml4t-backtest simulates execution on the resulting price and signal data
  5. ml4t-live reuses the same strategy workflows in paper and live trading

If you are coming from the book, use the docs in this order:

  1. Start here for the practical workflows the library automates
  2. Use the Book Guide to map chapter code to library surfaces
  3. Move into the User Guide for storage, validation, and updates
  4. Finish in the API Reference when you need exact call signatures

Next Steps

  • Quickstart

    Run your first provider fetch and inspect the returned Polars DataFrame.

  • Provider Selection

    Choose the right provider for your asset class, budget, and workflow.

  • User Guide

    Learn storage, validation, incremental updates, and recurring dataset workflows.

  • Book Guide

    Match ML4T chapter code to reusable managers, providers, and scripts.