# Configuration
ml4t-data uses YAML configuration files validated by Pydantic models. The config system supports environment variable interpolation, file includes, and per-asset-class defaults.
Use this page when you want a repeatable, config-driven workflow instead of ad hoc provider calls. It is the right place to start for scheduled downloads, multi-dataset pipelines, and book-style dataset orchestration.
## Minimal Working Example

```yaml
storage:
  path: ~/ml4t-data

datasets:
  etf_core:
    provider: yahoo
    symbols: [SPY, QQQ, IWM, TLT, GLD]
    frequency: daily
```

```python
from pathlib import Path

from ml4t.data.config import load_config

config = load_config(Path("ml4t-data.yaml"))
print(config)
```
## File Discovery

The `ConfigLoader` searches these locations in order:

1. `./ml4t.data.yaml`
2. `./ml4t.data.yml`
3. `./.ml4t-data.yaml`
4. `./.ml4t-data.yml`
5. `./config/ml4t.data.yaml`
6. `~/.config/ml4t-data/config.yaml`

You can also pass an explicit path to `load_config`, as the minimal example above does.
## Full Configuration Example

```yaml
version: "1.0"
base_dir: ./data
log_level: INFO
parallel_downloads: 4

# Storage backend
storage:
  strategy: hive                # "hive" (partitioned) or "flat" (single file)
  base_path: ~/ml4t-data
  compression: zstd             # zstd, lz4, snappy, or none
  partition_granularity: month  # year, month, day, or hour
  atomic_writes: true
  enable_locking: true
  metadata_tracking: true

# Data providers
providers:
  - name: yahoo
    type: yahoo
    enabled: true
    rate_limit:
      requests_per_second: 5.0
      retry_max_attempts: 3
    timeout: 30
  - name: databento
    type: databento
    api_key: ${DATABENTO_API_KEY}
    rate_limit:
      requests_per_second: 10.0
      circuit_breaker_threshold: 5
      circuit_breaker_timeout: 60

# Symbol universes
universes:
  - name: tech_stocks
    symbols: [AAPL, MSFT, GOOGL, AMZN, META]
    asset_class: equity
  - name: sp500
    file: symbols/sp500.txt     # one symbol per line
    provider: yahoo
    asset_class: equity

# Datasets
datasets:
  - name: us_equities
    universe: sp500
    provider: yahoo
    frequency: daily
    asset_class: equity
    update_mode: incremental
    validation_enabled: true
    anomaly_detection: false
  - name: crypto_spot
    symbols: [BTC, ETH, SOL]
    provider: binance_api
    frequency: hourly
    asset_class: crypto

# Workflows
workflows:
  - name: daily_update
    datasets: [us_equities, crypto_spot]
    schedule:
      type: daily
      time: "18:00"
      timezone: US/Eastern
    on_error: continue
```
## Environment Variables

API keys and secrets support `${VAR}` interpolation with optional defaults:

```yaml
providers:
  - name: polygon
    api_key: ${POLYGON_API_KEY}            # required
    api_secret: ${POLYGON_SECRET:default}  # with fallback
```
The `env` section defines variables that are set only if they are not already present in the environment.
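For example, a hypothetical `env` block (the variable name below is illustrative, not one ml4t-data defines):

```yaml
env:
  ML4T_LOG_LEVEL: INFO   # applied only if ML4T_LOG_LEVEL is not already set
```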
ml4t-data also reads `.env` files automatically via Pydantic Settings.
## Key Configuration Sections

### Storage

| Field | Default | Description |
|---|---|---|
| `strategy` | `hive` | `hive` for partitioned Parquet, `flat` for single files |
| `base_path` | `./data` | Base directory (supports `~` expansion) |
| `compression` | `zstd` | Parquet compression: `zstd`, `lz4`, `snappy`, `none` |
| `partition_granularity` | `month` | Hive partition level: `year`, `month`, `day`, `hour` |
| `atomic_writes` | `true` | Write to temp file then rename |
| `enable_locking` | `true` | File locking for concurrent access |
| `metadata_tracking` | `true` | JSON manifest files alongside data |
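Setting `atomic_writes: true` means readers never observe a half-written file. The underlying technique, sketched below with stdlib calls (an illustration of the idea, not ml4t-data's actual code), is to write to a temporary file in the same directory and then rename it into place:

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path: Path, data: bytes) -> None:
    """Write data to path so readers never see a partial file."""
    # Create the temp file next to the target so the final rename
    # stays on one filesystem (cross-device renames are not atomic).
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # ensure bytes hit disk before the rename
        os.replace(tmp, path)      # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise

atomic_write(Path("/tmp/demo.parquet"), b"payload")
```

Because the rename is atomic, a concurrent reader sees either the old file or the complete new one, never a mix.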
### Providers

Each provider entry configures connection parameters:

| Field | Default | Description |
|---|---|---|
| `name` | required | Provider identifier |
| `type` | required | `yahoo`, `binance_api`, `binance_bulk`, `cryptocompare`, `databento`, `oanda`, `polygon`, `mock` |
| `enabled` | `true` | Toggle provider on/off |
| `api_key` | `null` | API key (use `${ENV_VAR}` format) |
| `rate_limit` | see below | Rate limiting and circuit breaker config |
| `timeout` | `30` | Request timeout in seconds |
| `cache_enabled` | `true` | Response caching |
| `cache_ttl` | `3600` | Cache time-to-live in seconds |
Rate limiting sub-config:

| Field | Default | Description |
|---|---|---|
| `requests_per_second` | `10.0` | Maximum request rate |
| `burst_size` | `1` | Burst allowance |
| `retry_max_attempts` | `3` | Retry count |
| `retry_backoff_factor` | `2.0` | Exponential backoff multiplier |
| `circuit_breaker_threshold` | `5` | Failures before circuit opens |
| `circuit_breaker_timeout` | `60` | Seconds before circuit half-opens |
### Schedules

Workflows support five schedule types:

```yaml
# Cron expression
schedule:
  type: cron
  cron: "0 18 * * 1-5"
```

```yaml
# Fixed interval (seconds)
schedule:
  type: interval
  interval: 3600
```

```yaml
# Daily at specific time
schedule:
  type: daily
  time: "18:00"
  timezone: US/Eastern
```

```yaml
# Weekly
schedule:
  type: weekly
  time: "09:00"
  weekday: 0   # Monday
```

```yaml
# Relative to market hours
schedule:
  type: market_hours
  market_close_offset: 30   # 30 minutes before close
```
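Resolving a `daily` entry like the one above means computing the next occurrence of the wall-clock time in the given timezone. A stdlib sketch of that resolution (not ml4t-data's scheduler; `next_daily_run` is a hypothetical helper):

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

def next_daily_run(now: datetime, run_at: time, tz: str) -> datetime:
    """Next occurrence of run_at in timezone tz, strictly after now if needed."""
    local_now = now.astimezone(ZoneInfo(tz))
    candidate = local_now.replace(hour=run_at.hour, minute=run_at.minute,
                                  second=0, microsecond=0)
    if candidate <= local_now:
        candidate += timedelta(days=1)   # today's slot already passed
    return candidate

run = next_daily_run(datetime.now(ZoneInfo("UTC")), time(18, 0), "US/Eastern")
print(run)
```

Using a timezone-aware datetime keeps the 18:00 wall-clock target stable across daylight-saving transitions.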
## File Includes

Split large configs across files with the `include` directive:

```yaml
# ml4t.data.yaml
include:
  - providers/yahoo.yaml
  - providers/databento.yaml
  - universes/equities.yaml

datasets:
  - name: us_equities
    universe: sp500
    provider: yahoo
```
Included files are merged recursively. The main file takes priority over includes.
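The merge rule can be pictured as a recursive dictionary merge in which the main file wins on conflicts; `deep_merge` below is an illustrative helper, not the loader's actual API:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or type mismatches: override replaces.
            merged[key] = value
    return merged

included = {"storage": {"compression": "lz4", "atomic_writes": True}}
main = {"storage": {"compression": "zstd"}}
print(deep_merge(included, main))
# storage.compression comes from the main file; atomic_writes survives
```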
## Validation

Use `ConfigValidator` to check for consistency errors before running:

```python
from ml4t.data.config import load_config, ConfigValidator

config = load_config()
validator = ConfigValidator(config)

if not validator.validate():
    for error in validator.errors:
        print(f"ERROR: {error}")
    for warning in validator.warnings:
        print(f"WARNING: {warning}")
```
The validator checks for:
- Duplicate provider, dataset, or universe names
- Datasets referencing non-existent providers or universes
- Workflows referencing non-existent datasets
- Invalid date ranges (start >= end)
- Missing cron expressions or interval values in schedules
- Orphaned providers or datasets not used by any workflow
## Programmatic Access

```python
from ml4t.data.config import DataConfig

# Load from YAML
config = DataConfig.from_yaml("ml4t.data.yaml")

# Look up components
provider = config.get_provider("yahoo")
universe = config.get_universe("sp500")
dataset = config.get_dataset("us_equities")

# Validate references
issues = config.validate_config()

# Save back to YAML
config.to_yaml("ml4t.data.yaml")
```
## See It In The Book

The book codebase uses the same pattern for canonical dataset automation. Its config files show how the book moves from one-off notebook exploration to reusable dataset definitions that can be updated repeatedly.