# Incremental Updates
ml4t-data supports two update styles:
- high-level updates through `DataManager.update()` for normal stored datasets
- lower-level update planning and metadata tracking through the utilities in `ml4t.data.update_manager`
The goal in both cases is the same: fetch only what changed, merge it into existing storage, detect gaps, and keep enough metadata to monitor dataset health over time.
## What the Current Update Stack Does
For a stored dataset such as `equities/daily/AAPL`, the update flow is:
- Read the existing data from storage.
- Use the latest stored timestamp plus a configurable lookback window to decide what range to re-fetch.
- Fetch fresh data through the configured provider.
- Merge and deduplicate by timestamp.
- Optionally run gap detection and validation.
- Persist the updated dataset and metadata.
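The range-planning and merge-deduplicate steps above can be sketched in plain pandas. This is an illustrative sketch with hypothetical helper names, not the library's internal implementation, which lives behind `DataManager.update()`:

```python
from datetime import timedelta

import pandas as pd


def plan_fetch_start(stored: pd.DataFrame, lookback_days: int = 7) -> pd.Timestamp:
    """Start of the re-fetch window: latest stored timestamp minus a lookback."""
    return stored.index.max() - timedelta(days=lookback_days)


def merge_incremental(stored: pd.DataFrame, fresh: pd.DataFrame) -> pd.DataFrame:
    """Concatenate, keep the freshest row per timestamp, restore sort order."""
    combined = pd.concat([stored, fresh])
    combined = combined[~combined.index.duplicated(keep="last")]
    return combined.sort_index()


stored = pd.DataFrame(
    {"close": [100.0, 101.0, 102.0]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)
fresh = pd.DataFrame(
    {"close": [102.5, 103.0]},  # revised last bar plus one new bar
    index=pd.date_range("2024-01-03", periods=2, freq="D"),
)

merged = merge_incremental(stored, fresh)
print(len(merged))  # 4 rows: the overlapping bar was replaced, not duplicated
```

The overlap matters: providers routinely revise recent bars, so re-fetching a short window and letting the fresh copy win on duplicate timestamps keeps stored history consistent.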
The main entry points are:
| Surface | Use when | Notes |
|---|---|---|
| `DataManager.update()` | You already have a storage-backed manager and want the normal library workflow | Best default for production pipelines |
| `ml4t-data update` | You want to update one stored symbol from the CLI | Supports `incremental`, `append_only`, `full_refresh`, and `backfill` strategies |
| `ml4t-data update-all` | You manage recurring datasets from a YAML config | Good for book datasets and cron-style automation |
| `IncrementalUpdater` | You need direct access to range planning and update strategies | Lower-level API in `ml4t.data.update_manager` |
| `MetadataTracker` | You need status, history, or health summaries | Backing store for status- and health-style reporting |
## Recommended Python Workflow
```python
from pathlib import Path

from ml4t.data import DataManager
from ml4t.data.storage import HiveStorage
from ml4t.data.storage.backend import StorageConfig

storage = HiveStorage(StorageConfig(base_path=Path("./data")))
manager = DataManager(storage=storage, enable_validation=True)

# First run: load historical data into storage
manager.load(
    symbol="AAPL",
    start="2020-01-01",
    end="2024-12-31",
    provider="yahoo",
    asset_class="equities",
    frequency="daily",
)

# Later runs: only fetch the recent range plus a small lookback window
key = manager.update(
    symbol="AAPL",
    provider="yahoo",
    asset_class="equities",
    frequency="daily",
    lookback_days=7,
    fill_gaps=True,
)
print(key)  # equities/daily/AAPL
```
Use `create_storage(..., strategy="flat")` for smaller datasets, but prefer
Hive storage for time-series updates because it aligns with the library's
read, merge, and pruning workflow.
## CLI Workflows
### Update one dataset
```bash
# Default incremental update
ml4t-data update -s AAPL --provider yahoo --storage-path ./data

# Force a full re-download
ml4t-data update -s AAPL --strategy full_refresh --storage-path ./data

# Backfill a specific historical window
ml4t-data update -s AAPL --strategy backfill --start 2020-01-01 --end 2020-12-31 \
    --storage-path ./data
```
### Inspect update status
```bash
ml4t-data status --storage-path ./data
ml4t-data status --detailed --storage-path ./data
ml4t-data health --storage-path ./data
```
### Run recurring dataset updates from YAML
```yaml
storage:
  path: ~/ml4t-data

datasets:
  etf_core:
    provider: yahoo
    symbols: [SPY, QQQ, IWM, TLT, GLD]
    frequency: daily

  macro:
    provider: fred
    symbols: [DGS3MO, CPIAUCSL, UNRATE]
    frequency: daily
```
```bash
ml4t-data update-all -c ml4t-data.yaml
ml4t-data update-all -c ml4t-data.yaml --dataset macro
ml4t-data update-all -c ml4t-data.yaml --dry-run
```
## Gap Detection and Validation
Incremental updates are useful only if the resulting data stays trustworthy.
ml4t-data uses two complementary mechanisms:
- `ml4t.data.utils.gaps.GapDetector` detects missing periods and summarizes gap counts and durations
- `OHLCVValidator` checks structural issues such as bad OHLC relationships, duplicates, negative values, and stale or extreme-return patterns
```python
from ml4t.data.utils.gaps import GapDetector
from ml4t.data.validation import OHLCVValidator

gaps = GapDetector().detect_gaps(df, frequency="daily", is_crypto=False)
validation = OHLCVValidator(max_return_threshold=0.5).validate(df)
print(len(gaps), validation.passed)
```
For crypto and other 24/7 markets, make sure the gap detector is configured with the right market assumptions. Intraday equities and continuous crypto have very different definitions of an "expected" gap.
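The calendar distinction can be illustrated with a toy check. This is plain pandas under simplified assumptions (holidays ignored), not the library's `GapDetector`:

```python
import pandas as pd


def missing_days(index: pd.DatetimeIndex, is_crypto: bool) -> pd.DatetimeIndex:
    """Expected daily timestamps that are absent from the index.

    Crypto trades 24/7, so every calendar day is expected; equities
    only expect business days (exchange holidays are ignored here).
    """
    freq = "D" if is_crypto else "B"
    expected = pd.date_range(index.min(), index.max(), freq=freq)
    return expected.difference(index)


# Thu, Fri, Mon — a weekend sits in the middle
idx = pd.to_datetime(["2024-01-04", "2024-01-05", "2024-01-08"])

equity_missing = missing_days(idx, is_crypto=False)
crypto_missing = missing_days(idx, is_crypto=True)
print(len(equity_missing), len(crypto_missing))  # 0 2
```

The same three timestamps are a clean series for equities but a two-day gap for a 24/7 market, which is why misconfiguring the market assumption produces either false alarms or silently missed gaps.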
## Metadata and Health Tracking
`MetadataTracker` records update history and dataset summaries under
`.metadata/` inside the storage root. This is what powers:
- `ml4t-data status`
- `ml4t-data health`
- per-dataset update histories and freshness checks
Each update record includes the provider, update type, date range, row counts, duration, and any gap-filling information. That makes it practical to answer:
- When was this dataset last updated?
- Was the last run incremental or a full refresh?
- How many rows were added or rewritten?
- Is the dataset stale or healthy?
## Choosing an Update Strategy
| Strategy | Best for | Tradeoff |
|---|---|---|
| `incremental` | Normal daily or hourly refreshes | Re-fetches a short overlap window to stay robust |
| `append_only` | Immutable append workflows | Will not rewrite existing history |
| `full_refresh` | Provider corrections or schema changes | Most expensive option |
| `backfill` | Filling missing historical periods | Useful after outages or provider switches |
Default to `incremental`. Use `full_refresh` only when the upstream source or
your storage layout changed enough that merging old and new data is unsafe.
## See It in the Book
The book codebase demonstrates the same update concepts from notebook-scale examples through reusable scripts:
- Complete pipeline in Chapter 2
- Data management in Chapter 2
- Incremental updates in Chapter 2
- Canonical dataset downloader
The progression is intentional:
- The chapter scripts show the mechanics directly with `DataManager`, `HiveStorage`, `GapDetector`, and `OHLCVValidator`.
- The `code/data/download` scripts turn those ideas into repeatable dataset pipelines.
- Your own production workflow can usually reuse the same library calls with a project-specific config and storage root.