Incremental Updates & Data Management¶
This document describes the incremental update system implemented in Sprint 004, including gap detection, file locking, chunked storage, and metadata tracking.
Overview¶
The incremental update system allows efficient data updates without re-downloading entire datasets. Key features include:
- Incremental Updates: Only fetch new data since last update
- Gap Detection: Identify and fill missing data points
- File Locking: Safe concurrent access to data files
- Chunked Storage: Split large datasets into manageable time-based chunks
- Metadata Tracking: Monitor dataset health and update history
CLI Commands¶
Initial Data Load¶
First time loading data for a symbol:
# Load historical data
ml4t-data load --provider yahoo --symbol AAPL --start 2023-01-01 --end 2024-01-01
# Load with specific frequency
ml4t-data load -p yahoo -s AAPL --start 2023-01-01 --end 2024-01-01 -f daily
Incremental Updates¶
Update existing data with new data:
# Basic update (uses existing provider)
ml4t-data update --symbol AAPL
# Update with options
ml4t-data update -s AAPL --lookback-days 10 --fill-gaps --show-status
# Update without gap filling
ml4t-data update -s AAPL --no-fill-gaps
# Update with different provider
ml4t-data update -s AAPL --provider yahoo
Options:
- --lookback-days/-l: Days to look back for validation (default: 7)
- --fill-gaps/--no-fill-gaps: Enable/disable gap filling (default: enabled)
- --show-status: Display detailed update status and history
- --provider/-p: Override provider (uses existing if not specified)
Health Monitoring¶
Check health status of all datasets:
# Basic health check
ml4t-data health
# Detailed health check
ml4t-data health --verbose
# Custom staleness threshold
ml4t-data health --stale-days 3 --verbose
Output shows: - Total datasets and their health status (✅ healthy, ⚠️ stale, ❌ error) - Total rows across all datasets - Breakdown by asset class - Individual dataset details (with --verbose)
Workflow Examples¶
Example 1: Daily Stock Data Updates¶
# Initial load of AAPL data
ml4t-data load -p yahoo -s AAPL --start 2023-01-01 --end 2024-01-01
# Daily update (run via cron)
ml4t-data update -s AAPL --show-status
# Check health weekly
ml4t-data health --verbose
Example 2: Multiple Symbol Management¶
# Load multiple symbols
for symbol in AAPL GOOGL MSFT NVDA; do
ml4t-data load -p yahoo -s $symbol --start 2023-01-01 --end 2024-01-01
done
# Update all symbols
for symbol in AAPL GOOGL MSFT NVDA; do
ml4t-data update -s $symbol
done
# Check overall health
ml4t-data health
Example 3: Crypto Data (24/7 Trading)¶
# Load crypto data
ml4t-data load -p yahoo -s BTC-USD --start 2023-01-01 --end 2024-01-01 -a crypto
# Update with gap detection (important for 24/7 markets)
ml4t-data update -s BTC-USD -a crypto --fill-gaps --show-status
Technical Details¶
Gap Detection¶
The system distinguishes between expected and unexpected gaps:
- Stock Markets: Weekends and after-hours are expected gaps
- Crypto Markets: 24/7 trading, all gaps are unexpected
- Configurable Tolerance: 10% default tolerance for gap detection
Gap filling methods:
- forward: Use last known value (default)
- backward: Use next known value
- interpolate: Linear interpolation
- zero: Fill with zeros
File Locking¶
Thread-safe and process-safe file locking ensures data integrity:
- Uses
filelocklibrary for cross-platform compatibility - Automatic lock acquisition and release
- Configurable timeout (default: 30 seconds)
- Prevents corruption during concurrent reads/writes
Chunked Storage¶
Large datasets are split into time-based chunks:
- Monthly chunks (default): 30-day periods
- Weekly chunks: 7-day periods
- Quarterly chunks: 90-day periods
- Yearly chunks: 365-day periods
Benefits: - Efficient incremental updates (only update relevant chunks) - Parallel processing capability - Reduced memory usage for large datasets - Fast time-range queries
Metadata Tracking¶
Each dataset maintains metadata including:
- Update history (last 100 updates)
- Health status (healthy/stale/error)
- Data range and row count
- Provider information
- Error tracking
Health checks consider: - Days since last update - Data currency (how far behind current date) - Recent error frequency
Architecture¶
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Provider │────▶│ Pipeline │────▶│ Storage │
└─────────────┘ └──────────────┘ └─────────────┘
│ │
▼ ▼
┌──────────────┐ ┌─────────────┐
│ Gap Detector │ │ File Lock │
└──────────────┘ └─────────────┘
│ │
▼ ▼
┌──────────────┐ ┌─────────────┐
│ Metadata │ │ Chunks │
│ Tracker │ │ Storage │
└──────────────┘ └─────────────┘
Components¶
- Pipeline (
src/ml4t-data/pipeline.py) - Orchestrates data flow
run_load(): Full data load-
run_update(): Incremental update -
Gap Detector (
src/ml4t-data/utils/gaps.py) - Identifies missing data points
- Market hours awareness
-
Multiple fill strategies
-
File Locking (
src/ml4t-data/utils/locking.py) - Thread/process-safe access
- Automatic cleanup
-
Timeout configuration
-
Chunked Storage (
src/ml4t-data/storage/chunked.py) - Time-based data splitting
- Efficient updates
-
Metadata indexing
-
Metadata Tracker (
src/ml4t-data/storage/metadata_tracker.py) - Update history
- Health monitoring
- Summary statistics
Best Practices¶
Update Frequency¶
- Daily data: Update once per day after market close
- Minute data: Update every few hours during market hours
- Crypto: Update more frequently (hourly or more)
Error Handling¶
The system includes robust error handling:
# Automatic retries with exponential backoff
@with_retry(max_attempts=3, min_wait=1.0, max_wait=30.0)
def _fetch_data_with_retry(...)
Monitoring¶
Set up monitoring using the health command:
# Cron job for daily health check
0 9 * * * ml4t-data health --stale-days 2 >> /var/log/ml4t-data-health.log
Storage Management¶
Monitor disk usage as datasets grow:
# Check data directory size
du -sh ~/.ml4t-data/data/
# List all datasets
ml4t-data list
# Remove old data if needed (manual process)
rm -rf ~/.ml4t-data/data/equities/daily/OLD_SYMBOL
Troubleshooting¶
Common Issues¶
- "No existing data found"
- Run
ml4t-data loadfirst before usingupdate -
Check the correct asset class and frequency
-
"Lock timeout"
- Another process is accessing the file
- Check for stuck processes
-
Increase timeout if needed
-
"Gaps detected"
- Normal for some data sources
- Use
--fill-gapsto automatically fill -
Check provider data quality
-
"Data is stale"
- Run update more frequently
- Check provider connectivity
- Verify market hours settings
Debug Mode¶
Enable debug logging for troubleshooting:
# Set log level in .env
echo "ML4T Data_LOG_LEVEL=DEBUG" >> .env
# Run with verbose output
ml4t-data update -s AAPL --show-status
Performance Considerations¶
Memory Usage¶
- Chunked storage keeps memory usage low
- Each chunk is processed independently
- Typical chunk size: 10-50 MB
Disk Usage¶
- Parquet compression reduces storage by 50-80%
- Monthly chunks balance size and performance
- Metadata overhead: ~1KB per dataset
Update Speed¶
- Incremental updates are 10-100x faster than full loads
- Gap detection adds minimal overhead (<1 second)
- File locking has negligible performance impact
Future Enhancements¶
Potential improvements for future sprints:
- Parallel Updates: Update multiple symbols concurrently
- Smart Scheduling: Automatic update scheduling based on asset class
- Data Validation: Detect and flag suspicious data points
- Compression Options: Support for different compression algorithms
- Archive Storage: Move old data to compressed archives
- Update Notifications: Email/webhook alerts for failures
- REST API: HTTP endpoint for remote updates
- Data Reconciliation: Compare and sync with multiple providers
Summary¶
The incremental update system provides efficient, reliable data management with:
- ✅ Minimal bandwidth usage (only fetch new data)
- ✅ Data integrity (file locking, gap detection)
- ✅ Scalability (chunked storage)
- ✅ Observability (health monitoring, update history)
- ✅ Flexibility (configurable options, multiple providers)
This foundation enables building robust quantitative trading systems with reliable, up-to-date market data.