
Incremental Updates & Data Management

This document describes the incremental update system implemented in Sprint 004, including gap detection, file locking, chunked storage, and metadata tracking.

Overview

The incremental update system allows efficient data updates without re-downloading entire datasets. Key features include:

  • Incremental Updates: Only fetch new data since last update
  • Gap Detection: Identify and fill missing data points
  • File Locking: Safe concurrent access to data files
  • Chunked Storage: Split large datasets into manageable time-based chunks
  • Metadata Tracking: Monitor dataset health and update history

CLI Commands

Initial Data Load

Load data for a symbol for the first time:

# Load historical data
ml4t-data load --provider yahoo --symbol AAPL --start 2023-01-01 --end 2024-01-01

# Load with specific frequency
ml4t-data load -p yahoo -s AAPL --start 2023-01-01 --end 2024-01-01 -f daily

Incremental Updates

Fetch new data for a symbol that has already been loaded:

# Basic update (uses existing provider)
ml4t-data update --symbol AAPL

# Update with options
ml4t-data update -s AAPL --lookback-days 10 --fill-gaps --show-status

# Update without gap filling
ml4t-data update -s AAPL --no-fill-gaps

# Update with different provider
ml4t-data update -s AAPL --provider yahoo

Options:

  • --lookback-days/-l: Days to look back for validation (default: 7)
  • --fill-gaps/--no-fill-gaps: Enable/disable gap filling (default: enabled)
  • --show-status: Display detailed update status and history
  • --provider/-p: Override provider (uses existing if not specified)
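The fetch window for an incremental update can be sketched as follows. This is an illustration only: it assumes that --lookback-days means "re-request a short overlap before the last stored bar so recent rows can be validated", which the option description suggests but does not spell out. The function name `update_window` is hypothetical and not part of the CLI or library.

```python
from datetime import date, timedelta


def update_window(last_stored: date, today: date, lookback_days: int = 7) -> tuple[date, date]:
    """Return the (start, end) date range to request from the provider.

    Assumption: the lookback re-fetches a small overlap before the last
    stored bar instead of re-downloading the whole history.
    """
    start = last_stored - timedelta(days=lookback_days)
    return start, today


# Data stored through 2024-01-01, update run on 2024-01-15.
start, end = update_window(date(2024, 1, 1), date(2024, 1, 15))
print(start, end)  # 2023-12-25 2024-01-15
```

Under this reading, a larger lookback costs a few extra rows per request but gives the update more overlap to compare against stored data.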

Health Monitoring

Check health status of all datasets:

# Basic health check
ml4t-data health

# Detailed health check
ml4t-data health --verbose

# Custom staleness threshold
ml4t-data health --stale-days 3 --verbose

Output shows:

  • Total datasets and their health status (✅ healthy, ⚠️ stale, ❌ error)
  • Total rows across all datasets
  • Breakdown by asset class
  • Individual dataset details (with --verbose)

Workflow Examples

Example 1: Daily Stock Data Updates

# Initial load of AAPL data
ml4t-data load -p yahoo -s AAPL --start 2023-01-01 --end 2024-01-01

# Daily update (run via cron)
ml4t-data update -s AAPL --show-status

# Check health weekly
ml4t-data health --verbose

Example 2: Multiple Symbol Management

# Load multiple symbols
for symbol in AAPL GOOGL MSFT NVDA; do
    ml4t-data load -p yahoo -s "$symbol" --start 2023-01-01 --end 2024-01-01
done

# Update all symbols
for symbol in AAPL GOOGL MSFT NVDA; do
    ml4t-data update -s "$symbol"
done

# Check overall health
ml4t-data health

Example 3: Crypto Data (24/7 Trading)

# Load crypto data
ml4t-data load -p yahoo -s BTC-USD --start 2023-01-01 --end 2024-01-01 -a crypto

# Update with gap detection (important for 24/7 markets)
ml4t-data update -s BTC-USD -a crypto --fill-gaps --show-status

Technical Details

Gap Detection

The system distinguishes between expected and unexpected gaps:

  • Stock Markets: Weekends and after-hours are expected gaps
  • Crypto Markets: 24/7 trading, all gaps are unexpected
  • Configurable Tolerance: 10% default tolerance for gap detection

Gap filling methods:

  • forward: Use last known value (default)
  • backward: Use next known value
  • interpolate: Linear interpolation
  • zero: Fill with zeros
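The distinction between expected and unexpected gaps, and the forward fill method, can be illustrated with a small standard-library sketch for daily bars. It is not the code in the gap detector module: the function names are hypothetical, exchange holidays are ignored, and the 10% tolerance is not modeled.

```python
from datetime import date, timedelta


def expected_days(start: date, end: date, asset_class: str) -> list[date]:
    """Dates on which a daily bar is expected, start and end inclusive."""
    days = []
    current = start
    while current <= end:
        # Equities skip weekends (holidays are ignored in this sketch);
        # crypto trades every day, so every date is expected.
        if asset_class == "crypto" or current.weekday() < 5:
            days.append(current)
        current += timedelta(days=1)
    return days


def find_gaps(observed: set[date], start: date, end: date, asset_class: str) -> list[date]:
    """Expected dates that have no observed bar."""
    return [d for d in expected_days(start, end, asset_class) if d not in observed]


def fill_forward(series: dict[date, float], gaps: list[date]) -> dict[date, float]:
    """Fill each gap with the most recent earlier value (the 'forward' method)."""
    filled = dict(series)
    for day in sorted(gaps):
        earlier = [d for d in filled if d < day]
        if earlier:
            filled[day] = filled[max(earlier)]
    return filled


# 2024-01-01 is a Monday; the bar for Wednesday 2024-01-03 is missing.
closes = {
    date(2024, 1, 1): 100.0,
    date(2024, 1, 2): 101.0,
    date(2024, 1, 4): 103.0,
    date(2024, 1, 5): 104.0,
    date(2024, 1, 8): 105.0,
}
gaps = find_gaps(set(closes), date(2024, 1, 1), date(2024, 1, 8), "equities")
print(gaps)                                           # [datetime.date(2024, 1, 3)]
print(fill_forward(closes, gaps)[date(2024, 1, 3)])   # 101.0
```

With the same observations and asset class "crypto", the weekend dates 2024-01-06 and 2024-01-07 are reported as gaps as well, which is why gap detection matters more for 24/7 markets.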

File Locking

Thread-safe and process-safe file locking ensures data integrity:

  • Uses filelock library for cross-platform compatibility
  • Automatic lock acquisition and release
  • Configurable timeout (default: 30 seconds)
  • Prevents corruption during concurrent reads/writes
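The acquire, timeout, and release pattern listed above can be sketched with the standard library alone. This is a stand-in for illustration, not the filelock library and not the code in the locking module: it uses an exclusively created lock file, the name `lock_file` is hypothetical, and it does not recover from a stale lock left behind by a crashed process.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def lock_file(path: str, timeout: float = 30.0, poll: float = 0.05):
    """Hold an exclusive lock represented by the existence of `path`."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one caller at a time can create the lock file.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {path} within {timeout}s")
            time.sleep(poll)
    try:
        yield
    finally:
        # Release: close the descriptor and remove the lock file, even if
        # the protected block raised.
        os.close(fd)
        os.remove(path)
```

A caller would wrap each read or write of a data file in `with lock_file(data_path + ".lock"):`. A second process that reaches the same statement waits until the first one leaves the block or the timeout expires.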

Chunked Storage

Large datasets are split into time-based chunks:

  • Monthly chunks (default): 30-day periods
  • Weekly chunks: 7-day periods
  • Quarterly chunks: 90-day periods
  • Yearly chunks: 365-day periods

Benefits:

  • Efficient incremental updates (only update relevant chunks)
  • Parallel processing capability
  • Reduced memory usage for large datasets
  • Fast time-range queries
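To see why chunking keeps incremental updates cheap, the sketch below maps dates to chunk numbers using the period lengths listed above. It assumes fixed-length periods counted from an arbitrary anchor date. The anchor, the dictionary, and the function names are hypothetical, and the real storage layer may align chunks differently (for example to calendar months).

```python
from datetime import date

# Period lengths in days, taken from the list above.
CHUNK_DAYS = {"weekly": 7, "monthly": 30, "quarterly": 90, "yearly": 365}
EPOCH = date(1970, 1, 1)  # hypothetical anchor for chunk boundaries


def chunk_index(day: date, chunk_size: str = "monthly") -> int:
    """Number of the fixed-length chunk that contains `day`."""
    return (day - EPOCH).days // CHUNK_DAYS[chunk_size]


def chunks_touched(start: date, end: date, chunk_size: str = "monthly") -> list[int]:
    """Chunk numbers that an update covering [start, end] must rewrite."""
    first = chunk_index(start, chunk_size)
    last = chunk_index(end, chunk_size)
    return list(range(first, last + 1))


# A one-week update touches at most two monthly chunks, however long
# the stored history is.
print(len(chunks_touched(date(2024, 1, 1), date(2024, 1, 7))))
```

Time-range queries benefit in the same way: only the chunks returned by `chunks_touched` for the requested range need to be read.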

Metadata Tracking

Each dataset maintains metadata including:

  • Update history (last 100 updates)
  • Health status (healthy/stale/error)
  • Data range and row count
  • Provider information
  • Error tracking

Health checks consider:

  • Days since last update
  • Data currency (how far behind current date)
  • Recent error frequency
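A health check along these lines could look like the sketch below. The class, field names, and thresholds are assumptions made for illustration (including the default of 7 stale days and the "3 failures in the last 5 updates" rule). Only the 100-entry history cap and the three status values come from the description above.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetMetadata:
    last_update: date
    # Keep only the most recent 100 update records; older ones drop off.
    history: deque = field(default_factory=lambda: deque(maxlen=100))

    def record(self, day: date, ok: bool) -> None:
        """Append one update attempt; successful ones advance last_update."""
        self.history.append((day, ok))
        if ok:
            self.last_update = day


def health_status(meta: DatasetMetadata, today: date,
                  stale_days: int = 7, max_recent_errors: int = 3) -> str:
    """Classify a dataset as 'healthy', 'stale', or 'error'."""
    recent = list(meta.history)[-5:]  # hypothetical window of recent updates
    errors = sum(1 for _, ok in recent if not ok)
    if errors >= max_recent_errors:
        return "error"
    if (today - meta.last_update).days > stale_days:
        return "stale"
    return "healthy"


meta = DatasetMetadata(last_update=date(2024, 1, 1))
print(health_status(meta, date(2024, 1, 5)))                # healthy
print(health_status(meta, date(2024, 1, 5), stale_days=3))  # stale
```

The `stale_days` argument plays the role of the --stale-days option of the health command: lowering it makes the same dataset count as stale sooner.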

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Provider  │────▶│   Pipeline   │────▶│   Storage   │
└─────────────┘     └──────────────┘     └─────────────┘
                            │                     │
                            ▼                     ▼
                    ┌──────────────┐     ┌─────────────┐
                    │ Gap Detector │     │ File Lock   │
                    └──────────────┘     └─────────────┘
                            │                     │
                            ▼                     ▼
                    ┌──────────────┐     ┌─────────────┐
                    │   Metadata   │     │   Chunks    │
                    │   Tracker    │     │   Storage   │
                    └──────────────┘     └─────────────┘

Components

  1. Pipeline (src/ml4t-data/pipeline.py)
       • Orchestrates data flow
       • run_load(): Full data load
       • run_update(): Incremental update

  2. Gap Detector (src/ml4t-data/utils/gaps.py)
       • Identifies missing data points
       • Market hours awareness
       • Multiple fill strategies

  3. File Locking (src/ml4t-data/utils/locking.py)
       • Thread/process-safe access
       • Automatic cleanup
       • Timeout configuration

  4. Chunked Storage (src/ml4t-data/storage/chunked.py)
       • Time-based data splitting
       • Efficient updates
       • Metadata indexing

  5. Metadata Tracker (src/ml4t-data/storage/metadata_tracker.py)
       • Update history
       • Health monitoring
       • Summary statistics

Best Practices

Update Frequency

  • Daily data: Update once per day after market close
  • Minute data: Update every few hours during market hours
  • Crypto: Update more frequently (hourly or more)

Error Handling

Transient fetch failures are handled with automatic retries:

# Automatic retries with exponential backoff
@with_retry(max_attempts=3, min_wait=1.0, max_wait=30.0)
def _fetch_data_with_retry(...)
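How a decorator with this signature might behave can be sketched as follows. This is not the library's implementation of with_retry: the doubling schedule is one common form of exponential backoff, and the `sleep` parameter is a hypothetical addition that lets the waits be observed or skipped in tests.

```python
import functools
import time


def with_retry(max_attempts=3, min_wait=1.0, max_wait=30.0, sleep=time.sleep):
    """Retry the wrapped function, doubling the wait after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_wait
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Wait before the next attempt, capped at max_wait.
                    sleep(min(wait, max_wait))
                    wait *= 2
        return wrapper
    return decorator


waits = []
calls = {"n": 0}


@with_retry(max_attempts=3, min_wait=1.0, max_wait=30.0, sleep=waits.append)
def flaky_fetch():
    """Fails twice, then succeeds (stands in for a provider request)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary provider error")
    return "ok"


print(flaky_fetch(), waits)  # ok [1.0, 2.0]
```

With max_attempts=3 the function is called at most three times, so at most two waits occur. The max_wait cap only matters for longer retry chains.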

Monitoring

Set up monitoring using the health command:

# Cron job for daily health check
0 9 * * * ml4t-data health --stale-days 2 >> /var/log/ml4t-data-health.log

Storage Management

Monitor disk usage as datasets grow:

# Check data directory size
du -sh ~/.ml4t-data/data/

# List all datasets
ml4t-data list

# Remove old data if needed (manual process)
rm -rf ~/.ml4t-data/data/equities/daily/OLD_SYMBOL

Troubleshooting

Common Issues

  1. "No existing data found"
       • Run ml4t-data load first before using update
       • Check the correct asset class and frequency

  2. "Lock timeout"
       • Another process is accessing the file
       • Check for stuck processes
       • Increase timeout if needed

  3. "Gaps detected"
       • Normal for some data sources
       • Use --fill-gaps to automatically fill
       • Check provider data quality

  4. "Data is stale"
       • Run update more frequently
       • Check provider connectivity
       • Verify market hours settings

Debug Mode

Enable debug logging for troubleshooting:

# Set log level in .env
echo "ML4T_DATA_LOG_LEVEL=DEBUG" >> .env

# Run an update with status output
ml4t-data update -s AAPL --show-status

Performance Considerations

Memory Usage

  • Chunked storage keeps memory usage low
  • Each chunk is processed independently
  • Typical chunk size: 10-50 MB

Disk Usage

  • Parquet compression reduces storage by 50-80%
  • Monthly chunks balance size and performance
  • Metadata overhead: ~1KB per dataset

Update Speed

  • Incremental updates are 10-100x faster than full loads
  • Gap detection adds minimal overhead (<1 second)
  • File locking has negligible performance impact

Future Enhancements

Potential improvements for future sprints:

  1. Parallel Updates: Update multiple symbols concurrently
  2. Smart Scheduling: Automatic update scheduling based on asset class
  3. Data Validation: Detect and flag suspicious data points
  4. Compression Options: Support for different compression algorithms
  5. Archive Storage: Move old data to compressed archives
  6. Update Notifications: Email/webhook alerts for failures
  7. REST API: HTTP endpoint for remote updates
  8. Data Reconciliation: Compare and sync with multiple providers

Summary

The incremental update system provides efficient, reliable data management with:

  • ✅ Minimal bandwidth usage (only fetch new data)
  • ✅ Data integrity (file locking, gap detection)
  • ✅ Scalability (chunked storage)
  • ✅ Observability (health monitoring, update history)
  • ✅ Flexibility (configurable options, multiple providers)

This foundation enables building robust quantitative trading systems with reliable, up-to-date market data.