ML4T Data
ML4T Data Documentation
Unified market data acquisition from 19+ providers

Data Quality

ml4t-data provides two complementary quality systems: OHLCV validation for structural correctness checks, and anomaly detection for statistical pattern analysis. Both can be run from code or through the CLI.

Use this page when you want to gate provider output before it reaches research, feature engineering, backtests, or production storage.

Minimal Working Example

from ml4t.data.anomaly import AnomalyManager
from ml4t.data.validation import OHLCVValidator

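# df is assumed to be an already-loaded DataFrame with the standard
# OHLCV columns: timestamp, open, high, low, close, volume.
validation = OHLCVValidator().validate(df)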
report = AnomalyManager().analyze(df, symbol="AAPL")

print(validation.passed, len(report.anomalies))

OHLCV Validation

The OHLCVValidator performs eight configurable checks on any DataFrame with standard OHLCV columns (timestamp, open, high, low, close, volume).

Checks Performed

Check | Severity | What it catches
--- | --- | ---
Null values | ERROR | Missing prices or volume in any OHLCV column
Price consistency | ERROR | high < low, high < open, low > close, etc.
Negative prices | CRITICAL | Any OHLC value below zero
Negative volume | ERROR | Volume below zero
Duplicate timestamps | ERROR | Multiple rows with the same timestamp
Chronological order | ERROR | Timestamps not sorted ascending
Price staleness | WARNING | Close price unchanged for N+ consecutive days
Extreme returns | WARNING | Single-period return exceeding threshold
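To make the structural checks concrete, here is a minimal, library-independent sketch of a few of them over plain dictionaries. It is not OHLCVValidator's implementation: the function name `structural_issues`, the check labels, and the `(check, row_index)` output format are assumptions made for illustration.

```python
from datetime import date


def structural_issues(rows):
    """Return (check, row_index) pairs for a few structural OHLCV problems.

    Hypothetical helper for illustration only; not part of ml4t-data.
    """
    issues = []
    seen = set()
    previous_ts = None
    for i, row in enumerate(rows):
        o, h, l, c = row["open"], row["high"], row["low"], row["close"]
        # Negative prices: any OHLC value below zero.
        if min(o, h, l, c) < 0:
            issues.append(("negative_prices", i))
        # Price consistency: high is the upper bound of the bar,
        # low is the lower bound.
        if h < max(o, c, l) or l > min(o, c):
            issues.append(("price_consistency", i))
        if row["volume"] < 0:
            issues.append(("negative_volume", i))
        ts = row["timestamp"]
        if ts in seen:
            issues.append(("duplicate_timestamps", i))
        elif previous_ts is not None and ts < previous_ts:
            issues.append(("chronological_order", i))
        seen.add(ts)
        previous_ts = ts
    return issues


rows = [
    {"timestamp": date(2024, 1, 2), "open": 10.0, "high": 11.0,
     "low": 9.5, "close": 10.5, "volume": 1000},
    # high is below open
    {"timestamp": date(2024, 1, 3), "open": 10.5, "high": 10.4,
     "low": 10.0, "close": 10.2, "volume": 900},
    # repeated timestamp and negative volume
    {"timestamp": date(2024, 1, 3), "open": 10.2, "high": 10.6,
     "low": 10.1, "close": 10.3, "volume": -5},
]
print(structural_issues(rows))
```

The second row fails the price consistency check, and the third row trips both the negative volume and duplicate timestamp checks.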

Usage

from ml4t.data.validation import OHLCVValidator

# Default settings
validator = OHLCVValidator()
result = validator.validate(df)

if not result.passed:
    for issue in result.issues:
        print(f"[{issue.severity}] {issue.check}: {issue.message}")

# Customize thresholds
validator = OHLCVValidator(
    max_return_threshold=0.3,    # flag returns > 30%
    staleness_threshold=10,      # flag 10+ unchanged days
    check_extreme_returns=False, # disable return checks entirely
)

Each ValidationIssue includes the check name, severity, affected row count, and up to 10 sample row indices for debugging.
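The shape of such an issue record, and one way to gate on severity, can be sketched without the library. The `Issue` dataclass, its field names, `make_issue`, `blocking`, and the severity ordering INFO < WARNING < ERROR < CRITICAL are all assumptions for illustration; the real ValidationIssue may differ.

```python
from dataclasses import dataclass, field

# Assumed ordering, lowest to highest.
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}


@dataclass
class Issue:
    """Hypothetical stand-in for a validation issue record."""
    check: str
    severity: str
    row_count: int
    sample_rows: list = field(default_factory=list)


def make_issue(check, severity, bad_rows, max_samples=10):
    # Keep the full count but only the first few row indices for debugging.
    return Issue(check, severity, len(bad_rows), list(bad_rows[:max_samples]))


def blocking(issues, min_severity="ERROR"):
    """Return the issues at or above min_severity."""
    floor = SEVERITY_RANK[min_severity]
    return [i for i in issues if SEVERITY_RANK[i.severity] >= floor]


issues = [
    make_issue("extreme_returns", "WARNING", [3, 7]),
    make_issue("null_values", "ERROR", list(range(25))),
]
for issue in blocking(issues):
    print(issue.check, issue.row_count, issue.sample_rows)
```

A gate like `blocking` lets a pipeline stop on ERROR and CRITICAL issues while only logging warnings such as staleness or extreme returns.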

Validation Rules and Presets

For different asset classes, use the built-in presets, which adjust thresholds to match each market's expected behavior:

from ml4t.data.validation.rules import ValidationRulePresets

# Crypto is more volatile - wider thresholds
crypto_rules = ValidationRulePresets.crypto_rules()
# max_return_threshold=0.5, staleness_threshold=3, no market hours check

# Forex is less volatile - tighter thresholds
forex_rules = ValidationRulePresets.forex_rules()
# max_return_threshold=0.1, staleness_threshold=10, 24/5 market

# Strict mode for production data
strict_rules = ValidationRulePresets.strict_rules()
# max_return_threshold=0.1, staleness_threshold=3, all checks enabled

Available presets: equity_rules(), crypto_rules(), forex_rules(), commodity_rules(), strict_rules(), relaxed_rules().

Persistent Rule Sets

Save and load validation rules per symbol or pattern:

from pathlib import Path

from ml4t.data.validation.rules import ValidationRuleSet, ValidationRulePresets

ruleset = ValidationRuleSet(name="production")
ruleset.default_rule = ValidationRulePresets.equity_rules()
ruleset.add_rule("BTC*", ValidationRulePresets.crypto_rules())
ruleset.add_rule("EUR*", ValidationRulePresets.forex_rules())

# Save to YAML
ruleset.save(Path("validation_rules.yaml"))

# Load later
ruleset = ValidationRuleSet.load(Path("validation_rules.yaml"))
rules = ruleset.get_rules("BTCUSD")  # matches "BTC*" pattern
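The pattern lookup above can be pictured as glob matching with a fallback to the default rule. The sketch below uses the standard library's `fnmatch` and assumes first-match-wins ordering; both the matching semantics and the helper name `resolve_rules` are assumptions, not the documented behavior of ValidationRuleSet.

```python
from fnmatch import fnmatchcase


def resolve_rules(symbol, pattern_rules, default):
    """Return the rules for the first pattern matching symbol, else default.

    Hypothetical helper; rules are represented by plain strings here.
    """
    for pattern, rules in pattern_rules:
        if fnmatchcase(symbol, pattern):
            return rules
    return default


pattern_rules = [("BTC*", "crypto_rules"), ("EUR*", "forex_rules")]

print(resolve_rules("BTCUSD", pattern_rules, "equity_rules"))  # crypto_rules
print(resolve_rules("EURUSD", pattern_rules, "equity_rules"))  # forex_rules
print(resolve_rules("AAPL", pattern_rules, "equity_rules"))    # equity_rules
```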

Anomaly Detection

The anomaly detection system uses statistical methods to flag data that passes structural validation but still looks suspicious, such as outlier returns, volume spikes, and stale prices.

Built-in Detectors

ReturnOutlierDetector -- flags unusually large price moves.

  • Methods: MAD (default), z-score, IQR
  • MAD is robust to fat tails common in financial data
  • Severity scales with magnitude: INFO (3x) to CRITICAL (5x+)
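As a rough illustration of why MAD suits fat-tailed returns, the sketch below computes the common modified z-score, 0.6745 * (x - median) / MAD, and compares it with a plain z-score on the same data. The function name `mad_scores` and the 0.6745 scaling constant are assumptions; the detector's exact formula and thresholds may differ.

```python
from statistics import mean, median, stdev


def mad_scores(values):
    """Modified z-scores based on the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case (at least half the values equal the median):
        # there is nothing to scale by, so report no deviation.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


returns = [0.01, -0.005, 0.002, 0.004, -0.003, 0.006, -0.25]
scores = mad_scores(returns)
flagged = [i for i, s in enumerate(scores) if abs(s) > 3.0]

# Plain z-score of the same -25% return, for comparison.
z_crash = (returns[-1] - mean(returns)) / stdev(returns)
print(flagged, round(z_crash, 2))
```

In this sample only the -25% return is flagged by the MAD score. Its plain z-score stays below 3, because the outlier itself inflates the standard deviation it is measured against.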

VolumeSpikeDetector -- flags unusual volume relative to a rolling baseline.

  • Uses rolling z-score over a configurable window (default: 20 bars)
  • Filters out low-volume noise with a minimum volume threshold
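A rolling z-score with a minimum-volume filter can be sketched as follows. The helper `volume_spikes` is hypothetical, and several details are assumptions: the baseline uses the previous `window` bars and excludes the current one, the population standard deviation is used, and only upward spikes are flagged.

```python
from statistics import mean, pstdev


def volume_spikes(volumes, window=20, threshold=3.0, min_volume=0):
    """Return indices whose volume exceeds the rolling baseline by threshold sigmas."""
    flagged = []
    for i in range(window, len(volumes)):
        baseline = volumes[i - window:i]  # previous `window` bars only
        mu, sigma = mean(baseline), pstdev(baseline)
        # Skip flat baselines and bars below the minimum volume.
        if sigma == 0 or volumes[i] < min_volume:
            continue
        if (volumes[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


volumes = [1000 + 10 * (i % 5) for i in range(25)]
volumes[22] = 5000  # inject a spike
print(volumes_flagged := volume_spikes(volumes))
```

Raising `min_volume` above the spike's volume suppresses the flag, which is how thinly traded bars can be kept out of the report.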

PriceStalenessDetector -- flags periods where prices do not change.

  • Can check close-only or all OHLC prices
  • Severity scales with duration: INFO (5 days) to CRITICAL (20+ days)
  • Groups consecutive unchanged periods to avoid duplicate alerts
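Grouping consecutive unchanged closes into a single run, so that one stale period produces one alert rather than one per bar, might look like the sketch below. The helper name `stale_runs`, the `(start, end, length)` output, and counting run length in bars are assumptions for illustration.

```python
from itertools import groupby


def stale_runs(closes, min_length=5):
    """Return (start_index, end_index, length) for each run of identical closes."""
    runs = []
    start = 0
    for _, group in groupby(closes):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((start, start + length - 1, length))
        start += length
    return runs


closes = [10.0, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 11.0, 11.0, 12.0]
print(stale_runs(closes))
```

Here the six bars at 10.5 are reported once as a single run, and the two bars at 11.0 fall below the minimum length.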

Running Anomaly Detection

from ml4t.data.anomaly import AnomalyManager, AnomalyConfig

# Default configuration (all detectors enabled)
manager = AnomalyManager()
report = manager.analyze(df, symbol="AAPL")

print(f"Found {len(report.anomalies)} anomalies")
for anomaly in report.anomalies:
    print(anomaly)
    # [WARNING] AAPL @ 2024-03-15: Unusual return of -8.42% (MAD z-score: 3.21)

# Check for critical issues
if report.has_critical_issues():
    print("Critical data quality issues found!")

# Convert to DataFrame for analysis
anomaly_df = report.to_dataframe()

Custom Configuration

from ml4t.data.anomaly import AnomalyConfig, AnomalyManager
from ml4t.data.anomaly.config import (
    ReturnOutlierConfig,
    VolumeSpikeConfig,
    PriceStalenessConfig,
)

config = AnomalyConfig(
    return_outliers=ReturnOutlierConfig(
        method="mad",       # "mad", "zscore", or "iqr"
        threshold=4.0,      # higher than the default 3.0, so fewer returns are flagged
        min_samples=50,
    ),
    volume_spikes=VolumeSpikeConfig(
        window=30,           # 30-day rolling baseline
        threshold=4.0,
        min_volume=1000,     # ignore low-volume days
    ),
    price_staleness=PriceStalenessConfig(
        max_unchanged_days=3,
        check_close_only=True,
    ),
)

manager = AnomalyManager(config=config)

Asset-Class and Symbol Overrides

The config supports per-asset-class and per-symbol threshold overrides:

config = AnomalyConfig(
    asset_overrides={
        "crypto": {
            "return_outliers": {"threshold": 5.0},  # crypto is volatile
            "price_staleness": {"max_unchanged_days": 2},
        }
    },
    symbol_overrides={
        "BTCUSD": {
            "return_outliers": {"threshold": 6.0},  # BTC even more volatile
        }
    },
)

manager = AnomalyManager(config=config)
report = manager.analyze(df, symbol="BTCUSD", asset_class="crypto")

Batch Analysis and Reports

from pathlib import Path

# Analyze multiple symbols
datasets = {"AAPL": df_aapl, "MSFT": df_msft, "GOOGL": df_googl}
reports = manager.analyze_batch(datasets)

# Save reports to disk
for symbol, report in reports.items():
    if report.anomalies:
        manager.save_report(report, Path("./anomaly_reports"))

# Filter by severity
filtered = manager.filter_by_severity(report, min_severity="warning")

# Get statistics
stats = manager.get_statistics(report)
# {"total_anomalies": 12, "by_severity": {"warning": 8, "error": 3, ...}, ...}

CLI Integration

Run validation and anomaly detection together from the command line:

# Basic validation
ml4t-data validate -s AAPL --storage-path ./data

# With anomaly detection
ml4t-data validate -s AAPL --anomalies --storage-path ./data

# Filter noise -- only show errors and critical
ml4t-data validate --all --anomalies --severity error --storage-path ./data

# Save anomaly reports for later review
ml4t-data validate --all --anomalies --save-report --storage-path ./data

The CLI returns exit code 1 if any issues are found, making it suitable for CI pipelines and pre-processing checks.
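In a pipeline driven from Python rather than a shell, the exit code can be checked with `subprocess`. This is a minimal sketch: the helper `exit_code` is hypothetical, and the command list simply reuses the flags shown above.

```python
import subprocess
import sys


def exit_code(command):
    """Run a command and return its exit status without raising on failure."""
    return subprocess.run(command, check=False).returncode


# Command assembled from the CLI flags shown above.
QUALITY_GATE = [
    "ml4t-data", "validate", "--all", "--anomalies",
    "--severity", "error", "--storage-path", "./data",
]

# Typical use in a pipeline step:
#     if exit_code(QUALITY_GATE) != 0:
#         sys.exit("Data quality gate failed")
```

Using `check=False` keeps the decision about what to do with a non-zero status in the calling code, for example stopping a load job while still archiving the report.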

See It In The Book

The strongest book connection is the Chapter 2 data-quality workflow:

These scripts show how validation and anomaly detection fit into the broader dataset lifecycle rather than living as isolated checks.

Next Steps