Data Quality¶
ml4t-data provides two complementary quality systems: OHLCV validation for structural correctness checks, and anomaly detection for statistical pattern analysis. Both can be run from code or through the CLI.
Use this page when you want to gate provider output before it reaches research, feature engineering, backtests, or production storage.
Minimal Working Example¶
from ml4t.data.anomaly import AnomalyManager
from ml4t.data.validation import OHLCVValidator

# df is an OHLCV DataFrame with timestamp, open, high, low, close, volume columns
validation = OHLCVValidator().validate(df)
report = AnomalyManager().analyze(df, symbol="AAPL")
print(validation.passed, len(report.anomalies))
OHLCV Validation¶
The OHLCVValidator performs eight configurable checks on any DataFrame with
standard OHLCV columns (timestamp, open, high, low, close, volume).
Checks Performed¶
| Check | Severity | What it catches |
|---|---|---|
| Null values | ERROR | Missing prices or volume in any OHLCV column |
| Price consistency | ERROR | high < low, high < open, low > close, etc. |
| Negative prices | CRITICAL | Any OHLC value below zero |
| Negative volume | ERROR | Volume below zero |
| Duplicate timestamps | ERROR | Multiple rows with the same timestamp |
| Chronological order | ERROR | Timestamps not sorted ascending |
| Price staleness | WARNING | Close price unchanged for N+ consecutive days |
| Extreme returns | WARNING | Single-period return exceeding threshold |
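To make the price consistency row concrete, here is a minimal standalone sketch of that kind of check. It does not use or reproduce the OHLCVValidator internals; the function name `price_consistency_violations` and the list-of-dicts input are assumptions chosen for illustration.

```python
from typing import Dict, List

Bar = Dict[str, float]


def price_consistency_violations(bars: List[Bar]) -> List[int]:
    """Return indices of bars where high/low fail to bound the other prices.

    Hypothetical helper for illustration only, not part of ml4t-data.
    """
    bad = []
    for i, bar in enumerate(bars):
        # high must be at least as large as open, close, and low
        top = max(bar["open"], bar["close"], bar["low"])
        # low must be no larger than open, close, and high
        bottom = min(bar["open"], bar["close"], bar["high"])
        if bar["high"] < top or bar["low"] > bottom:
            bad.append(i)
    return bad


bars = [
    {"open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5},   # consistent
    {"open": 10.5, "high": 10.2, "low": 10.0, "close": 10.1},  # high < open
    {"open": 10.1, "high": 10.4, "low": 10.2, "close": 10.3},  # low > open
]
violations = price_consistency_violations(bars)
print(violations)  # [1, 2]
```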
Usage¶
from ml4t.data.validation import OHLCVValidator

# Default settings
validator = OHLCVValidator()
result = validator.validate(df)
if not result.passed:
    for issue in result.issues:
        print(f"[{issue.severity}] {issue.check}: {issue.message}")

# Customize thresholds
validator = OHLCVValidator(
    max_return_threshold=0.3,     # flag returns > 30%
    staleness_threshold=10,       # flag 10+ unchanged days
    check_extreme_returns=False,  # disable return checks entirely
)
Each ValidationIssue includes the check name, severity, affected row count,
and up to 10 sample row indices for debugging.
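The shape of such an issue record can be sketched as follows. This is a hypothetical stand-in, not the library's ValidationIssue class: the names `IssueSketch`, `build_issue`, and `sample_rows` are assumptions, and only the cap of 10 sample indices comes from the description above.

```python
from dataclasses import dataclass
from typing import List

MAX_SAMPLES = 10  # the page states that up to 10 sample row indices are kept


@dataclass
class IssueSketch:
    """Hypothetical stand-in for an issue record (not the real ValidationIssue)."""
    check: str
    severity: str
    row_count: int
    sample_rows: List[int]


def build_issue(check: str, severity: str, bad_rows: List[int]) -> IssueSketch:
    # Keep the full count, but only the first few indices for debugging.
    return IssueSketch(check, severity, len(bad_rows), bad_rows[:MAX_SAMPLES])


issue = build_issue("negative_volume", "ERROR", list(range(100, 125)))
print(issue.row_count, issue.sample_rows)
```

Keeping the full count alongside a capped sample keeps reports small even when thousands of rows fail the same check.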
Validation Rules and Presets¶
For different asset classes, use the built-in presets which adjust thresholds to match expected behavior:
from ml4t.data.validation.rules import ValidationRulePresets
# Crypto is more volatile - wider thresholds
crypto_rules = ValidationRulePresets.crypto_rules()
# max_return_threshold=0.5, staleness_threshold=3, no market hours check
# Forex is less volatile - tighter thresholds
forex_rules = ValidationRulePresets.forex_rules()
# max_return_threshold=0.1, staleness_threshold=10, 24/5 market
# Strict mode for production data
strict_rules = ValidationRulePresets.strict_rules()
# max_return_threshold=0.1, staleness_threshold=3, all checks enabled
Available presets: equity_rules(), crypto_rules(), forex_rules(),
commodity_rules(), strict_rules(), relaxed_rules().
Persistent Rule Sets¶
Save and load validation rules per symbol or pattern:
from pathlib import Path

from ml4t.data.validation.rules import ValidationRuleSet, ValidationRulePresets

ruleset = ValidationRuleSet(name="production")
ruleset.default_rule = ValidationRulePresets.equity_rules()
ruleset.add_rule("BTC*", ValidationRulePresets.crypto_rules())
ruleset.add_rule("EUR*", ValidationRulePresets.forex_rules())

# Save to YAML
ruleset.save(Path("validation_rules.yaml"))

# Load later
ruleset = ValidationRuleSet.load(Path("validation_rules.yaml"))
rules = ruleset.get_rules("BTCUSD")  # matches "BTC*" pattern
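The pattern lookup can be pictured with a small standalone sketch. It assumes glob-style matching and a first-match-wins order, neither of which is confirmed by this page; `resolve_rules` is a hypothetical helper, and plain strings stand in for rule objects.

```python
from fnmatch import fnmatchcase
from typing import Dict


def resolve_rules(symbol: str, pattern_rules: Dict[str, str], default: str) -> str:
    """Return the rules for the first pattern matching symbol, else the default.

    Assumption: patterns are shell-style globs checked in insertion order.
    """
    for pattern, rules in pattern_rules.items():
        if fnmatchcase(symbol, pattern):
            return rules
    return default


pattern_rules = {"BTC*": "crypto_rules", "EUR*": "forex_rules"}
print(resolve_rules("BTCUSD", pattern_rules, "equity_rules"))  # crypto_rules
print(resolve_rules("AAPL", pattern_rules, "equity_rules"))    # equity_rules
```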
Anomaly Detection¶
The anomaly detection system uses statistical methods to find suspicious values that pass structural validation but are unusual relative to the rest of the series, such as outsized returns, volume spikes, or stale prices.
Built-in Detectors¶
ReturnOutlierDetector -- flags unusually large price moves.
- Methods: MAD (default), z-score, IQR
- MAD is robust to fat tails common in financial data
- Severity scales with magnitude: INFO (3x) to CRITICAL (5x+)
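As a rough illustration of the MAD approach, the sketch below computes a modified z-score for each return and flags those beyond a threshold. It is a generic textbook formulation, not the ReturnOutlierDetector source; the 0.6745 scaling constant and the function names are assumptions of this sketch.

```python
from statistics import median
from typing import List


def mad_scores(returns: List[float]) -> List[float]:
    """Modified z-scores based on the median absolute deviation (MAD)."""
    med = median(returns)
    mad = median(abs(r - med) for r in returns)
    if mad == 0:
        # Degenerate case: at least half the returns equal the median.
        return [0.0] * len(returns)
    # 0.6745 rescales MAD so scores are comparable to z-scores for normal data.
    return [0.6745 * (r - med) / mad for r in returns]


def flag_outliers(returns: List[float], threshold: float = 3.0) -> List[int]:
    """Indices whose absolute MAD score exceeds the threshold."""
    return [i for i, s in enumerate(mad_scores(returns)) if abs(s) > threshold]


returns = [0.01, -0.005, 0.002, 0.007, -0.012, 0.004, -0.084, 0.003]
flagged = flag_outliers(returns)
print(flagged)  # [6]
```

Because the median and MAD barely move when one extreme value is present, the -8.4% return stands out clearly, whereas a mean and standard deviation would be pulled toward it.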
VolumeSpikeDetector -- flags unusual volume relative to a rolling baseline.
- Uses rolling z-score over a configurable window (default: 20 bars)
- Filters out low-volume noise with a minimum volume threshold
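A rolling z-score of this kind can be sketched in plain Python. The sketch assumes the baseline is the trailing window excluding the current bar and that only upward spikes are flagged; the actual VolumeSpikeDetector may differ on both points, and `volume_spikes` is a hypothetical name.

```python
from statistics import mean, stdev
from typing import List


def volume_spikes(
    volumes: List[float],
    window: int = 20,
    threshold: float = 3.0,
    min_volume: float = 0.0,
) -> List[int]:
    """Indices where volume exceeds the trailing baseline by threshold sigmas."""
    flagged = []
    for i in range(window, len(volumes)):
        baseline = volumes[i - window:i]  # trailing window, current bar excluded
        sigma = stdev(baseline)
        if sigma == 0 or volumes[i] < min_volume:
            continue  # flat baseline or low-volume noise: skip
        if (volumes[i] - mean(baseline)) / sigma > threshold:
            flagged.append(i)
    return flagged


volumes = [1000.0 + 10 * (i % 5) for i in range(25)]
volumes[22] = 5000.0  # inject a spike
spikes = volume_spikes(volumes)
print(spikes)  # [22]
```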
PriceStalenessDetector -- flags periods where prices do not change.
- Can check close-only or all OHLC prices
- Severity scales with duration: INFO (5 days) to CRITICAL (20+ days)
- Groups consecutive unchanged periods to avoid duplicate alerts
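Grouping consecutive unchanged prices into a single alert can be sketched as a run-length scan. This is an illustrative reconstruction under the assumption that a run is counted in bars of identical close; `stale_runs` and `min_run` are hypothetical names, not the detector's API.

```python
from typing import List, Tuple


def stale_runs(closes: List[float], min_run: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for runs of identical closes.

    Only runs of at least min_run bars are reported, one entry per run.
    """
    runs = []
    start = 0
    for i in range(1, len(closes) + 1):
        # A run ends at the end of the series or when the price changes.
        if i == len(closes) or closes[i] != closes[start]:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = i
    return runs


closes = [10.0, 10.1, 10.1, 10.1, 10.1, 10.1, 10.3, 10.3, 10.4]
runs = stale_runs(closes)
print(runs)  # [(1, 5)]
```

Reporting one (start, end) pair per run is what prevents a 20-bar stale period from producing 20 separate alerts.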
Running Anomaly Detection¶
from ml4t.data.anomaly import AnomalyManager, AnomalyConfig

# Default configuration (all detectors enabled)
manager = AnomalyManager()
report = manager.analyze(df, symbol="AAPL")
print(f"Found {len(report.anomalies)} anomalies")
for anomaly in report.anomalies:
    print(anomaly)
    # [WARNING] AAPL @ 2024-03-15: Unusual return of -8.42% (MAD z-score: 3.21)

# Check for critical issues
if report.has_critical_issues():
    print("Critical data quality issues found!")

# Convert to DataFrame for analysis
anomaly_df = report.to_dataframe()
Custom Configuration¶
from ml4t.data.anomaly import AnomalyConfig, AnomalyManager
from ml4t.data.anomaly.config import (
    ReturnOutlierConfig,
    VolumeSpikeConfig,
    PriceStalenessConfig,
)

config = AnomalyConfig(
    return_outliers=ReturnOutlierConfig(
        method="mad",     # "mad", "zscore", or "iqr"
        threshold=4.0,    # higher than the default 3.0, so fewer returns are flagged
        min_samples=50,
    ),
    volume_spikes=VolumeSpikeConfig(
        window=30,        # 30-bar rolling baseline
        threshold=4.0,
        min_volume=1000,  # ignore low-volume bars
    ),
    price_staleness=PriceStalenessConfig(
        max_unchanged_days=3,
        check_close_only=True,
    ),
)
manager = AnomalyManager(config=config)
Asset-Class and Symbol Overrides¶
The config supports per-asset-class and per-symbol threshold overrides:
config = AnomalyConfig(
    asset_overrides={
        "crypto": {
            "return_outliers": {"threshold": 5.0},  # crypto is volatile
            "price_staleness": {"max_unchanged_days": 2},
        }
    },
    symbol_overrides={
        "BTCUSD": {
            "return_outliers": {"threshold": 6.0},  # BTC even more volatile
        }
    },
)
manager = AnomalyManager(config=config)
report = manager.analyze(df, symbol="BTCUSD", asset_class="crypto")
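One plausible way to read these overrides is as layers applied in order: defaults first, then the asset class, then the symbol. The sketch below shows that merge on plain dictionaries. The precedence order is an assumption suggested by the comments above rather than something this page states, `resolve_config` is a hypothetical helper, and the base values are illustrative.

```python
from copy import deepcopy
from typing import Dict, Optional

Config = Dict[str, Dict[str, object]]


def resolve_config(
    base: Config,
    asset_overrides: Dict[str, Config],
    symbol_overrides: Dict[str, Config],
    symbol: str,
    asset_class: Optional[str] = None,
) -> Config:
    """Merge overrides onto base; later layers win (assumed precedence)."""
    resolved = deepcopy(base)
    layers = [
        asset_overrides.get(asset_class or "", {}),  # asset-class layer
        symbol_overrides.get(symbol, {}),            # symbol layer, applied last
    ]
    for layer in layers:
        for detector, params in layer.items():
            resolved.setdefault(detector, {}).update(params)
    return resolved


base = {  # illustrative defaults, not necessarily the library's
    "return_outliers": {"method": "mad", "threshold": 3.0},
    "price_staleness": {"max_unchanged_days": 5},
}
asset_overrides = {
    "crypto": {
        "return_outliers": {"threshold": 5.0},
        "price_staleness": {"max_unchanged_days": 2},
    }
}
symbol_overrides = {"BTCUSD": {"return_outliers": {"threshold": 6.0}}}

btc = resolve_config(base, asset_overrides, symbol_overrides, "BTCUSD", "crypto")
eth = resolve_config(base, asset_overrides, symbol_overrides, "ETHUSD", "crypto")
print(btc["return_outliers"]["threshold"], eth["return_outliers"]["threshold"])
```

Under this reading, BTCUSD ends up with a return threshold of 6.0 while other crypto symbols get 5.0, and both inherit the crypto staleness setting.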
Batch Analysis and Reports¶
from pathlib import Path

# Analyze multiple symbols
datasets = {"AAPL": df_aapl, "MSFT": df_msft, "GOOGL": df_googl}
reports = manager.analyze_batch(datasets)

# Save reports to disk
for symbol, report in reports.items():
    if report.anomalies:
        manager.save_report(report, Path("./anomaly_reports"))

# Filter one report by severity
report = reports["AAPL"]
filtered = manager.filter_by_severity(report, min_severity="warning")

# Get statistics
stats = manager.get_statistics(report)
# {"total_anomalies": 12, "by_severity": {"warning": 8, "error": 3, ...}, ...}
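The idea behind severity filtering and per-severity counts can be shown on plain dictionaries. This sketch assumes the ordering info < warning < error < critical and uses hypothetical helpers (`filter_min_severity`, `severity_counts`); it does not call or mirror the AnomalyManager methods.

```python
from collections import Counter
from typing import Dict, List

# Assumed ordering of the severity levels named on this page.
SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2, "critical": 3}


def filter_min_severity(anomalies: List[dict], min_severity: str = "warning") -> List[dict]:
    """Keep anomalies at or above the given severity level."""
    floor = SEVERITY_ORDER[min_severity]
    return [a for a in anomalies if SEVERITY_ORDER[a["severity"]] >= floor]


def severity_counts(anomalies: List[dict]) -> Dict[str, int]:
    """Count anomalies per severity level."""
    return dict(Counter(a["severity"] for a in anomalies))


anomalies = [
    {"type": "return_outlier", "severity": "info"},
    {"type": "return_outlier", "severity": "warning"},
    {"type": "volume_spike", "severity": "error"},
    {"type": "price_staleness", "severity": "critical"},
]
kept = filter_min_severity(anomalies, "error")
print([a["type"] for a in kept])
print(severity_counts(anomalies))
```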
CLI Integration¶
Run validation and anomaly detection together from the command line:
# Basic validation
ml4t-data validate -s AAPL --storage-path ./data
# With anomaly detection
ml4t-data validate -s AAPL --anomalies --storage-path ./data
# Filter noise -- only show errors and critical
ml4t-data validate --all --anomalies --severity error --storage-path ./data
# Save anomaly reports for later review
ml4t-data validate --all --anomalies --save-report --storage-path ./data
The CLI returns exit code 1 if any issues are found, making it suitable for CI pipelines and pre-processing checks.
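A pipeline step can act on that exit code with a small wrapper like the one below. The wrapper name `data_quality_gate` is hypothetical, and the demonstration uses stand-in commands that simply exit with 0 or 1 so the sketch runs without ml4t-data installed.

```python
import subprocess
import sys
from typing import List


def data_quality_gate(cmd: List[str]) -> bool:
    """Run a validation command and return True only if it exits with code 0."""
    result = subprocess.run(cmd, check=False)
    return result.returncode == 0


# Stand-in commands that mimic a clean run and a run that found issues.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(data_quality_gate(passing))  # True
print(data_quality_gate(failing))  # False
```

In a real pipeline the command would be one of the invocations above, for example `["ml4t-data", "validate", "-s", "AAPL", "--anomalies", "--storage-path", "./data"]`, with downstream steps skipped when the gate returns False.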
See It In The Book¶
The strongest book connection is the Chapter 2 data-quality workflow:
These scripts show how validation and anomaly detection fit into the broader dataset lifecycle rather than living as isolated checks.