Learning Objectives
- Explain why trading research is path-limited and how adaptive search and multiple testing can inflate apparent backtest performance.
- Use classical simulation baselines, including bootstrap and stochastic volatility models, as interpretable benchmarks for synthetic data generation.
- Select a synthetic-data approach that matches the data structure and downstream objective, including learned generators for time series and tabular financial data.
- Diagnose generated data using stylized-fact, dependence, and task-based evaluation methods, including Train-Synthetic-Test-Real comparisons.
- Assess privacy and generator-specific risks, including leakage, bias amplification, overfitting to the generator, and limited scenario novelty.
The Quant's Dilemma
Quantitative strategy development must make inferences from a single realized history with limited crises, regime shifts, and correlation breakdowns. If a researcher tests 100 strategy configurations that all have zero true Sharpe ratio, the expected maximum in-sample Sharpe exceeds 2.5 (Bailey et al. 2015), making backtest overfitting near-certain. Synthetic data is positioned as simulation infrastructure that turns one realized history into a distribution of plausible histories — but the generator must reproduce tail outcomes and dependence structure, not just bulk distributional similarity.
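The multiple-testing effect is easy to reproduce numerically. This sketch (an illustration, not taken from the text) simulates 100 strategies with zero true edge over one year of daily data and reports the best in-sample annualized Sharpe:

```python
import numpy as np

# 100 strategy configurations, one year of daily returns each,
# all with zero true mean return (no real edge anywhere).
rng = np.random.default_rng(42)
n_strategies, n_days = 100, 252

returns = rng.standard_normal((n_strategies, n_days)) * 0.01
sharpe = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(f"best in-sample Sharpe: {sharpe.max():.2f}")
# Each annualized Sharpe is roughly N(0, 1), so the expected maximum
# across 100 trials is about 2.5 — selection alone manufactures a
# seemingly excellent backtest.
```

Rerunning with different seeds keeps producing a "winner" near 2.5 despite every strategy being noise, which is the core of the quant's dilemma.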
Classical Simulation Baselines
Bootstrap methods and parametric stochastic models (GBM, jump-diffusion, Heston, GARCH) remain strong reference points for evaluating learned generators. Bootstrap variants range from the IID bootstrap, which preserves marginals but destroys temporal dependence, to block-based variants such as the stationary bootstrap, which preserve short-range dependence. Any deep generative model should at least outperform these baselines on the diagnostics that matter for the downstream task.
1 notebook
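The stationary bootstrap (Politis and Romano) is simple enough to sketch directly. This minimal version resamples blocks of geometrically distributed length from a hypothetical return series, preserving short-range dependence while randomizing the path:

```python
import numpy as np

def stationary_bootstrap(x, mean_block=20, rng=None):
    """Resample x in blocks of geometric mean length `mean_block`."""
    rng = rng or np.random.default_rng()
    n = len(x)
    out = np.empty(n)
    i = rng.integers(n)          # start of the first block
    p = 1.0 / mean_block         # per-step probability of a new block
    for t in range(n):
        out[t] = x[i]
        # With probability p, jump to a fresh random start (new block);
        # otherwise continue the current block, wrapping at the end.
        i = rng.integers(n) if rng.random() < p else (i + 1) % n
    return out

rng = np.random.default_rng(0)
returns = rng.standard_normal(1000) * 0.01   # hypothetical daily returns
sample = stationary_bootstrap(returns, mean_block=20, rng=rng)
```

Because each resampled value is drawn from the original series, marginals are preserved by construction; the block structure is what carries autocorrelation into the synthetic path.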
Generative Model Taxonomy
Four families of learned generators are introduced: variational autoencoders (stable but may oversmooth), GANs (sharp but unstable), diffusion models (stable with iterative denoising), and LLM-based tabular generators (serialize rows as text). The key distinction is that learned generators can represent complex dependence structures that are difficult to specify parametrically, but can also fail silently by smoothing tails or collapsing modes.
GANs for Financial Time Series
Four GAN variants address specific limitations: TimeGAN adds supervised temporal objectives (TSTR ratio 1.70), Tail-GAN augments with VaR/ES penalties (reducing VaR error from 102% to 11.5%), Sig-CWGAN uses path-signature kernels for temporal fidelity (TSTR ratio 0.97), and GT-GAN uses neural ODEs for irregularly sampled data. Honest results and shared challenges (mode collapse, training instability) are documented.
4 notebooks
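The TSTR ratios quoted above can be made concrete with a toy version of the protocol: fit the same model once on real data and once on synthetic data, score both on a held-out real test set, and take the ratio. Here the data is a hypothetical AR(1) process and "synthetic" is simply a fresh draw from the same process, standing in for generator output:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ar1(n, phi=0.6, rng=None):
    """Simulate an AR(1) series x[t] = phi * x[t-1] + noise."""
    rng = rng or np.random.default_rng()
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

def lagged_xy(x):
    """One-step-ahead prediction task: predict x[t] from x[t-1]."""
    return x[:-1].reshape(-1, 1), x[1:]

rng = np.random.default_rng(1)
real_train = ar1(1000, rng=rng)
real_test = ar1(500, rng=rng)
synthetic = ar1(1000, rng=rng)   # stand-in for a generator's output

Xr, yr = lagged_xy(real_train)
Xs, ys = lagged_xy(synthetic)
Xt, yt = lagged_xy(real_test)

trtr = LinearRegression().fit(Xr, yr).score(Xt, yt)  # train real,  test real
tstr = LinearRegression().fit(Xs, ys).score(Xt, yt)  # train synth, test real
print(f"TSTR ratio: {tstr / trtr:.2f}")  # near 1.0 for a faithful generator
```

A ratio near 1.0 (as reported for Sig-CWGAN) means models trained on synthetic data transfer to real data about as well as models trained on real data; ratios far above 1.0 usually indicate the synthetic task is easier than the real one, not that the generator is better.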
Diffusion Models for Financial Time Series
Diffusion-TS uses a Transformer encoder-decoder with trend-plus-seasonal decomposition, achieving a KS statistic of 0.06 and a TSTR ratio of 1.00 on 20 ETFs. Classifier-guided conditional generation enables regime-specific scenario production, with a 2.6x volatility ratio between generated high- and low-volatility samples. The key tradeoff is computational cost versus training stability.
1 notebook
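The forward (noising) half of any diffusion model has a closed form that is worth seeing once; the learned part is only the reverse denoiser. This sketch uses an assumed linear beta schedule — not necessarily the one Diffusion-TS uses — applied to a hypothetical return series:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(252) * 0.01      # hypothetical daily returns

T = 100
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def q_sample(x0, t, rng):
    """Closed-form forward step: x_t ~ N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

x_mid = q_sample(x0, T // 2, rng)
x_end = q_sample(x0, T - 1, rng)
# As t grows, the signal fraction sqrt(abar_t) shrinks, so x_T is close
# to pure Gaussian noise; generation runs this process in reverse,
# denoising step by step — the source of the computational cost noted above.
```

The iterative reverse pass is why sampling is slower than a single GAN forward pass, which is the cost side of the stability-versus-cost tradeoff.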
LLMs for Structured Financial Data
Autoregressive language models generate synthetic tabular data by serializing rows as text. Using the GReaT framework with distilgpt2, fine-tuned in ~10 minutes on a GPU, the approach achieves a TSTR AUC-ROC of 0.84 and KS statistics below 0.035. Specific failure modes include invalid records from autoregressive generation and numerical fidelity limitations from token-level optimization.
1 notebook
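The serialization step is the conceptual core of the approach. This sketch shows a GReaT-style textual encoding of a tabular row (the column names and values here are hypothetical, and the exact template is an assumption):

```python
import random

row = {"age": 42, "income": 55000, "default": "no"}

def serialize(row, rng=None):
    """Encode a row as a 'column is value' sentence in random column order."""
    rng = rng or random
    items = list(row.items())
    rng.shuffle(items)   # random order avoids baking in a fixed column position
    return ", ".join(f"{k} is {v}" for k, v in items)

print(serialize(row, random.Random(0)))
# The LM is fine-tuned on such strings; sampled completions are parsed
# back into rows. Completions that fail to parse (missing columns,
# malformed values) are the invalid-record failure mode noted above.
```

Numbers are handled as token sequences rather than continuous values, which is why token-level optimization limits numerical fidelity: the model has no built-in notion that 55000 and 55001 are close.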
The Fidelity–Utility–Privacy Framework
A three-dimensional evaluation framework: fidelity (marginal + temporal structure), utility (TSTR benchmarks), and privacy (empirical leakage + differential privacy). Applied to DP-GAN, strong privacy (epsilon=1) degrades fidelity by 6x while epsilon in [5,10] offers a practical sweet spot. The minimum validation checklist requires one distributional metric, one task benchmark, and one leakage test.
1 notebook
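Two of the three checklist items can be sketched in a few lines on a single numeric column: a distributional metric (the two-sample KS statistic) and an empirical leakage test (the rate of exact copies of training records). The data here is hypothetical, with a fresh Gaussian sample standing in for generator output:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
real = rng.standard_normal(2000)
synthetic = rng.standard_normal(2000)   # stand-in for generator output

# Fidelity: two-sample Kolmogorov-Smirnov statistic (0 = identical CDFs).
ks_stat, _ = stats.ks_2samp(real, synthetic)

# Privacy/leakage: fraction of synthetic points that exactly replicate
# a real training record (should be ~0 for continuous data).
copy_rate = np.isin(synthetic, real).mean()

print(f"KS={ks_stat:.3f}, exact-copy rate={copy_rate:.3f}")
```

The third item, a task benchmark such as TSTR on the actual downstream model, completes the minimum checklist; pass/fail thresholds are best set relative to the classical baselines rather than in the abstract.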