Chapter 21: Reinforcement Learning

Policy Gradient Theorem and Actor-Critic Architectures

Policy gradient methods optimize parameterized policies directly, enabling the continuous action spaces and stochastic behaviors that execution and hedging demand.

Why This Matters

Value-based reinforcement learning methods like Q-learning produce policies indirectly: they estimate the value of each action and then pick the best one. This works for discrete action spaces (buy, hold, sell) but breaks down when actions are continuous -- trade sizes, hedge ratios, limit-order placements -- where the set of possible actions is infinite. Policy gradient methods solve this by parameterizing the policy directly and optimizing it via gradient ascent on expected reward.

All major financial RL algorithms -- PPO for optimal execution, A2C for market making, DDPG/SAC for deep hedging -- descend from the policy gradient theorem and actor-critic architecture. Without understanding how gradients of expected reward flow through a parameterized policy, practitioners cannot diagnose why these algorithms succeed or fail in financial settings -- particularly why training instability, high-variance gradients, and noisy financial rewards create challenges that standard supervised learning does not face. This primer provides that foundation. The MDP formalism and value functions are covered in Primers 01 and 03.

Intuition

In supervised learning, you have labeled data: input-output pairs. In policy gradient RL, you have no labels. Instead, an agent takes actions, observes outcomes, and must infer which actions were good. The core challenge is credit assignment: if a trading agent executes 100 orders over a day and the total implementation shortfall is -5 bps, which orders were responsible?

The policy gradient theorem provides a principled answer. It says: increase the probability of actions that led to better-than-expected outcomes, and decrease the probability of those that led to worse-than-expected outcomes. The "better-than-expected" part is critical -- it means comparing each action's outcome to a baseline, which dramatically reduces the noise in the learning signal.

Think of it as performance review with a benchmark. Telling a trader "you made money today" is less informative than telling them "you beat the TWAP benchmark by 2 bps on these three orders." The baseline gives the gradient direction; without it, the learning signal is swamped by noise.

Formal Core

The Policy Gradient Theorem

Let $\pi_\theta(a|s)$ be a stochastic policy parameterized by $\theta$ that maps states $s$ to a probability distribution over actions $a$. Let $J(\theta) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^T \gamma^t r_t]$ be the expected discounted return. The policy gradient theorem, proved by Sutton et al., establishes [ref:U3YKNG9T]:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot Q^{\pi_\theta}(s_t, a_t)\right]$$

where $Q^{\pi_\theta}(s,a)$ is the action-value function under policy $\pi_\theta$. The key insight is that this gradient can be estimated from experience without differentiating the (unknown) environment transition dynamics $P(s'|s,a)$. The gradient depends only on the policy's own log-probability (the score function $\nabla_\theta \log \pi_\theta$) and the action-value, both of which are observable or estimable from sampled trajectories [ref:U3YKNG9T].
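To make the score function concrete, here is a minimal check (ours, not from the chapter) that the analytic score of a one-dimensional Gaussian policy matches a finite-difference derivative of its log-density. The function names are illustrative:

```python
import math

def log_prob_gaussian(a, mu, sigma):
    """Log-density of a Gaussian policy pi(a|s) with mean mu and std sigma."""
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def score_mu(a, mu, sigma):
    """Analytic score w.r.t. the mean: d/d_mu log pi = (a - mu) / sigma^2."""
    return (a - mu) / sigma ** 2

# Central finite difference should agree with the analytic score.
a, mu, sigma, eps = 0.3, 0.1, 0.5, 1e-6
numeric = (log_prob_gaussian(a, mu + eps, sigma)
           - log_prob_gaussian(a, mu - eps, sigma)) / (2 * eps)
assert abs(numeric - score_mu(a, mu, sigma)) < 1e-6
```

The point of the check: everything the gradient needs is computable from the policy itself; the environment dynamics never appear.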

REINFORCE and the Variance Problem

The simplest instantiation is REINFORCE (Williams, 1992): replace $Q^{\pi}(s_t, a_t)$ with the Monte Carlo return $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$ sampled from a complete episode. This is unbiased but suffers from high variance because every sampled return carries the accumulated noise of all future rewards. For financial applications where rewards are noisy (market returns, execution slippage), this variance can make learning prohibitively slow.
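A sketch of the REINFORCE estimator's ingredients, assuming per-step scores $\nabla_\theta \log \pi_\theta(a_t|s_t)$ are supplied by the policy (here as plain numbers for a scalar parameter); the helper names are ours:

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo returns G_t = sum_k gamma^(k-t) r_k, computed backward."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_gradient(scores, rewards, gamma):
    """One-episode REINFORCE estimate: sum_t score_t * G_t."""
    returns = discounted_returns(rewards, gamma)
    return sum(s * G for s, G in zip(scores, returns))

# Reward only at episode end: every earlier step inherits the final
# (noisy) outcome, which is the source of the variance problem.
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.9)  # ~[0.81, 0.9, 1.0]
```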

Baseline Subtraction

Subtracting a state-dependent baseline $b(s_t)$ from the return reduces variance without introducing bias, because $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = 0$ for any function that does not depend on $a$. The canonical choice is $b(s) = V^{\pi}(s)$, the state-value function, yielding the advantage:

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$$

The advantage measures how much better action $a$ is than the average action under the current policy in state $s$. This is precisely the "beat the benchmark" intuition: the value function is the benchmark [ref:U3YKNG9T].
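A toy simulation (our construction, not from the chapter) makes the variance reduction visible: a Bernoulli policy with noisy returns, scored with and without a baseline near the mean return. The expected gradient is the same in both cases; only the noise changes:

```python
import random
random.seed(0)

p, baseline = 0.5, 5.0  # pi(a=1) = p; baseline chosen near the mean return

def sample_grad(use_baseline):
    """One-sample score-function gradient estimate d/dp E[G]."""
    a = 1 if random.random() < p else 0
    score = 1 / p if a == 1 else -1 / (1 - p)  # d/dp log pi(a)
    G = 5.0 + (0.2 if a == 1 else 0.0) + random.gauss(0, 1)
    return score * (G - (baseline if use_baseline else 0.0))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

raw = [sample_grad(False) for _ in range(10_000)]
adj = [sample_grad(True) for _ in range(10_000)]
# Both estimators target the same gradient; the baseline cuts the
# variance by roughly the squared mean return.
```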

Actor-Critic Architecture

The actor-critic design, formalized by Konda and Tsitsiklis, separates two function approximators [ref:KGLQI4Y6]:

  • Actor $\pi_\theta(a|s)$: the policy network that selects actions
  • Critic $V_\phi(s)$ or $Q_\phi(s,a)$: the value network that evaluates states or state-action pairs

The critic provides a lower-variance learning signal than raw Monte Carlo returns by bootstrapping: using its own value estimates to construct TD targets rather than waiting for episode completion. The actor updates using the advantage estimated by the critic.
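A minimal sketch of one bootstrapped critic update, using a tabular value function for clarity (state names and the learning rate are illustrative):

```python
gamma, alpha_v = 0.99, 0.1
V = {"s0": 0.0, "s1": 0.0}  # tabular critic

def critic_step(s, r, s_next):
    """TD(0) update: move V(s) toward the bootstrapped target r + gamma*V(s')."""
    delta = r + gamma * V[s_next] - V[s]  # TD error, a one-step advantage estimate
    V[s] += alpha_v * delta
    return delta  # the actor would scale its score by this delta

delta = critic_step("s0", 1.0, "s1")
```

Note that the target uses the critic's own estimate of $V(s')$ rather than waiting for the episode's remaining rewards; that is the bootstrapping trade of bias for variance.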

Generalized Advantage Estimation (GAE)

The bias-variance trade-off between Monte Carlo returns (unbiased, high variance) and one-step TD errors (biased, low variance) is controlled by GAE, which blends multi-step TD errors with an exponential weighting parameter $\lambda$ [ref:78UVNWI2]:

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

When $\lambda = 1$, GAE recovers the Monte Carlo advantage (high variance, no bias). When $\lambda = 0$, it uses only the one-step TD error (low variance, biased by the critic's approximation error). Values around $\lambda = 0.95$ work well in practice.
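A compact implementation of the formula above (a sketch; the function signature is ours). Passing values of length `len(rewards) + 1` lets the last entry carry $V(s_T)$, zero for a terminal state:

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimation over one episode.
    values[-1] is V(s_T); use 0.0 if the episode terminates."""
    deltas = [rewards[t] + gamma * values[t + 1] - values[t]
              for t in range(len(rewards))]
    adv, A = [], 0.0
    for d in reversed(deltas):          # backward recursion:
        A = d + gamma * lam * A         # A_t = delta_t + gamma*lambda*A_{t+1}
        adv.append(A)
    return list(reversed(adv)), deltas

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]
adv_td, deltas = gae(rewards, values, 0.9, 0.0)   # lambda=0: one-step TD errors
adv_mc, _ = gae(rewards, values, 0.9, 1.0)        # lambda=1: G_t - V(s_t)
```

The two endpoints are easy to verify by hand: with $\lambda = 0$ the advantages equal the one-step deltas, and with $\lambda = 1$ they equal the Monte Carlo returns minus the value estimates.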

PPO: Stable Policy Updates

Proximal Policy Optimization constrains the policy update to prevent destructively large steps. It uses a clipped surrogate objective [ref:78UVNWI2]:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ is the probability ratio between the new and old policies, and $\epsilon$ (typically 0.1-0.2) controls the maximum allowed change. This prevents the catastrophic policy collapse that can occur when a large gradient step moves the policy to a region where the advantage estimates are no longer valid -- a particular risk with noisy financial rewards.
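The per-sample objective is small enough to write out directly; a sketch with illustrative names:

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO per-sample objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A positive advantage gains nothing from pushing the ratio past 1+eps:
assert abs(clipped_surrogate(2.0, 1.0) - 1.2) < 1e-9
# With a negative advantage at a small ratio, the min takes the clipped
# (more pessimistic) term, whose gradient through the ratio is zero --
# so there is no incentive to move the policy further:
assert abs(clipped_surrogate(0.5, -1.0) - (-0.8)) < 1e-9
```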

How It Works in Practice

Hambly, Xu, and Yang survey how policy gradient methods map to financial applications [ref:NUFAQNU6]: execution agents output continuous trade sizes, market-making agents output bid-ask quotes, and hedging agents output hedge ratios. In each case, the policy naturally parameterizes a continuous action distribution (often Gaussian), and the advantage function incorporates task-specific costs (implementation shortfall, inventory risk, hedging error).

The stochastic policy is a feature, not a bug, in financial settings: it enables exploration of the action space during training and can express uncertainty about optimal actions in states the agent has rarely visited.

Worked Example

One PPO update step for an execution agent:

  1. Collect a batch of execution episodes using the current policy $\pi_{\theta_\text{old}}$
  2. For each timestep, compute the TD residuals $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$
  3. Compute GAE advantages $\hat{A}_t$ using $\lambda = 0.95$
  4. For several epochs over the batch:
  5. Compute the probability ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$
  6. Compute the clipped surrogate loss $L^{\text{CLIP}}$
  7. Update actor parameters $\theta$ by gradient ascent on $L^{\text{CLIP}}$
  8. Update critic parameters $\phi$ by minimizing $(V_\phi(s_t) - G_t)^2$
  9. Set $\theta_\text{old} \leftarrow \theta$ and collect new episodes

The clipping in step 6 ensures that even when the advantage estimates are noisy (common with financial reward signals), the policy changes only gradually.

Practical Guidance

Start with PPO for financial RL. Its clipped objective provides stability under noisy rewards, and it is simpler to tune than trust-region methods.

Monitor the critic's accuracy. If the value function approximation is poor, the advantage estimates will be noisy and learning will stall. The critic loss (mean squared TD error) is a diagnostic.

Use GAE to control bias-variance. Start with $\lambda = 0.95$ and reduce if training is unstable (high variance) or increase toward 1.0 if the critic is well-calibrated.

Be cautious with off-policy methods in non-stationary environments. DDPG and SAC are more sample-efficient but assume the environment's dynamics are stationary during replay. Regime changes in financial markets violate this assumption.

Where It Fits in ML4T

Chapter 21 deploys PPO (execution), A2C (market making), and DDPG/SAC (hedging) -- all built on the policy gradient foundation covered here. Primer 01 covers the MDP formalism. Primer 03 covers value-based methods (DQN, Q-learning). Primer 05 covers reward engineering. This primer provides the bridge from value estimation to direct policy optimization, explaining why Chapter 21's applications require policy gradient methods rather than the value-based approaches introduced earlier.
