
What is: Proximal Policy Optimization?

Source: Proximal Policy Optimization Algorithms
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization.

Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so $r(\theta_{old}) = 1$. TRPO maximizes a “surrogate” objective:

$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]$$

where CPI refers to conservative policy iteration. Without a constraint, maximization of $L^{\text{CPI}}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r_t(\theta)$ away from 1:

$$J^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $\epsilon$ is a hyperparameter, say, $\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{\text{CPI}}$. The second term, $\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1-\epsilon, 1+\epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.
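
A minimal sketch of this clipped objective, assuming PyTorch and per-timestep log-probabilities and advantage estimates as inputs (the function and tensor names below are illustrative, not from the paper):

```python
import torch

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate J^CLIP, averaged over a batch of timesteps."""
    # r_t(theta), computed from log-probabilities for numerical stability
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Elementwise minimum gives the pessimistic (lower-bound) objective
    return torch.min(unclipped, clipped).mean()
```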

One detail to note is that when we apply PPO to a network with shared parameters for the actor and critic, we typically add to the objective a value-estimation error term and an entropy bonus to encourage exploration.
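
A rough sketch of that combined loss, under the same PyTorch assumptions as above; the coefficients 0.5 and 0.01 are common choices in public implementations, not values from this page:

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(clip_objective, values, returns, entropy,
                   value_coef=0.5, entropy_coef=0.01):
    """Loss for a shared actor-critic network: negate the terms to be
    maximized (clipped surrogate, entropy) since optimizers minimize."""
    value_loss = F.mse_loss(values, returns)  # squared error on value estimates
    return -clip_objective + value_coef * value_loss - entropy_coef * entropy
```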