Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm introduced by researchers at DeepSeek in 2024.^[1] It modifies the widely used Proximal Policy Optimization (PPO) algorithm by eliminating the critic (value) network and instead computing advantage estimates from the reward statistics of a group of actions sampled for the same state.

Method

Traditional PPO implementations use an actor-critic architecture with separate policy and value networks. GRPO removes the value network entirely, reducing computational overhead and memory requirements during training.^[1]

For a given state, GRPO samples multiple actions and computes advantages by comparing each action's reward to the group statistics. The advantage function is:

$A^{π_{θ_{t}}} (s, a_{j}) = \frac{r (s, a_{j}) - μ}{σ}$

where $μ$ and $σ$ are the mean and standard deviation of the rewards within the sampled group. Because each reward is normalized against its own group, the group mean serves as the baseline and no separate value function approximation is required.
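The normalization above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages` is hypothetical, and the use of the population standard deviation (`pstdev`) and the small constant `eps` that guards against division by zero are assumptions the text does not specify.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    if not rewards:
        raise ValueError("rewards must contain at least one sample")
    mu = fmean(rewards)      # group mean, the baseline
    sigma = pstdev(rewards)  # population std (assumption; a sample std is also plausible)
    # eps is an assumed safeguard for groups whose rewards are all identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four actions sampled for the same state, scored by a binary reward
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, a group in which every sample receives the same reward yields advantages of zero, so that group contributes no learning signal.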

The policy update uses a clipped objective similar to PPO:

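$ℒ_{GRPO} (θ) = \frac{1}{G} \sum_{i = 1}^{G} \min \left( ρ_{i} A_{i}, \; clip (ρ_{i}, 1 - ϵ, 1 + ϵ) A_{i} \right) - β D_{K L} (π_{θ} ‖ π_{ref})$

where $G$ is the group size, $ρ_{i} = π_{θ} (a_{i} | s) / π_{θ_{t}} (a_{i} | s)$ is the probability ratio between the current policy and the policy $π_{θ_{t}}$ that generated the samples, and $A_{i}$ is the group-relative advantage of the $i$-th sampled action. As in PPO, taking the minimum of the unclipped and clipped terms removes any incentive to push $ρ_{i}$ outside the interval $[1 - ϵ, 1 + ϵ]$. The KL divergence term, weighted by $β$, penalizes drift of the current policy away from a reference policy $π_{ref}$.^[1]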
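The clipped term for a single group can be sketched in plain Python as follows. This is an illustrative sketch rather than DeepSeek's implementation: the function name `clipped_surrogate`, the log-probability inputs, the default `clip_eps=0.2`, and the choice to pass the KL penalty in as a precomputed number `kl` are all assumptions. It uses the standard PPO form, which takes the minimum of the unclipped and clipped products.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2, beta=0.0, kl=0.0):
    """PPO-style clipped objective averaged over one group of G samples.

    logp_new / logp_old: log-probabilities of each sampled action under the
    current policy and under the policy that generated the samples.
    kl: a precomputed KL estimate (assumption; estimating it is out of scope here).
    """
    if not (len(logp_new) == len(logp_old) == len(advantages)):
        raise ValueError("all inputs must have the same length G")
    if not advantages:
        raise ValueError("the group must contain at least one sample")

    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio rho_i
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: the smaller of the unclipped and clipped products
        total += min(ratio * adv, clipped * adv)

    return total / len(advantages) - beta * kl


# With identical policies every ratio is 1, so the result is the mean advantage
objective = clipped_surrogate([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0])
```

Clipping only takes effect when it lowers the objective: a ratio of 1.5 with a positive advantage is capped at 1.2 for `clip_eps=0.2`, while the same ratio with a negative advantage is left unclipped.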

Applications

GRPO was first applied to train mathematical reasoning models, including the DeepSeekMath 7B model.^[1] The algorithm has since been used in training the DeepSeek-R1 series, which demonstrated improved performance on reasoning benchmarks.^[2]

Several machine learning frameworks have incorporated GRPO implementations, including Hugging Face's TRL (Transformer Reinforcement Learning) library and Unsloth's fine-tuning toolkit.

References

[1] Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; et al. (2024-02-05). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". arXiv:2402.03300 [cs.CL].
[2] DeepSeek-AI; et al. (2025-01-22). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv:2501.12948 [cs.CL].

This article "Group Relative Policy Optimization" is from Wikipedia. The list of its authors can be seen in its history and/or on the page Edithistory:Group Relative Policy Optimization. Articles copied from the Draft namespace on Wikipedia can be found in Wikipedia's Draft namespace rather than the main one.
