Predictable GRPO
The paper introduces Predictable GRPO, a modification to Group Relative Policy Optimization (GRPO) that improves training stability and performance in reinforcement learning for language models by using a predictive baseline to reduce variance in advantage estimation.
Background
- Predictable GRPO is a technique for making reinforcement learning fine-tuning of large language models (LLMs) more stable, efficient, and reproducible. GRPO (Group Relative Policy Optimization) is a method for fine-tuning LLMs, best known for its use in training DeepSeek-R1, a model that drew wide attention for rivaling top AI systems such as OpenAI's o1. The core problem with GRPO is that it requires careful tuning of hyperparameters (settings) and can be finicky or unstable in practice. Predictable GRPO proposes a reformulated version that addresses these issues, making training more deterministic and predictable without sacrificing performance. This matters because it lowers the barrier for researchers and companies to apply advanced reinforcement learning to LLMs, potentially accelerating progress toward more capable AI systems.
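
For context, the group-relative advantage used by standard GRPO can be sketched as below. This illustrates only the baseline method, not the Predictable GRPO reformulation, whose details are not given in this summary. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same prompt.

    Standard GRPO scores each response relative to its group: subtract the
    group mean and divide by the group standard deviation. `eps` (an assumed
    value) guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four responses with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response gets the same reward carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Because the advantage depends on the spread of rewards inside each group, small groups or near-uniform rewards make the signal noisy or zero, which is one plausible source of the instability described above.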