
Discretizing Reward Models

This paper introduces a method for discretizing reward models, aiming to improve the efficiency and interpretability of reinforcement learning from human feedback (RLHF) by converting continuous reward signals into discrete categories.

Background

Reinforcement learning from human feedback (RLHF) is a technique used to align large language models (such as ChatGPT) with user preferences. The reward models used in RLHF normally output a continuous score. The authors argue that mapping these scores to a small set of categories (e.g., "bad", "neutral", "good") makes the reward model more robust and less prone to over-optimization, where the model being trained learns to game the reward signal instead of actually improving. The work builds on known reward-hacking issues in RLHF and proposes a simple modification to address them.
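To make the idea concrete, here is a minimal sketch of mapping a continuous reward score to one of the three example categories. The thresholds, the function name `discretize_reward`, and the choice of fixed bin edges are assumptions made for illustration; the summary above does not say how the authors choose category boundaries.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels, chosen only for illustration.
# Scores below -0.5 are "bad", scores from -0.5 up to (not including) 0.5
# are "neutral", and scores of 0.5 or more are "good".
THRESHOLDS = [-0.5, 0.5]
LABELS = ["bad", "neutral", "good"]


def discretize_reward(score, thresholds=THRESHOLDS, labels=LABELS):
    """Map a continuous reward score to a discrete category label."""
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    # bisect_right counts how many thresholds are <= score,
    # which is the index of the bin the score falls into.
    return labels[bisect_right(thresholds, score)]


if __name__ == "__main__":
    for raw in (-2.3, 0.1, 0.6, 50.0):
        print(raw, "->", discretize_reward(raw))
```

In this sketch, a score of 0.6 and a score of 50.0 both land in "good", so pushing the raw score ever higher earns nothing extra. That is one intuition for why a discretized reward might be harder to over-optimize.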