Show HN: We trained a 32B model to beat Opus 4 at credit card optimization
Researchers trained a 32B Qwen model with GRPO (Group Relative Policy Optimization), a reinforcement learning method, to optimize credit card rewards. On held-out tasks the model scored 0.51, ahead of Opus 4 at 0.41 and GPT-4o at 0.36. The training environment is open source under the Apache 2.0 license.
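For readers unfamiliar with GRPO, its core idea is to score each sampled answer relative to the other answers drawn for the same task, rather than against a learned value function. The sketch below shows only that group-relative normalization step. It is a minimal illustration, not the authors' training code: the function name, the example rewards, the use of population standard deviation, and the `eps` term are all assumptions made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds the scores of several answers sampled for ONE task.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to the same task.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Answers that score above their group's mean get a positive advantage and are reinforced, while below-average answers get a negative one. If every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.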