Learning from Failures in Multi-Attempt Reinforcement Learning
March 4, 2025
Authors: Stephen Chung, Wenyu Du, Jie Fu
cs.AI
Abstract
Recent advancements in reinforcement learning (RL) for large language models
(LLMs), exemplified by DeepSeek R1, have shown that even a simple
question-answering task can substantially improve an LLM's reasoning
capabilities. In this work, we extend this approach by modifying the task into
a multi-attempt setting. Instead of generating a single response per question,
the model is given multiple attempts, with feedback provided after incorrect
responses. The multi-attempt task encourages the model to refine its previous
attempts and improve search efficiency. Experimental results show that even a
small LLM trained on a multi-attempt task achieves significantly higher
accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt
to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM
trained on a standard single-turn task exhibits only a marginal improvement,
increasing from 42.3% to 43.2% when given more attempts during evaluation. The
results indicate that, compared to the standard single-turn task, an LLM
trained on a multi-attempt task achieves slightly better performance on math
benchmarks while also learning to refine its responses more effectively based
on user feedback. Full code is available at
https://github.com/DualityRL/multi-attempt
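As a rough illustration of the multi-attempt setting described in the abstract, the sketch below shows a generic rollout loop in which the model is told after each incorrect answer to try again, and a reward is granted only once a correct answer is produced. This is a minimal, hypothetical sketch: the `generate` and `is_correct` callables and the feedback wording are assumptions for illustration, not the paper's actual API; the authors' training code is in the linked repository.

```python
# Minimal sketch of a multi-attempt rollout, assuming a chat-style LLM interface.
# `generate` and `is_correct` are hypothetical placeholders, not the paper's API.

from typing import Callable, Dict, List


def multi_attempt_rollout(
    question: str,
    answer: str,
    generate: Callable[[List[Dict[str, str]]], str],  # messages -> model response
    is_correct: Callable[[str, str], bool],            # (response, answer) -> verdict
    max_attempts: int = 2,
) -> Dict[str, object]:
    """Run up to `max_attempts` tries, appending feedback after each wrong answer."""
    messages = [{"role": "user", "content": question}]
    for attempt in range(1, max_attempts + 1):
        response = generate(messages)
        messages.append({"role": "assistant", "content": response})
        if is_correct(response, answer):
            # Correct answer: in RL training, a positive reward would be assigned here.
            return {"correct": True, "attempts": attempt, "messages": messages}
        # Incorrect answer: give feedback so the model can refine its previous attempt.
        messages.append({
            "role": "user",
            "content": "Your answer is incorrect. Please try again.",
        })
    # All attempts used without producing a correct answer.
    return {"correct": False, "attempts": max_attempts, "messages": messages}
```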