

Learning from Failures in Multi-Attempt Reinforcement Learning

March 4, 2025
Authors: Stephen Chung, Wenyu Du, Jie Fu
cs.AI

Abstract

Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt
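The multi-attempt setting described above can be pictured as a simple dialogue loop: the model answers, a verifier checks the answer, and on failure a feedback message is appended before the next attempt. Below is a minimal sketch of such a loop in Python; the callables `generate_fn` and `is_correct_fn` and the +1/-1 reward values are illustrative assumptions for this sketch, not the authors' released implementation (see the repository linked above for the full code).

```python
# Minimal sketch of a multi-attempt episode: the model gets up to
# `max_attempts` tries per question, with feedback appended to the
# dialogue after each incorrect response.
# `generate_fn` and `is_correct_fn` are placeholder callables for a
# policy-model generation call and an answer verifier.

def multi_attempt_episode(question, generate_fn, is_correct_fn, max_attempts=2):
    """Run one multi-attempt episode and return (reward, dialogue)."""
    dialogue = [{"role": "user", "content": question}]
    for _ in range(max_attempts):
        response = generate_fn(dialogue)                  # model's attempt
        dialogue.append({"role": "assistant", "content": response})
        if is_correct_fn(response):                       # verifier checks the answer
            return 1.0, dialogue                          # assumed positive reward on success
        dialogue.append({                                 # feedback, then allow another attempt
            "role": "user",
            "content": "Your answer is incorrect. Please try again.",
        })
    return -1.0, dialogue                                 # assumed negative reward when all attempts fail
```

During RL training, the trajectory collected by such a loop (the full dialogue plus the terminal reward) would be what the policy is optimized on, which is what encourages the model to use the feedback to refine its earlier attempts.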

