Learning from Failures in Multi-Attempt Reinforcement Learning
March 4, 2025
Authors: Stephen Chung, Wenyu Du, Jie Fu
cs.AI
Abstract
Recent advancements in reinforcement learning (RL) for large language models
(LLMs), exemplified by DeepSeek R1, have shown that even a simple
question-answering task can substantially improve an LLM's reasoning
capabilities. In this work, we extend this approach by modifying the task into
a multi-attempt setting. Instead of generating a single response per question,
the model is given multiple attempts, with feedback provided after incorrect
responses. The multi-attempt task encourages the model to refine its previous
attempts and improve search efficiency. Experimental results show that even a
small LLM trained on a multi-attempt task achieves significantly higher
accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt
to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM
trained on a standard single-turn task exhibits only a marginal improvement,
increasing from 42.3% to 43.2% when given more attempts during evaluation. The
results indicate that, compared to the standard single-turn task, an LLM
trained on a multi-attempt task achieves slightly better performance on math
benchmarks while also learning to refine its responses more effectively based
on user feedback. Full code is available at
https://github.com/DualityRL/multi-attempt
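As a rough illustration of the multi-attempt setting described in the abstract, the sketch below shows a generic rollout loop in which the model is told after each incorrect answer to try again, and a reward is granted only once a correct answer is produced. This is a minimal, hypothetical sketch: the `generate` and `is_correct` callables and the feedback wording are assumptions for illustration, not the paper's actual API; the authors' training code is in the linked repository.

```python
# Minimal sketch of a multi-attempt rollout, assuming a chat-style LLM interface.
# `generate` and `is_correct` are hypothetical placeholders, not the paper's API.

from typing import Callable, Dict, List


def multi_attempt_rollout(
    question: str,
    answer: str,
    generate: Callable[[List[Dict[str, str]]], str],  # messages -> model response
    is_correct: Callable[[str, str], bool],            # (response, answer) -> verdict
    max_attempts: int = 2,
) -> Dict[str, object]:
    """Run up to `max_attempts` tries, appending feedback after each wrong answer."""
    messages = [{"role": "user", "content": question}]
    for attempt in range(1, max_attempts + 1):
        response = generate(messages)
        messages.append({"role": "assistant", "content": response})
        if is_correct(response, answer):
            # Correct answer: in RL training, a positive reward would be assigned here.
            return {"correct": True, "attempts": attempt, "messages": messages}
        # Incorrect answer: give feedback so the model can refine its previous attempt.
        messages.append({
            "role": "user",
            "content": "Your answer is incorrect. Please try again.",
        })
    # All attempts used without producing a correct answer.
    return {"correct": False, "attempts": max_attempts, "messages": messages}
```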