

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

March 3, 2025
作者: Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
cs.AI

Abstract

Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors -- verification, backtracking, subgoal setting, and backward chaining -- that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor -- models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
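The RL setup described above hinges on Countdown being a verifiable task: a candidate solution can be scored automatically by checking that it combines the given numbers into the target value. Below is a minimal sketch of such a verifier in Python; the function name `countdown_reward`, the expression-string answer format, and the binary 0/1 reward are illustrative assumptions, not the paper's actual reward code.

```python
# Minimal sketch of a Countdown answer verifier, the kind of automatic check
# that makes the task "verifiable" for RL. Illustrative only.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Evaluate a parsed arithmetic expression and collect the leaf numbers it uses."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value, [node.value]
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left, left_nums = _eval(node.left)
        right, right_nums = _eval(node.right)
        return OPS[type(node.op)](left, right), left_nums + right_nums
    raise ValueError("disallowed expression")

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if `expr` reaches `target` using each given number at most once."""
    try:
        value, used = _eval(ast.parse(expr, mode="eval").body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    pool = list(numbers)
    for n in used:
        if n in pool:
            pool.remove(n)       # each provided number may be used at most once
        else:
            return 0.0           # used a number that was not available
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# Example: countdown_reward("(25 - 4) * 2 + 8", [25, 4, 2, 8], 50) -> 1.0
```

Because the reward depends only on this check and not on how the answer was reached, it leaves room for the behavioral analysis the abstract describes: traces can exhibit (or lack) verification, backtracking, subgoal setting, and backward chaining regardless of whether the final expression is correct.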
