B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
December 23, 2024
Authors: Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He
cs.AI
Abstract
In the absence of extensive human-annotated data for complex reasoning tasks,
self-improvement -- where models are trained on their own outputs -- has
emerged as a primary method for enhancing performance. However, the critical
factors underlying these iterative self-improving methods remain poorly
understood, such as the conditions under which self-improvement is effective
and the bottlenecks in current iterations. In this
work, we identify and propose methods to monitor two pivotal factors in this
iterative process: (1) the model's ability to generate sufficiently diverse
responses (exploration); and (2) the effectiveness of external rewards in
distinguishing high-quality candidates from lower-quality ones (exploitation).
Using mathematical reasoning as a case study, we begin with a quantitative
analysis to track the dynamics of exploration and exploitation, discovering
that a model's exploratory capabilities rapidly deteriorate over iterations,
and the effectiveness of exploiting external rewards diminishes as well.
Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning
framework that autonomously adjusts configurations across iterations to Balance
exploration and exploitation, thereby optimizing the self-improving
effectiveness based on the current policy model and available rewards. Our
experiments on mathematical reasoning, coding, and commonsense reasoning
demonstrate that B-STaR not only enhances the model's exploratory capabilities
throughout training but also achieves a more effective balance between
exploration and exploitation, leading to superior performance.
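
The monitoring-and-balancing idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Candidate` fields, the two proxy metrics, the product used to combine them, and the `sampler` callable are hypothetical and are not the metrics or the procedure defined in the paper. The sketch only shows the general shape of the loop: measure exploration and exploitation on sampled candidates, then pick the sampling temperature and reward threshold that score best for the current policy.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    text: str
    correct: bool   # whether the final answer matches the reference
    reward: float   # score from an external reward signal (assumed in [0, 1])


def exploration_score(samples):
    """Fraction of queries with at least one correct candidate (a pass@K-style proxy)."""
    return sum(any(c.correct for c in cands) for cands in samples) / len(samples)


def exploitation_score(samples, threshold):
    """Fraction of candidates kept by the reward threshold that are actually correct."""
    kept = [c for cands in samples for c in cands if c.reward >= threshold]
    if not kept:
        return 0.0
    return sum(c.correct for c in kept) / len(kept)


def pick_config(sampler, temperatures, thresholds):
    """Grid-search (temperature, threshold) by a hypothetical combined score.

    `sampler(temperature)` is an assumed callable that returns, for each query,
    a list of Candidate objects sampled from the current policy model.
    The product of the two proxies is an illustrative choice only.
    """
    best = None
    for temperature, threshold in product(temperatures, thresholds):
        samples = sampler(temperature)
        score = exploration_score(samples) * exploitation_score(samples, threshold)
        if best is None or score > best[0]:
            best = (score, temperature, threshold)
    return best
```

In an iterative self-improvement loop, a selection step like `pick_config` would be rerun at each iteration, since the abstract reports that both exploration and the usefulness of the reward signal shift as training proceeds.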