B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
December 23, 2024
Authors: Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He
cs.AI
Abstract
In the absence of extensive human-annotated data for complex reasoning tasks,
self-improvement -- where models are trained on their own outputs -- has
emerged as a primary method for enhancing performance. However, the critical
factors underlying the mechanism of these iterative self-improving methods
remain poorly understood, such as under what conditions self-improvement is
effective and what the bottlenecks are in the current iterations. In this
work, we identify and propose methods to monitor two pivotal factors in this
iterative process: (1) the model's ability to generate sufficiently diverse
responses (exploration); and (2) the effectiveness of external rewards in
distinguishing high-quality candidates from lower-quality ones (exploitation).
Using mathematical reasoning as a case study, we begin with a quantitative
analysis to track the dynamics of exploration and exploitation, discovering
that a model's exploratory capabilities rapidly deteriorate over iterations,
and the effectiveness of exploiting external rewards diminishes as well.
Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning
framework that autonomously adjusts configurations across iterations to Balance
exploration and exploitation, thereby optimizing the self-improving
effectiveness based on the current policy model and available rewards. Our
experiments on mathematical reasoning, coding, and commonsense reasoning
demonstrate that B-STaR not only enhances the model's exploratory capabilities
throughout training but also achieves a more effective balance between
exploration and exploitation, leading to superior performance.
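The abstract describes B-STaR as monitoring response diversity (exploration) and the reward's ability to separate good candidates from bad ones (exploitation), then adjusting configurations each iteration to balance the two. The sketch below illustrates one way such a monitoring and selection loop could look. The balance heuristic, the tuned knobs (sampling temperature and reward threshold), and all function names are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of the exploration/exploitation monitoring loop described
# in the abstract. The "balance" heuristic, the tuned knobs (temperature, reward
# threshold), and the helper callables are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Config:
    temperature: float       # controls exploration (response diversity)
    reward_threshold: float  # controls exploitation (how strictly rewards filter)


def balance_score(responses: List[str], rewards: List[float],
                  is_correct: List[bool], threshold: float) -> float:
    """Score how well one query's samples balance exploration and exploitation:
    responses should be diverse, and at least one correct response should
    survive the reward filter."""
    kept = [r for r, w, c in zip(responses, rewards, is_correct)
            if w >= threshold and c]
    diversity = len(set(responses)) / max(len(responses), 1)
    return diversity * (1.0 if kept else 0.0)


def select_config(queries, sample_fn: Callable, reward_fn: Callable,
                  check_fn: Callable, candidates: List[Config]) -> Config:
    """Pick the candidate configuration that maximizes the average balance
    score under the current policy model and reward."""
    best, best_score = candidates[0], float("-inf")
    for cfg in candidates:
        total = 0.0
        for q in queries:
            responses = sample_fn(q, cfg.temperature)        # sample from policy
            rewards = [reward_fn(q, r) for r in responses]   # external reward model
            correct = [check_fn(q, r) for r in responses]    # e.g. answer matching
            total += balance_score(responses, rewards, correct,
                                   cfg.reward_threshold)
        avg = total / max(len(queries), 1)
        if avg > best_score:
            best, best_score = cfg, avg
    return best
```

In a full self-improvement loop, the selected configuration would then drive response generation and reward filtering for the next training iteration, so the trade-off is re-tuned as the policy model changes.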