B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

Abstract

In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement, in which models are trained on their own outputs, has emerged as a primary method for enhancing performance. However, the critical factors underlying these iterative self-improving methods remain poorly understood, such as the conditions under which self-improvement is effective and the bottlenecks in current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model's ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation). Using mathematical reasoning as a case study, we begin with a quantitative analysis that tracks the dynamics of exploration and exploitation, and find that a model's exploratory capabilities deteriorate rapidly over iterations and that the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the effectiveness of self-improvement based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model's exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.
