Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
April 11, 2025
Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
cs.AI
Abstract
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which suffer from poor scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external assistance, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy that samples steps and estimates their value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably introduces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function that mitigates estimation inconsistencies. Combining these techniques, Genius takes an important first step toward self-improving LLM reasoning on general queries without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
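The stepwise foresight re-sampling idea lends itself to a short illustration. Below is a minimal sketch, assuming a generic `generate(context, n)` callable that returns `n` sampled continuations paired with their average token log-probabilities; the function names, the confidence-based step value, and the stopping check are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Callable, List, Tuple

def foresight_resample(
    prompt: str,
    generate: Callable[[str, int], List[Tuple[str, float]]],  # assumed interface
    num_candidates: int = 4,
    num_rollouts: int = 2,
    max_steps: int = 8,
) -> List[str]:
    """Build a response one step at a time, scoring each candidate step by
    the average confidence of simulated future completions (foresight)."""
    steps: List[str] = []
    for _ in range(max_steps):
        context = prompt + "".join(steps)
        # Explore: sample several candidate next steps.
        candidates = generate(context, num_candidates)
        scored = []
        for step_text, _step_logprob in candidates:
            # Foresight: roll out possible futures from this candidate and
            # use the rollouts' mean log-probability as the step value.
            rollouts = generate(context + step_text, num_rollouts)
            value = sum(lp for _, lp in rollouts) / max(len(rollouts), 1)
            scored.append((value, step_text))
        # Exploit: commit to the highest-value step.
        _best_value, best_step = max(scored, key=lambda x: x[0])
        steps.append(best_step)
        if best_step.strip().endswith("<eos>"):  # illustrative stop criterion
            break
    return steps
```

Because no reward model or gold answer is available, the only signal here is the model's own confidence over simulated futures, which is what makes the step-value estimates inherently noisy.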
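That noise is what the ACO loss is designed to absorb. The sketch below treats ACO as a calibrated variant of a DPO-style pairwise objective; the specific calibration rule (shrinking the penalty on the rejected response when its estimated advantage is close to the chosen one's) is an assumption for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def aco_loss(
    logp_chosen: torch.Tensor,       # policy log-prob of preferred responses
    logp_rejected: torch.Tensor,     # policy log-prob of dis-preferred responses
    ref_logp_chosen: torch.Tensor,   # same quantities under a frozen reference
    ref_logp_rejected: torch.Tensor,
    adv_chosen: torch.Tensor,        # noisy advantage estimates from foresight
    adv_rejected: torch.Tensor,
    beta: float = 0.1,
    tau: float = 1.0,
) -> torch.Tensor:
    # Implicit rewards relative to the reference model, as in DPO.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Calibration (assumed rule): a small advantage gap means the
    # chosen/rejected labels are uncertain, so soften the rejected penalty.
    gap = (adv_chosen - adv_rejected).clamp(min=0.0)
    weight = torch.sigmoid(gap / tau)  # in (0.5, 1): soft down-weighting
    logits = r_chosen - weight * r_rejected
    return -F.logsigmoid(logits).mean()
```

The design intent is that pairs with confident advantage gaps are optimized at full strength, while near-tie pairs contribute a dampened gradient, mitigating the estimation inconsistencies the abstract describes.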