Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
April 11, 2025
Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
cs.AI
Abstract
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which suffer from poor scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external assistance, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy that samples steps and estimates their value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably introduces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function that mitigates estimation inconsistencies. Combining these techniques, Genius takes an important first step toward self-improving LLM reasoning on general queries without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
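The stepwise foresight re-sampling idea lends itself to a short illustration. Below is a minimal sketch, assuming a generic `generate(context, n)` callable that returns `n` sampled continuations paired with their average token log-probabilities; the function names, the confidence-based step value, and the stopping check are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Callable, List, Tuple

def foresight_resample(
    prompt: str,
    generate: Callable[[str, int], List[Tuple[str, float]]],  # assumed interface
    num_candidates: int = 4,
    num_rollouts: int = 2,
    max_steps: int = 8,
) -> List[str]:
    """Build a response one step at a time, scoring each candidate step by
    the average confidence of simulated future completions (foresight)."""
    steps: List[str] = []
    for _ in range(max_steps):
        context = prompt + "".join(steps)
        # Explore: sample several candidate next steps.
        candidates = generate(context, num_candidates)
        scored = []
        for step_text, _step_logprob in candidates:
            # Foresight: roll out possible futures from this candidate and
            # use the rollouts' mean log-probability as the step value.
            rollouts = generate(context + step_text, num_rollouts)
            value = sum(lp for _, lp in rollouts) / max(len(rollouts), 1)
            scored.append((value, step_text))
        # Exploit: commit to the highest-value step.
        _best_value, best_step = max(scored, key=lambda x: x[0])
        steps.append(best_step)
        if best_step.strip().endswith("<eos>"):  # illustrative stop criterion
            break
    return steps
```

Because no reward model or gold answer is available, the only signal here is the model's own confidence over simulated futures, which is what makes the step-value estimates inherently noisy.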
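That noise is what the ACO loss is designed to absorb. The sketch below treats ACO as a calibrated variant of a DPO-style pairwise objective; the specific calibration rule (shrinking the penalty on the rejected response when its estimated advantage is close to the chosen one's) is an assumption for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def aco_loss(
    logp_chosen: torch.Tensor,       # policy log-prob of preferred responses
    logp_rejected: torch.Tensor,     # policy log-prob of dis-preferred responses
    ref_logp_chosen: torch.Tensor,   # same quantities under a frozen reference
    ref_logp_rejected: torch.Tensor,
    adv_chosen: torch.Tensor,        # noisy advantage estimates from foresight
    adv_rejected: torch.Tensor,
    beta: float = 0.1,
    tau: float = 1.0,
) -> torch.Tensor:
    # Implicit rewards relative to the reference model, as in DPO.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Calibration (assumed rule): a small advantage gap means the
    # chosen/rejected labels are uncertain, so soften the rejected penalty.
    gap = (adv_chosen - adv_rejected).clamp(min=0.0)
    weight = torch.sigmoid(gap / tau)  # in (0.5, 1): soft down-weighting
    logits = r_chosen - weight * r_rejected
    return -F.logsigmoid(logits).mean()
```

The design intent is that pairs with confident advantage gaps are optimized at full strength, while near-tie pairs contribute a dampened gradient, mitigating the estimation inconsistencies the abstract describes.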