

Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

April 11, 2025
Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
cs.AI

Abstract

Advancing LLM reasoning skills has attracted wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which suffer from poor scalability and high annotation costs. This motivates us to enhance LLM reasoning without external supervision. We introduce a generalizable and purely unsupervised self-training framework named Genius. Without external assistance, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy that samples candidate steps and estimates their value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably introduces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function that mitigates estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step toward self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
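To make the "foresight re-sampling" idea concrete, here is a minimal sketch of the general pattern the abstract describes: for each candidate step, simulate short future continuations (rollouts), score them, and keep the step with the best estimated value. Everything here is an assumption for illustration, not the paper's implementation: `score_sequence` is a hypothetical stand-in for an LLM-derived value signal, and steps are modeled as integers rather than text.

```python
import random

def score_sequence(tokens):
    # Hypothetical stand-in for an LLM-derived value signal (e.g., the
    # average log-probability of a simulated continuation). Here it simply
    # rewards tokens close to 5.
    return -sum(abs(t - 5) for t in tokens) / max(len(tokens), 1)

def foresight_resample(prefix, candidate_steps, num_rollouts=4, horizon=3, seed=0):
    """Pick the next step by foresight: simulate short futures after each
    candidate step, score them, and keep the step with the best average."""
    best_step, best_value = None, float("-inf")
    for step in candidate_steps:
        # Common random numbers: reuse the same rollouts for every candidate
        # so value differences come from the step, not sampling noise.
        rng = random.Random(seed)
        total = 0.0
        for _ in range(num_rollouts):
            rollout = [rng.randint(0, 9) for _ in range(horizon)]
            total += score_sequence(prefix + [step] + rollout)
        value = total / num_rollouts
        if value > best_value:
            best_step, best_value = step, value
    return best_step, best_value
```

Sharing the same rollouts across candidates is a standard variance-reduction choice; in an actual LLM setting the rollouts would instead be sampled continuations from the model itself.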
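The abstract only states that the ACO loss re-weights noisy, unsupervised advantage estimates to make optimization robust; the exact formulation is in the paper. The sketch below is therefore a purely illustrative guess at the shape of such a loss: a policy-gradient objective where a calibration weight shrinks the contribution of negative-advantage steps whose estimates are likely noisy. The weighting scheme and `beta` parameter are assumptions, not the paper's definition.

```python
import math

def aco_loss(log_probs, advantages, beta=1.0):
    """Illustrative advantage-calibrated policy-gradient loss: negative
    advantages are down-weighted when the policy assigns low probability to
    the step, since the advantage estimate there is likely unreliable."""
    loss = 0.0
    for lp, adv in zip(log_probs, advantages):
        p = math.exp(lp)  # policy probability of the chosen step
        # Calibration weight: keep positive advantages as-is; shrink
        # negative ones in proportion to the policy's lack of confidence.
        weight = 1.0 if adv >= 0 else math.exp(-beta * (1.0 - p))
        loss -= weight * adv * lp
    return loss / max(len(log_probs), 1)
```

Compared with a plain policy-gradient loss, this damps the penalty from low-confidence steps with negative estimated advantage, which is one plausible reading of "mitigating estimation inconsistencies" in an unsupervised setting.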
