TTRL: Test-Time Reinforcement Learning
April 22, 2025
Authors: Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou
cs.AI
Abstract
This paper investigates Reinforcement Learning (RL) on data without explicit
labels for reasoning tasks in Large Language Models (LLMs). The core challenge
of the problem is reward estimation during inference while not having access to
ground-truth information. While this setting appears elusive, we find that
common practices in Test-Time Scaling (TTS), such as majority voting, yield
surprisingly effective rewards suitable for driving RL training. In this work,
we introduce Test-Time Reinforcement Learning (TTRL), a novel method for
training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs
by utilizing the priors in the pre-trained models. Our experiments demonstrate
that TTRL consistently improves performance across a variety of tasks and
models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by
approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore,
although TTRL is supervised only by the Maj@N metric, it consistently surpasses
the upper limit of the initial model and approaches the performance of models
trained directly on test data with ground-truth labels. Our experimental
findings validate the general
effectiveness of TTRL across various tasks, and highlight TTRL's potential for
broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
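
For intuition, below is a minimal Python sketch of the majority-voting reward described in the abstract: sample several answers for each unlabeled test question, treat the majority answer as a pseudo-label, and reward each rollout by its agreement with that pseudo-label. This is an illustrative sketch rather than the authors' implementation; the function name `majority_vote_rewards` and the example data are hypothetical, and in the actual method these rewards would drive an RL policy update, which is omitted here.

```python
# Minimal sketch (not the authors' implementation) of majority-vote reward
# estimation on unlabeled test data: the most frequent sampled answer acts as
# a pseudo ground truth, and each rollout is rewarded by agreement with it.
from collections import Counter
from typing import List


def majority_vote_rewards(sampled_answers: List[str]) -> List[float]:
    """Return a binary reward for each sampled answer.

    The majority-voted answer serves as the pseudo-label; rollouts that
    match it receive reward 1.0, all others receive 0.0.
    """
    counts = Counter(sampled_answers)
    pseudo_label, _ = counts.most_common(1)[0]
    return [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]


if __name__ == "__main__":
    # Example: 8 rollouts for one math question; "42" wins the vote,
    # so the three dissenting rollouts receive zero reward.
    rollouts = ["42", "42", "41", "42", "42", "7", "42", "41"]
    print(majority_vote_rewards(rollouts))
    # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```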