TTRL: Test-Time Reinforcement Learning
April 22, 2025
Authors: Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou
cs.AI
Abstract
This paper investigates Reinforcement Learning (RL) on data without explicit
labels for reasoning tasks in Large Language Models (LLMs). The core challenge
of the problem is reward estimation during inference while not having access to
ground-truth information. While this setting appears elusive, we find that
common practices in Test-Time Scaling (TTS), such as majority voting, yield
surprisingly effective rewards suitable for driving RL training. In this work,
we introduce Test-Time Reinforcement Learning (TTRL), a novel method for
training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs
by utilizing the priors in the pre-trained models. Our experiments demonstrate
that TTRL consistently improves performance across a variety of tasks and
models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by
approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore,
although TTRL is supervised only by the Maj@N metric, it consistently surpasses
the upper limit of the initial model and approaches the performance of models
trained directly on test data with ground-truth labels. Our experimental
findings validate the general
effectiveness of TTRL across various tasks, and highlight TTRL's potential for
broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
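
For intuition, below is a minimal Python sketch of the majority-voting reward described in the abstract: sample several answers for each unlabeled test question, treat the majority answer as a pseudo-label, and reward each rollout by its agreement with that pseudo-label. This is an illustrative sketch rather than the authors' implementation; the function name `majority_vote_rewards` and the example data are hypothetical, and in the actual method these rewards would drive an RL policy update, which is omitted here.

```python
# Minimal sketch (not the authors' implementation) of majority-vote reward
# estimation on unlabeled test data: the most frequent sampled answer acts as
# a pseudo ground truth, and each rollout is rewarded by agreement with it.
from collections import Counter
from typing import List


def majority_vote_rewards(sampled_answers: List[str]) -> List[float]:
    """Return a binary reward for each sampled answer.

    The majority-voted answer serves as the pseudo-label; rollouts that
    match it receive reward 1.0, all others receive 0.0.
    """
    counts = Counter(sampled_answers)
    pseudo_label, _ = counts.most_common(1)[0]
    return [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]


if __name__ == "__main__":
    # Example: 8 rollouts for one math question; "42" wins the vote,
    # so the three dissenting rollouts receive zero reward.
    rollouts = ["42", "42", "41", "42", "42", "7", "42", "41"]
    print(majority_vote_rewards(rollouts))
    # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```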