TTRL: Test-Time Reinforcement Learning
April 22, 2025
Authors: Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou
cs.AI
Abstract
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference without access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is supervised only by the Maj@N metric, it consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL