What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
March 31, 2025
Authors: Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, Chen Ma
cs.AI
Abstract
As enthusiasm for scaling computation (data and parameters) in the
pretraining era has gradually diminished, test-time scaling (TTS), also referred to
as "test-time computing," has emerged as a prominent research focus. Recent
studies demonstrate that TTS can further elicit the problem-solving
capabilities of large language models (LLMs), enabling significant
breakthroughs not only in specialized reasoning tasks, such as mathematics and
coding, but also in general tasks like open-ended Q&A. However, despite the
explosion of recent efforts in this area, there remains an urgent need for a
comprehensive survey offering a systematic understanding. To fill this gap, we
propose a unified, multidimensional framework structured along four core
dimensions of TTS research: what to scale, how to scale, where to scale, and
how well to scale. Building upon this taxonomy, we conduct an extensive review
of methods, application scenarios, and assessment aspects, and present an
organized decomposition that highlights the unique functional roles of
individual techniques within the broader TTS landscape. From this analysis, we
distill the major developmental trajectories of TTS to date and offer hands-on
guidelines for practical deployment. Furthermore, we identify several open
challenges and offer insights into promising future directions, including
further scaling, clarifying the functional essence of techniques, generalizing
to more tasks, and more attribution analysis.