Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
February 17, 2025
Authors: Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu
cs.AI
Abstract
The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors such as QwQ, Deepseek-R1 (R1), and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study finds that longer chains of thought (CoTs) from these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows that this phenomenon is closely related to models' self-revision capabilities: longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1, and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.
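The abstract names Shortest Majority Vote but does not spell out its scoring rule. The sketch below is a minimal illustration of the general idea, under the assumption that each parallel sample votes for its answer with weight inversely proportional to its CoT length; the function name, input format, and weighting are illustrative, not the paper's exact method.

```python
from collections import defaultdict

def shortest_majority_vote(samples):
    """Pick an answer from parallel samples, favoring answers
    backed by short chains of thought (CoTs).

    `samples` is a list of (answer, cot_length) pairs. Here each
    sample votes with weight 1 / cot_length, so answers that are
    both frequent and supported by short CoTs win. This weighting
    is an assumption for illustration only.
    """
    scores = defaultdict(float)
    for answer, cot_length in samples:
        scores[answer] += 1.0 / cot_length
    return max(scores, key=scores.get)

# Plain majority vote would pick "B" (3 votes vs. 2), but the two
# short CoTs behind "A" outweigh the three long ones behind "B".
samples = [("A", 100), ("A", 120), ("B", 900), ("B", 1000), ("B", 1100)]
print(shortest_majority_vote(samples))  # -> A
```

This captures the abstract's finding that, for the same question, correct solutions tend to be shorter than incorrect ones: length-aware weighting lets a concise minority answer override a verbose majority.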