Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
February 17, 2025
Authors: Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu
cs.AI
Abstract
The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors such as QwQ, Deepseek-R1 (R1), and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study finds that longer chains of thought (CoTs) from these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows that this phenomenon is closely related to models' self-revision capabilities: longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1, and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.
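The abstract names Shortest Majority Vote but does not spell out its scoring rule. The sketch below is a minimal illustration of the general idea, under the assumption that each parallel sample votes for its answer with weight inversely proportional to its CoT length; the function name, input format, and weighting are illustrative, not the paper's exact method.

```python
from collections import defaultdict

def shortest_majority_vote(samples):
    """Pick an answer from parallel samples, favoring answers
    backed by short chains of thought (CoTs).

    `samples` is a list of (answer, cot_length) pairs. Here each
    sample votes with weight 1 / cot_length, so answers that are
    both frequent and supported by short CoTs win. This weighting
    is an assumption for illustration only.
    """
    scores = defaultdict(float)
    for answer, cot_length in samples:
        scores[answer] += 1.0 / cot_length
    return max(scores, key=scores.get)

# Plain majority vote would pick "B" (3 votes vs. 2), but the two
# short CoTs behind "A" outweigh the three long ones behind "B".
samples = [("A", 100), ("A", 120), ("B", 900), ("B", 1000), ("B", 1100)]
print(shortest_majority_vote(samples))  # -> A
```

This captures the abstract's finding that, for the same question, correct solutions tend to be shorter than incorrect ones: length-aware weighting lets a concise minority answer override a verbose majority.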