The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
February 21, 2025
Authors: Marthe Ballon, Andres Algaba, Vincent Ginis
cs.AI
Abstract
Large language models have demonstrated remarkable progress in mathematical
reasoning, leveraging chain-of-thought and test-time compute scaling. However,
many open questions remain regarding the interplay between reasoning token
usage and accuracy gains. In particular, when comparing models across
generations, it is unclear whether improved performance results from longer
reasoning chains or more efficient reasoning. We systematically analyze
chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH
benchmark, finding that o3-mini (m) achieves superior accuracy without
requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy
generally declines as reasoning chains grow across all models and compute
settings, even when controlling for difficulty of the questions. This accuracy
drop is significantly smaller in more proficient models, suggesting that new
generations of reasoning models use test-time compute more effectively.
Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain
over o3-mini (m), it does so by allocating substantially more reasoning tokens
across all problems, even the ones that o3-mini (m) can already solve. These
findings provide new insights into the relationship between model capability
and reasoning length, with implications for efficiency, scaling, and evaluation
methodologies.