
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

February 10, 2025
作者: Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
cs.AI

Abstract

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 task, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while achieving higher inference efficiency. These findings show the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
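
The abstract names the ingredients of TTS (a policy model that proposes candidate solutions and a PRM that scores them) without fixing a concrete method. As a minimal sketch of one common instantiation, Best-of-N sampling reranked by a PRM, the Python below uses placeholder `policy_generate` and `prm_score` functions; both are hypothetical stand-ins for illustration, not the paper's actual interfaces.

```python
import random

# Toy stand-ins for the real components: the abstract mentions a policy model
# and a Process Reward Model (PRM) but specifies no interface, so these
# placeholders are assumptions made for this sketch.

def policy_generate(problem: str, n: int) -> list[str]:
    # Placeholder policy: in practice this would sample n chain-of-thought
    # solutions from an LLM (e.g., via temperature sampling).
    return [f"candidate solution {i} for: {problem}" for i in range(n)]

def prm_score(problem: str, solution: str) -> float:
    # Placeholder PRM: a real PRM scores each reasoning step and aggregates
    # the step rewards (e.g., min or product). Here we return a random score.
    return random.random()

def best_of_n(problem: str, budget: int) -> str:
    """Best-of-N test-time scaling: spend the inference budget on sampling
    `budget` candidates, then keep the one the PRM scores highest."""
    candidates = policy_generate(problem, n=budget)
    return max(candidates, key=lambda s: prm_score(problem, s))

if __name__ == "__main__":
    print(best_of_n("Compute 3 + 4 * 2.", budget=8))
```

A compute-optimal strategy in the paper's sense would go further than this fixed-N loop: it would select the TTS method and budget per policy model, PRM, and difficulty level rather than using one setting everywhere.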
