Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Abstract

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scaling test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, all with higher inference efficiency. These findings show the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
