s1: Simple test-time scaling
January 31, 2025
Authors: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
cs.AI
Abstract
Test-time scaling is a promising new approach to language modeling that uses
extra test-time compute to improve performance. Recently, OpenAI's o1 model
showed this capability but did not publicly share its methodology, leading to
many replication efforts. We seek the simplest approach to achieve test-time
scaling and strong reasoning performance. First, we curate a small dataset, s1K,
of 1,000 questions paired with reasoning traces, relying on three criteria we
validate through ablations: difficulty, diversity, and quality. Second, we
develop budget forcing to control test-time compute by forcefully terminating
the model's thinking process or lengthening it by appending "Wait" multiple
times to the model's generation when it tries to end. This can lead the model
to double-check its answer, often fixing incorrect reasoning steps. After
supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and
equipping it with budget forcing, our model s1 exceeds o1-preview on
competition math questions by up to 27% (MATH and AIME24). Further, scaling s1
with budget forcing allows extrapolating beyond its performance without
test-time intervention: from 50% to 57% on AIME24. Our model, data, and code
are open-source at https://github.com/simplescaling/s1.
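
The budget forcing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `next_token` callback, the `END_OF_THINKING` delimiter, and the `toy_model` stand-in are all hypothetical names introduced here for illustration. The sketch only shows the two controls the abstract mentions: cutting thinking off at a token cap, and replacing an attempted stop with "Wait" a fixed number of times.

```python
from typing import Callable, List

END_OF_THINKING = "<end_think>"  # hypothetical delimiter; real models use their own token
WAIT = "Wait"


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_thinking_tokens: int,
    num_waits: int = 0,
) -> List[str]:
    """Return a thinking trace generated under a token budget.

    next_token: hypothetical stand-in for one decoding step of a language model.
    max_thinking_tokens: hard cap; thinking is forcefully terminated at this length.
    num_waits: number of times an attempted stop is replaced by "Wait".
    """
    trace: List[str] = []
    waits_left = num_waits
    while len(trace) < max_thinking_tokens:
        token = next_token(prompt + trace)
        if token == END_OF_THINKING:
            if waits_left == 0:
                break  # the model is allowed to stop thinking
            # Suppress the stop and nudge the model to keep reasoning.
            trace.append(WAIT)
            waits_left -= 1
        else:
            trace.append(token)
    # Whether the model stopped on its own or hit the cap, close the thinking phase.
    trace.append(END_OF_THINKING)
    return trace


def toy_model(context: List[str]) -> str:
    """Toy stand-in: tries to stop after two consecutive "step" tokens."""
    recent_steps = 0
    for tok in reversed(context):
        if tok != "step":
            break
        recent_steps += 1
    return END_OF_THINKING if recent_steps >= 2 else "step"


if __name__ == "__main__":
    print(budget_forced_thinking(toy_model, ["Q"], max_thinking_tokens=100))
    print(budget_forced_thinking(toy_model, ["Q"], max_thinking_tokens=100, num_waits=1))
    print(budget_forced_thinking(toy_model, ["Q"], max_thinking_tokens=1))
```

In this toy setup, raising `num_waits` lengthens the trace and lowering `max_thinking_tokens` shortens it, which is the sense in which the budget controls test-time compute. With a real model, `next_token` would wrap an actual decoding step and the delimiter would be whatever token that model uses to end its reasoning.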