s1: Simple test-time scaling

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset, s1K, of 1,000 questions paired with reasoning traces, selected using three criteria that we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute, either by forcefully terminating the model's thinking process or by lengthening it: when the model tries to end its generation, we append "Wait" to it, possibly multiple times. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning of the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows it to extrapolate beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
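To make the budget forcing idea concrete, the following is a minimal Python sketch of the decoding loop the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `<end_think>` delimiter, and the `next_token` callable (a stand-in for one decoding step of a real model) are all hypothetical. Only the two behaviours come from the text above: cutting the thinking phase off once a budget is exhausted, and appending "Wait" when the model tries to stop early.

```python
from typing import Callable, List

# Hypothetical end-of-thinking delimiter; a real model defines its own.
END_THINK = "<end_think>"
WAIT = "Wait"


def think_with_budget(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_tokens: int,
    num_waits: int,
) -> List[str]:
    """Generate a thinking trace under a token budget (illustrative sketch).

    next_token: hypothetical stand-in for one decoding step of a model.
    max_tokens: upper bound on thinking tokens (forced termination).
    num_waits:  how many times an early stop is replaced with "Wait".
    """
    trace: List[str] = []
    waits_left = num_waits
    while len(trace) < max_tokens:
        tok = next_token(prompt + trace)
        if tok == END_THINK:
            if waits_left == 0:
                break  # the model may stop thinking now
            # Suppress the stop and nudge the model to keep reasoning.
            waits_left -= 1
            trace.append(WAIT)
            continue
        trace.append(tok)
    # Either the model stopped or the budget ran out: close the trace.
    trace.append(END_THINK)
    return trace


def toy_next_token(context: List[str]) -> str:
    """Toy stand-in for a model: tries to stop after two tokens,
    counted from the start or from the most recent "Wait"."""
    since_wait = 0
    for tok in reversed(context):
        if tok == WAIT:
            break
        since_wait += 1
    return END_THINK if since_wait >= 2 else "step"
```

With the toy model, `think_with_budget(toy_next_token, [], max_tokens=100, num_waits=1)` lengthens the trace by one "Wait" round, while `max_tokens=1` cuts it off after a single token. In a real setup, the step function would wrap an actual language model, and the answer would be decoded after the closing delimiter.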
