Z1: Efficient Test-time Scaling with Code
April 1, 2025
Authors: Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang
cs.AI
Abstract
Large Language Models (LLMs) can achieve stronger complex problem-solving
through test-time compute scaling, but this typically comes at the cost of
longer contexts and many more reasoning tokens. In this paper, we propose an
efficient test-time scaling method that trains LLMs on code-related reasoning
trajectories, enabling them to cut excess thinking tokens while maintaining
performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of
simple and complex coding problems paired with short and long solution
trajectories, respectively. Second, we present a novel Shifted Thinking Window
that mitigates overthinking overhead by removing context-delimiting tags
(e.g., <think>...</think>) and capping reasoning tokens. Trained on both long
and short trajectory data and equipped with the Shifted Thinking Window, our
model, Z1-7B, adjusts its reasoning depth to the complexity of each problem
and exhibits efficient test-time scaling across different reasoning tasks,
matching R1-Distill-Qwen-7B's performance with about 30% of its average
thinking tokens. Notably, although fine-tuned only on code trajectories,
Z1-7B generalizes to broader reasoning tasks (47.5% on GPQA Diamond). Our
analysis of how efficient reasoning is elicited also provides valuable
insights for future research.
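
To make the Shifted Thinking Window concrete, below is a minimal sketch of
one way such a decoding loop could look: generation runs without
<think>...</think> delimiters under a hard token cap, and if the cap is hit,
a hint is appended so the model concludes from its partial reasoning. This is
our reading of the abstract, not the authors' released code; the checkpoint
path, cap value, and hint wording are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Z1-7B-checkpoint"  # placeholder: substitute the actual checkpoint
THINKING_CAP = 4096            # assumed max reasoning tokens before the "shift"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def shifted_thinking_generate(problem: str) -> str:
    # No <think>...</think> delimiters: reasoning and answer share one context,
    # so the model can stop early on simple problems instead of filling a
    # fixed-size thinking block.
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": problem}],
        tokenize=False,
        add_generation_prompt=True,
    )
    enc = tok(prompt, return_tensors="pt")
    out = model.generate(**enc, max_new_tokens=THINKING_CAP)
    new_tokens = out.shape[1] - enc.input_ids.shape[1]
    text = tok.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True)

    # If the budget was exhausted before an answer emerged, append a hint
    # (wording assumed) that pushes the model to answer from its partial trace.
    if new_tokens >= THINKING_CAP:
        hint = ("\nThe thinking budget is exhausted; based on the reasoning "
                "so far, the final answer is:")
        enc2 = tok(prompt + text + hint, return_tensors="pt")
        out2 = model.generate(**enc2, max_new_tokens=256)
        text += hint + tok.decode(out2[0, enc2.input_ids.shape[1]:],
                                  skip_special_tokens=True)
    return text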