Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

March 31, 2025
Authors: Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, Safoora Yousefi
cs.AI

Abstract

Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
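
To make the two evaluation protocols concrete, below is a minimal Python sketch, not taken from the paper's code: repeated independent calls aggregated by a perfect verifier (a best-of-n upper-bound estimate), and sequential calls where each failed attempt is fed back into the prompt as feedback. The `call_model` and `verifier` callables are hypothetical placeholders; the paper's actual prompts and feedback format may differ.

```python
from typing import Callable

# Hypothetical stand-ins, not part of the paper's released artifacts:
# `call_model` queries an LLM once, and `verifier` plays the role of the
# paper's "perfect verifier": an oracle that checks whether an answer is correct.
CallModel = Callable[[str], str]
Verifier = Callable[[str, str], bool]


def parallel_protocol(call_model: CallModel, verifier: Verifier,
                      problem: str, n: int) -> bool:
    """Best-of-n with a perfect verifier: make n independent calls and
    succeed if any single attempt verifies (an upper-bound estimate)."""
    attempts = [call_model(problem) for _ in range(n)]
    return any(verifier(problem, a) for a in attempts)


def sequential_protocol(call_model: CallModel, verifier: Verifier,
                        problem: str, max_rounds: int) -> bool:
    """Sequential calls with feedback: after each failed attempt, append
    the wrong answer to the prompt as critique and retry."""
    prompt = problem
    for _ in range(max_rounds):
        answer = call_model(prompt)
        if verifier(problem, answer):
            return True
        prompt = (f"{problem}\nYour previous answer was incorrect: {answer}\n"
                  f"Please try again.")
    return False
```

Averaging either protocol's success indicator over a benchmark yields the scaled accuracy curves the paper compares: the parallel variant approximates what a conventional model could reach with ideal answer selection, while the sequential variant probes how much strong feedback helps.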
