T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
April 7, 2025
Authors: Minki Kang, Jongwon Jeong, Jaewoong Cho
cs.AI
Abstract
Recent studies have demonstrated that test-time compute scaling effectively
improves the performance of small language models (sLMs). However, prior
research has mainly examined test-time compute scaling with an additional
larger model as a verifier, leaving self-verification by sLMs underexplored. In
this work, we investigate whether sLMs can reliably self-verify their outputs
under test-time scaling. We find that even with knowledge distillation from
larger verifiers, sLMs struggle with verification tasks requiring memorization,
such as numerical calculations and fact-checking. To address this limitation,
we propose Tool-integrated self-verification (T1), which delegates
memorization-heavy verification steps to external tools, such as a code
interpreter. Our theoretical analysis shows that tool integration reduces
memorization demands and improves test-time scaling performance. Experiments on
the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under
test-time scaling outperforms the significantly larger Llama-3.1 8B model.
Moreover, T1 generalizes effectively to both mathematical (MATH500) and
multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the
potential of tool integration to substantially improve the self-verification
abilities of sLMs.
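To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of tool-integrated verification: rather than having a small LM judge arithmetic from memory, the verifier delegates the calculation to a code interpreter, here stood in for by a restricted Python expression evaluator, and only compares results. All function names and the candidate format below are illustrative assumptions.

```python
# Illustrative sketch: delegate the memorization-heavy step (arithmetic)
# to an external tool and keep only the comparison on the model side.
# Names (eval_expr, tool_verify, the candidate dicts) are hypothetical.
import ast
import operator

# Supported operations for the stand-in "code interpreter" tool.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def eval_expr(expr: str) -> float:
    """Safely evaluate a simple arithmetic expression via the AST."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return _eval(ast.parse(expr, mode="eval"))

def tool_verify(claimed_answer: float, expression: str, tol: float = 1e-9) -> bool:
    """Accept a candidate only if the tool-computed value matches its claim."""
    return abs(eval_expr(expression) - claimed_answer) <= tol

# Among sampled candidate solutions, keep those whose arithmetic checks out.
candidates = [
    {"expression": "12 * 13 + 7", "answer": 163.0},  # correct claim
    {"expression": "12 * 13 + 7", "answer": 170.0},  # wrong claim, rejected
]
verified = [c for c in candidates if tool_verify(c["answer"], c["expression"])]
print(len(verified))  # → 1
```

The design point this illustrates is the one the abstract makes: the tool, not the small model, carries the exact computation, so verification accuracy no longer depends on the model having memorized arithmetic facts.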