

Language Models can Self-Improve at State-Value Estimation for Better Search

March 4, 2025
Authors: Ethan Mendes, Alan Ritter
cs.AI

Abstract

Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameters) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
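
The abstract describes self-taught lookahead only at a high level: a self-supervised loop that uses state-transition dynamics to produce training signal for a value model, which in turn guides LLM-controlled search. The sketch below is one plausible, illustrative reading of that idea, not the paper's actual algorithm; all names (propose_actions, simulate_step, lookahead_targets, ValueTarget) are hypothetical.

```python
# Illustrative sketch only: self-supervised value targets via one-step lookahead.
# Assumption: the current value model scores the successors reached by simulating
# LLM-proposed actions, and the backed-up estimate becomes the training target
# for the next value-model iteration (no ground-truth rewards required).

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ValueTarget:
    state: str           # textual state description (e.g., a web page observation)
    target_value: float  # improved value estimate produced by lookahead


def lookahead_targets(
    states: List[str],
    propose_actions: Callable[[str], List[str]],  # LLM policy proposing candidate actions
    simulate_step: Callable[[str, str], str],     # state-transition dynamics: (state, action) -> next state
    value_fn: Callable[[str], float],             # current value model V_k
) -> List[ValueTarget]:
    """Expand each state one step and back up the best successor value."""
    data: List[ValueTarget] = []
    for s in states:
        successor_values = [
            value_fn(simulate_step(s, a)) for a in propose_actions(s)
        ]
        if successor_values:
            # Use the max over successors as an improved estimate of V(s).
            data.append(ValueTarget(state=s, target_value=max(successor_values)))
    return data


# The resulting (state, target_value) pairs would then be used to fine-tune the
# value model, and the improved model guides tree search over LLM actions.
```

This is a sketch under the stated assumptions; the paper's actual training objective, backup rule, and search procedure may differ.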
