QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Abstract

Language agents have become a promising solution to complex interactive tasks. One of the key ingredients in the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, because annotations of intermediate interactions are lacking, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), which automatically generates annotations for open language agents by estimating Q-values in a stepwise manner. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With this stepwise guidance, we propose a Q-guided generation strategy that enables language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate, through qualitative analysis, that QLASS can lead to more effective decision making. We will release our code and data.
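The abstract does not spell out how stepwise Q-values are computed or used, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a small hand-built reasoning tree whose leaves carry outcome rewards, a Bellman-style backup (Q = reward + gamma * max child Q) to turn those outcomes into per-step values, and a greedy rule that picks the candidate step with the highest Q-value. The names Node, backup_q, and q_guided_step are hypothetical and are not taken from the QLASS code release.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Node:
    """One step in a hypothetical reasoning tree (illustrative only)."""
    action: str
    reward: float = 0.0  # outcome reward; nonzero only at leaves in this toy tree
    children: List["Node"] = field(default_factory=list)
    q: float = 0.0       # stepwise value, filled in by backup_q


def backup_q(node: Node, gamma: float = 1.0) -> float:
    """Assumed Bellman-style backup: Q = reward + gamma * max child Q."""
    if not node.children:
        node.q = node.reward
    else:
        node.q = node.reward + gamma * max(backup_q(c, gamma) for c in node.children)
    return node.q


def q_guided_step(candidates: Sequence[str], q_fn: Callable[[str], float]) -> str:
    """Greedy Q-guided choice: return the candidate action with the highest Q."""
    return max(candidates, key=q_fn)


# Toy tree: only the leaves have outcome rewards.
search = Node("search", children=[Node("click A", reward=1.0),
                                  Node("click B", reward=0.0)])
give_up = Node("give up", reward=0.2)
root = Node("start", children=[search, give_up])

backup_q(root)

# Use the backed-up values as a stand-in for a learned Q-function.
q_table = {child.action: child.q for child in root.children}
chosen = q_guided_step(list(q_table), q_table.__getitem__)
```

In this toy tree the intermediate step "search" receives no reward of its own, yet the backup assigns it the value of its best continuation, which is the kind of intermediate signal an outcome-only reward cannot provide. In practice the lookup table would be replaced by a learned Q-value model scoring candidate actions sampled from the agent.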
