GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

March 11, 2025
Authors: Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
cs.AI

Abstract

Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet, its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates compared to SoTA models with notably smaller model sizes.
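To make "rewards based solely on action outcomes" concrete, here is a minimal sketch of a verifiable outcome reward for the 24 points card game. It is an illustration only, not the authors' environment: the function name `outcome_reward`, the reward values (+1, 0, -1), and the text format of the action are all assumptions. The only rules taken as given are the game's standard ones: combine the four card values with +, -, *, / and parentheses to reach 24.

```python
import ast
from collections import Counter
from fractions import Fraction


def _eval(node):
    """Evaluate an AST restricted to integer literals and + - * /."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value)  # exact arithmetic, no float rounding
    if isinstance(node, ast.BinOp):
        left, right = _eval(node.left), _eval(node.right)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        if isinstance(node.op, ast.Mult):
            return left * right
        if isinstance(node.op, ast.Div):
            return left / right  # ZeroDivisionError if right == 0
    raise ValueError("unsupported expression")


def outcome_reward(cards, expression):
    """Hypothetical reward: +1 solved, 0 valid but not 24, -1 invalid action."""
    try:
        tree = ast.parse(expression, mode="eval")
        value = _eval(tree)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return -1.0
    # Every card must be used exactly once, and no other numbers may appear.
    used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
    if Counter(used) != Counter(cards):
        return -1.0
    return 1.0 if value == 24 else 0.0


print(outcome_reward([3, 3, 8, 8], "8 / (3 - 8 / 3)"))
print(outcome_reward([1, 2, 3, 4], "4 * 6"))
```

A reward of this kind inspects only the final action. It says nothing about whether the reasoning that preceded the action was relevant to the current state or complete, which is the gap the abstract's process guidance is meant to fill.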
