
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

November 27, 2024
Authors: Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou
cs.AI

Abstract

Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understanding or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process from the critique process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process is theoretically grounded in a reinforcement learning framework in which the Critic offers natural-language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner's capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward (RBR) to enhance its critique capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner and constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach provides a promising solution to enhance the reliability of VLMs, improving their performance in real-world reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.
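
The Reasoner-Critic interaction described in the abstract can be summarized with a short sketch. The following Python snippet is only an illustrative reading of the abstract, not the paper's released code: the function names (`query_reasoner`, `query_critic`, `critic_v_loop`), the prompt layout, and the stopping condition are all assumptions introduced here for clarity.

```python
# Minimal sketch of the Reasoner-Critic loop, assuming two separately hosted VLMs.
# `query_reasoner` and `query_critic` are hypothetical placeholders, NOT a Critic-V API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningState:
    """The evolving text prompt that acts as the Reasoner's policy."""
    question: str
    image: bytes
    critiques: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # The prompt grows as critiques accumulate, steering later reasoning attempts.
        feedback = "\n".join(f"Critique {i + 1}: {c}" for i, c in enumerate(self.critiques))
        return f"{self.question}\n{feedback}".strip()


def query_reasoner(prompt: str, image: bytes) -> str:
    """Placeholder: call the Reasoner VLM to produce a reasoning path / answer."""
    raise NotImplementedError("wire up your Reasoner VLM inference here")


def query_critic(prompt: str, image: bytes, answer: str) -> str:
    """Placeholder: call the DPO-trained Critic VLM for a natural-language critique."""
    raise NotImplementedError("wire up your Critic model inference here")


def critic_v_loop(question: str, image: bytes, max_rounds: int = 3) -> str:
    """Iterate: Reasoner answers, Critic critiques, and the prompt (policy) is updated."""
    state = ReasoningState(question=question, image=image)
    answer = query_reasoner(state.to_prompt(), image)
    for _ in range(max_rounds):
        critique = query_critic(state.to_prompt(), image, answer)
        if not critique.strip():  # assumed stopping condition: Critic raises no objection
            break
        state.critiques.append(critique)  # natural-language feedback, not a scalar reward
        answer = query_reasoner(state.to_prompt(), image)  # re-answer under the refined policy
    return answer
```

In this reading, the critique text is appended to the prompt rather than converted into a numeric reward, which matches the abstract's claim that natural-language feedback replaces scalar rewards; how the loop actually terminates and how critiques are formatted are details the abstract does not specify.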
