Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
November 27, 2024
Authors: Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou
cs.AI
Abstract
Vision-language models (VLMs) have shown remarkable advancements in
multimodal reasoning tasks. However, they still often generate inaccurate or
irrelevant responses due to issues like hallucinated image understandings or
unrefined reasoning paths. To address these challenges, we introduce Critic-V,
a novel framework inspired by the Actor-Critic paradigm to boost the reasoning
capability of VLMs. This framework decouples the reasoning process from the critique
process by integrating two independent components: the Reasoner, which
generates reasoning paths based on visual and textual inputs, and the Critic,
which provides constructive critique to refine these paths. In this approach,
the Reasoner generates reasoning responses according to text prompts, which can
evolve iteratively as a policy based on feedback from the Critic. This
interaction process is theoretically grounded in a reinforcement learning
framework where the Critic offers natural language critiques instead of scalar
rewards, enabling more nuanced feedback to boost the Reasoner's capability on
complex reasoning tasks. The Critic model is trained using Direct Preference
Optimization (DPO), leveraging a preference dataset of critiques ranked by
Rule-based Reward (RBR) to enhance its critique capabilities. Evaluation results
show that the Critic-V framework significantly outperforms existing methods,
including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning
accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner
and constructive feedback from the preference-optimized Critic enables a more
reliable and context-sensitive multimodal reasoning process. Our approach
provides a promising solution to enhance the reliability of VLMs, improving
their performance in real-world reasoning-heavy multimodal applications such as
autonomous driving and embodied intelligence.
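The iterative Reasoner-Critic interaction described above can be pictured with a minimal sketch. The helpers `reasoner_generate` and `critic_generate` below are hypothetical placeholders for calls to a vision-language Reasoner and the trained Critic, and the prompt format is an assumption, not the authors' released implementation.

```python
# Minimal sketch of a Critic-V style refinement loop, under the assumptions above.
# The two generate functions are hypothetical stand-ins, not the paper's API.

def reasoner_generate(image, prompt):
    """Return a reasoning path (text) for the image under the current prompt."""
    raise NotImplementedError  # placeholder for a VLM call

def critic_generate(image, question, reasoning):
    """Return a natural-language critique of the current reasoning path."""
    raise NotImplementedError  # placeholder for the trained Critic model

def critic_v_loop(image, question, max_rounds=3):
    # The text prompt plays the role of the Reasoner's policy: it evolves as
    # critiques accumulate, instead of being updated by a scalar reward.
    prompt = question
    reasoning = reasoner_generate(image, prompt)
    for _ in range(max_rounds):
        critique = critic_generate(image, question, reasoning)
        prompt = (
            f"{question}\n\nPrevious attempt:\n{reasoning}\n\n"
            f"Critique:\n{critique}\n\nRevise your reasoning."
        )
        reasoning = reasoner_generate(image, prompt)
    return reasoning
```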
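Likewise, the Critic's DPO training relies on preference pairs of critiques ranked by a Rule-based Reward (RBR). The sketch below shows one way such pairs could be assembled; the toy scoring rule and the field names (`critiques`, `reference_errors`, `chosen`, `rejected`) are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch of building DPO preference pairs for the Critic from critiques ranked
# by a rule-based reward. Scoring heuristic and data layout are assumptions.

from itertools import combinations

def rule_based_reward(critique: str, reference_errors: list[str]) -> float:
    """Toy rule-based score: fraction of annotated errors the critique mentions."""
    if not reference_errors:
        return 0.0
    hits = sum(1 for err in reference_errors if err.lower() in critique.lower())
    return hits / len(reference_errors)

def build_dpo_pairs(sample):
    """sample: {'prompt': str, 'critiques': list[str], 'reference_errors': list[str]}"""
    scored = [(c, rule_based_reward(c, sample["reference_errors"]))
              for c in sample["critiques"]]
    pairs = []
    for (c1, s1), (c2, s2) in combinations(scored, 2):
        if s1 == s2:
            continue  # tied rule scores give no preference signal
        chosen, rejected = (c1, c2) if s1 > s2 else (c2, c1)
        pairs.append({"prompt": sample["prompt"],
                      "chosen": chosen,
                      "rejected": rejected})
    return pairs
```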