Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
November 27, 2024
Authors: Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou
cs.AI
Abstract
Vision-language models (VLMs) have shown remarkable advancements in
multimodal reasoning tasks. However, they still often generate inaccurate or
irrelevant responses due to issues like hallucinated image understandings or
unrefined reasoning paths. To address these challenges, we introduce Critic-V,
a novel framework inspired by the Actor-Critic paradigm to boost the reasoning
capability of VLMs. This framework decouples the reasoning process and critic
process by integrating two independent components: the Reasoner, which
generates reasoning paths based on visual and textual inputs, and the Critic,
which provides constructive critique to refine these paths. In this approach,
the Reasoner generates reasoning responses according to text prompts, which can
evolve iteratively as a policy based on feedback from the Critic. This
interaction process is theoretically grounded in a reinforcement learning
framework where the Critic offers natural language critiques instead of scalar
rewards, enabling more nuanced feedback to boost the Reasoner's capability on
complex reasoning tasks. The Critic model is trained using Direct Preference
Optimization (DPO), leveraging a preference dataset of critiques ranked by
Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results
show that the Critic-V framework significantly outperforms existing methods,
including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning
accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner
and constructive feedback from the preference-optimized Critic enables a more
reliable and context-sensitive multimodal reasoning process. Our approach
provides a promising solution to enhance the reliability of VLMs, improving
their performance in real-world reasoning-heavy multimodal applications such as
autonomous driving and embodied intelligence.
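The Reasoner-Critic interaction described in the abstract can be pictured as a simple refinement loop in which the prompt itself serves as the evolving text-based policy. Below is a minimal Python sketch under that reading; `reasoner_generate`, `critic_review`, and the stop condition are hypothetical stand-ins for illustration, not the paper's actual interfaces.

```python
# Minimal sketch of the Reasoner-Critic loop described in the abstract.
# `reasoner_generate` and `critic_review` are hypothetical wrappers around
# two VLM endpoints; Critic-V's real implementation may differ.

def reasoner_generate(prompt: str, image) -> str:
    """Hypothetical: the Reasoner VLM produces a reasoning path
    for the given (image, prompt) pair."""
    raise NotImplementedError

def critic_review(question: str, image, reasoning: str) -> str:
    """Hypothetical: the Critic VLM returns a natural-language critique
    of the reasoning path (not a scalar reward)."""
    raise NotImplementedError

def critic_v_loop(question: str, image, max_rounds: int = 3) -> str:
    prompt = question
    reasoning = reasoner_generate(prompt, image)
    for _ in range(max_rounds):
        critique = critic_review(question, image, reasoning)
        if "no issues" in critique.lower():  # hypothetical stop signal
            break
        # Fold the critique back into the prompt: the prompt plays the
        # role of a text-based policy that evolves with Critic feedback.
        prompt = (
            f"{question}\n\nPrevious attempt:\n{reasoning}\n\n"
            f"Critique:\n{critique}\nRevise your reasoning accordingly."
        )
        reasoning = reasoner_generate(prompt, image)
    return reasoning
```

The key design point the abstract emphasizes is that the Critic returns free-form language rather than a scalar, so the feedback can name the specific step or image detail the Reasoner got wrong.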
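The Critic's training objective, as described, is standard Direct Preference Optimization over critique pairs ranked by a Rule-based Reward. A sketch of that loss is below, assuming per-sequence log-probabilities have already been computed for the chosen (higher-RBR) and rejected critiques; `beta` and the mean reduction are illustrative defaults, not values taken from the paper.

```python
# Sketch of the DPO objective for training the Critic on critique pairs
# ranked by a Rule-based Reward (RBR). This is the standard DPO loss;
# hyperparameters and batching here are illustrative.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Prefer the RBR-ranked 'chosen' critique over the 'rejected' one,
    regularized toward a frozen reference model via log-prob ratios."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```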