Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
November 21, 2024
作者: Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu
cs.AI
Abstract
Large Language Models (LLMs) demonstrate enhanced capabilities and
reliability by reasoning more, evolving from Chain-of-Thought prompting to
product-level solutions like OpenAI o1. Despite various efforts to improve LLM
reasoning, high-quality long-chain reasoning data and optimized training
pipelines remain underexplored in vision-language tasks. In this paper, we
present Insight-V, an early effort to 1) scalably produce long and robust
reasoning data for complex multi-modal tasks, and 2) build an effective
training pipeline that enhances the reasoning capabilities of multi-modal
large language models (MLLMs). Specifically, to create long and structured reasoning
data without human labor, we design a two-step pipeline with a progressive
strategy to generate sufficiently long and diverse reasoning paths and a
multi-granularity assessment method to ensure data quality. We observe that
directly supervising MLLMs with such long and complex reasoning data does not
yield the desired reasoning ability. To tackle this problem, we design a multi-agent
system consisting of a reasoning agent dedicated to performing long-chain
reasoning and a summary agent trained to judge and summarize reasoning results.
We further incorporate an iterative DPO algorithm to enhance the reasoning
agent's generation stability and quality. Based on the popular LLaVA-NeXT model
and our stronger base MLLM, we demonstrate significant performance gains across
challenging multi-modal benchmarks requiring visual reasoning. Benefiting from
our multi-agent system, Insight-V can also easily maintain or improve
performance on perception-focused multi-modal tasks.
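
To make the two-step data pipeline concrete, below is a minimal sketch: a progressive strategy requests increasingly long reasoning paths, and a multi-granularity assessment keeps only paths that pass both an answer-level and a step-level check. All function names, signatures, and thresholds here are illustrative assumptions, not the paper's actual interfaces.

```python
# Hypothetical sketch of the two-step data-generation pipeline from the abstract.
# `generate`, `answer_correct`, and `step_score` stand in for real model calls.
from typing import Callable, List

def progressive_generation(
    prompt: str,
    generate: Callable[[str, int], str],   # (prompt, max_steps) -> reasoning path
    max_rounds: int = 4,
    steps_per_round: int = 3,
) -> List[str]:
    """Collect reasoning paths of growing length for one multi-modal prompt."""
    paths = []
    for r in range(1, max_rounds + 1):
        # Each round asks the generator for a longer chain of reasoning steps.
        paths.append(generate(prompt, r * steps_per_round))
    return paths

def multi_granularity_filter(
    paths: List[str],
    answer_correct: Callable[[str], bool],  # coarse check: final answer matches
    step_score: Callable[[str], float],     # fine check: per-step quality in [0, 1]
    threshold: float = 0.7,
) -> List[str]:
    """Keep paths that pass both the answer-level and the step-level check."""
    return [p for p in paths if answer_correct(p) and step_score(p) >= threshold]

# Illustrative usage with stub callables standing in for model calls.
paths = progressive_generation(
    "Q: How many red cubes are in the image?",
    generate=lambda p, n: f"{n}-step reasoning for: {p}",
)
kept = multi_granularity_filter(
    paths,
    answer_correct=lambda p: True,   # stub: pretend the final answer matches
    step_score=lambda p: 0.8,        # stub: pretend step quality is high
)
```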
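The division of labor between the two agents can be summarized as follows; this is a hedged sketch in which `reasoning_agent` and `summary_agent` are placeholders for the two fine-tuned MLLMs, and the signatures are assumptions rather than the paper's API.

```python
# Hypothetical sketch of the multi-agent decomposition: a reasoning agent produces
# a long chain of thought, and a summary agent judges it and extracts the answer.
from typing import Callable

def insight_v_inference(
    image: bytes,
    question: str,
    reasoning_agent: Callable[[bytes, str], str],     # long-chain reasoning
    summary_agent: Callable[[bytes, str, str], str],  # judges and summarizes
) -> str:
    reasoning = reasoning_agent(image, question)
    # The summary agent sees both the question and the reasoning, so it can
    # selectively use or discard the chain when producing the final answer.
    return summary_agent(image, question, reasoning)
```

This decomposition is why the system can stay strong on perception-focused tasks: the summary agent is free to ignore a flawed chain rather than being forced to follow it.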
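Finally, a sketch of the iterative DPO step the abstract mentions for the reasoning agent. Only the objective follows the standard DPO formulation (Rafailov et al., 2023); the pair-sampling helper and the re-anchoring of the reference model are assumptions about how the iteration is realized.

```python
# Minimal sketch of iterative DPO for the reasoning agent. `sample_pairs` and
# `dpo_round` are hypothetical helpers standing in for the actual training loop.
import torch
import torch.nn.functional as F

def dpo_loss(
    logp_chosen: torch.Tensor,       # log pi_theta(y_w | x), shape [batch]
    logp_rejected: torch.Tensor,     # log pi_theta(y_l | x)
    ref_logp_chosen: torch.Tensor,   # log pi_ref(y_w | x)
    ref_logp_rejected: torch.Tensor, # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit reward margin between the chosen and rejected reasoning paths.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

def iterative_dpo(policy, reference, prompts, sample_pairs, dpo_round, n_iters=3):
    """Each iteration samples fresh preference pairs from the current policy,
    runs one DPO round, then re-anchors the reference on the updated policy."""
    for _ in range(n_iters):
        pairs = sample_pairs(policy, prompts)  # (chosen, rejected) reasoning paths
        policy = dpo_round(policy, reference, pairs, dpo_loss)
        reference = policy                     # re-anchor for the next iteration
    return policy
```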