
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

November 21, 2024
Authors: Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu
cs.AI

Abstract

Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline that enhances the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline: a progressive strategy generates sufficiently long and diverse reasoning paths, and a multi-granularity assessment method ensures data quality. We observe that directly supervising MLLMs with such long and complex reasoning data does not yield the desired reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize the reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Built on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi-modal benchmarks requiring visual reasoning. Benefiting from the multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.
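To make the multi-agent decomposition concrete, here is a minimal Python sketch of the two-agent inference flow the abstract describes: a reasoning agent emits a long chain of thought, and a summary agent judges that chain and produces the final answer. The `Agent` wrapper, the prompts, and the `generate` stub are illustrative assumptions for exposition, not the authors' released code.

```python
# Hedged sketch of Insight-V's reasoning-agent / summary-agent split.
# All names and prompts here are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Agent:
    """Stand-in for an MLLM checkpoint (e.g., a LLaVA-NeXT derivative)."""
    name: str
    system_prompt: str

    def generate(self, image: bytes, prompt: str) -> str:
        # Placeholder: wire this to an actual MLLM chat endpoint.
        raise NotImplementedError("connect to an MLLM backend")


# Agent 1: produces a long, structured reasoning chain for the visual query.
reasoning_agent = Agent(
    name="reasoner",
    system_prompt="Reason step by step about the image and question, "
                  "writing out a long, detailed chain of thought.",
)

# Agent 2: judges the chain and emits a concise final answer, so that
# perception-focused tasks are not hurt by unnecessarily long reasoning.
summary_agent = Agent(
    name="summarizer",
    system_prompt="Given a question and a candidate reasoning chain, assess "
                  "the chain and answer the question concisely.",
)


def answer(image: bytes, question: str) -> str:
    """Run the two-stage pipeline: reason at length, then judge and summarize."""
    chain = reasoning_agent.generate(image, question)
    return summary_agent.generate(
        image,
        f"Question: {question}\nReasoning: {chain}\nFinal answer:",
    )
```

The key design choice is that the summary agent is trained to selectively trust or discard the reasoning chain, which is consistent with the abstract's claim that perception-focused performance is maintained or improved.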
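The abstract also mentions an iterative DPO stage for the reasoning agent. As a point of reference, the snippet below implements the standard DPO objective on per-sequence log-probabilities in PyTorch; how Insight-V constructs preference pairs and schedules the iterations is not specified in the abstract, so the outer-loop note in the comments is one plausible reading rather than the paper's exact procedure.

```python
# Standard DPO loss on per-sequence log-probabilities (PyTorch).
# In an iterative scheme, each round's optimized policy typically becomes
# the sampler (and new reference) for the next round's preference pairs.
import torch
import torch.nn.functional as F


def dpo_loss(pi_chosen_logps: torch.Tensor,
             pi_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Encourage the policy to prefer chosen chains over rejected ones,
    measured relative to a frozen reference model."""
    chosen_margin = pi_chosen_logps - ref_chosen_logps
    rejected_margin = pi_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


# Toy usage with dummy per-sequence log-probabilities (batch of 4 pairs).
pi_c = torch.tensor([-12.0, -9.5, -11.0, -8.0])
pi_r = torch.tensor([-13.5, -9.0, -14.0, -10.0])
ref_c = torch.tensor([-12.5, -10.0, -11.5, -8.5])
ref_r = torch.tensor([-13.0, -9.5, -13.5, -9.5])
print(dpo_loss(pi_c, pi_r, ref_c, ref_r))
```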
