Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
November 21, 2024
Authors: Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu
cs.AI
Abstract
Large Language Models (LLMs) demonstrate enhanced capabilities and
reliability by reasoning more, evolving from Chain-of-Thought prompting to
product-level solutions like OpenAI o1. Despite various efforts to improve LLM
reasoning, high-quality long-chain reasoning data and optimized training
pipelines remain underexplored in vision-language tasks. In this
paper, we present Insight-V, an early effort to 1) scalably produce long and
robust reasoning data for complex multi-modal tasks, and 2) build an effective
training pipeline to enhance the reasoning capabilities of multi-modal large
language models (MLLMs). Specifically, to create long and structured reasoning
data without human labor, we design a two-step pipeline with a progressive
strategy to generate sufficiently long and diverse reasoning paths and a
multi-granularity assessment method to ensure data quality. We observe that
directly supervising MLLMs with such long and complex reasoning data will not
yield ideal reasoning ability. To tackle this problem, we design a multi-agent
system consisting of a reasoning agent dedicated to performing long-chain
reasoning and a summary agent trained to judge and summarize reasoning results.
We further incorporate an iterative DPO algorithm to enhance the reasoning
agent's generation stability and quality. Based on the popular LLaVA-NeXT model
and our stronger base MLLM, we demonstrate significant performance gains across
challenging multi-modal benchmarks requiring visual reasoning. Benefiting from
our multi-agent system, Insight-V can also easily maintain or improve
performance on perception-focused multi-modal tasks.
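
At inference time, the multi-agent decomposition described in the abstract amounts to two sequential model calls: a reasoning agent that produces a long chain of thought, followed by a summary agent that judges that trace and emits the final answer. The sketch below is a minimal illustration under assumptions, not the paper's actual implementation: `generate` stands in for any chat-style MLLM inference call, and all prompt wording is an invented placeholder.

```python
def solve(image, question, reason_model, summary_model, generate):
    """Two-agent inference sketch: reason first, then judge and summarize.

    `reason_model` and `summary_model` stand for two separately trained
    MLLMs; `generate(model, image=..., prompt=...)` is a hypothetical
    chat-style inference function supplied by the caller.
    """
    # Step 1: the reasoning agent produces a long, structured
    # chain of thought conditioned on the image and the question.
    reasoning = generate(
        reason_model,
        image=image,
        prompt=f"Question: {question}\n"
               "Think through this step by step in detail.",
    )

    # Step 2: the summary agent sees the question *and* the reasoning
    # trace, judges whether the trace is trustworthy, and produces the
    # final answer from it.
    answer = generate(
        summary_model,
        image=image,
        prompt=f"Question: {question}\n"
               f"Candidate reasoning: {reasoning}\n"
               "Assess this reasoning and give the final answer.",
    )
    return reasoning, answer
```

Plausibly, it is this separation, with a summary agent trained to judge and, when necessary, discount a flawed reasoning trace, that lets the system preserve performance on perception-focused tasks while adding long-chain reasoning.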
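The reasoning agent is further refined with an iterative DPO algorithm. The paper's exact recipe is not reproduced here; the sketch below shows the standard DPO preference loss together with a generic outer loop, where `sample_pairs`, `logp`, `freeze`, and `step` are hypothetical caller-supplied helpers rather than the paper's API.

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO preference loss over summed token log-probs of a
    preferred (chosen) and dispreferred (rejected) reasoning trace."""
    margin = ((pi_chosen_logp - ref_chosen_logp)
              - (pi_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(beta * margin).mean()

def iterative_dpo(policy, rounds, sample_pairs, logp, freeze, step):
    # Hypothetical outer loop: each round samples fresh reasoning
    # traces from the current policy, ranks them into (chosen,
    # rejected) preference pairs, re-anchors a frozen reference
    # model, and optimizes the DPO objective.
    for _ in range(rounds):
        reference = freeze(policy)
        for x, y_chosen, y_rejected in sample_pairs(policy):
            loss = dpo_loss(logp(policy, x, y_chosen),
                            logp(policy, x, y_rejected),
                            logp(reference, x, y_chosen),
                            logp(reference, x, y_rejected))
            step(loss)  # backprop + optimizer update, supplied by caller
    return policy
```

Re-sampling preference pairs from the updated policy each round is what makes the procedure iterative: as the reasoning agent improves, the pairs it is trained on track its current failure modes, which is consistent with the abstract's goal of stabilizing generation quality.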