Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
November 21, 2024
Authors: Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu
cs.AI
Abstract
Large Language Models (LLMs) demonstrate enhanced capabilities and
reliability by reasoning more, evolving from Chain-of-Thought prompting to
product-level solutions like OpenAI o1. Despite various efforts to improve LLM
reasoning, high-quality long-chain reasoning data and optimized training
pipelines remain underexplored in vision-language tasks. In this
paper, we present Insight-V, an early effort to 1) scalably produce long and
robust reasoning data for complex multi-modal tasks, and 2) build an effective
training pipeline to enhance the reasoning capabilities of multi-modal large
language models (MLLMs). Specifically, to create long and structured reasoning
data without human labor, we design a two-step pipeline with a progressive
strategy to generate sufficiently long and diverse reasoning paths and a
multi-granularity assessment method to ensure data quality. We observe that
directly supervising MLLMs with such long and complex reasoning data will not
yield ideal reasoning ability. To tackle this problem, we design a multi-agent
system consisting of a reasoning agent dedicated to performing long-chain
reasoning and a summary agent trained to judge and summarize reasoning results.
We further incorporate an iterative DPO algorithm to enhance the reasoning
agent's generation stability and quality. Based on the popular LLaVA-NeXT model
and our stronger base MLLM, we demonstrate significant performance gains across
challenging multi-modal benchmarks requiring visual reasoning. Benefiting from
our multi-agent system, Insight-V can also easily maintain or improve
performance on perception-focused multi-modal tasks.
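
At inference time, the multi-agent decomposition described in the abstract amounts to two sequential model calls: a reasoning agent that produces a long chain of thought, followed by a summary agent that judges that trace and emits the final answer. The sketch below is a minimal illustration under assumptions, not the paper's actual implementation: `generate` stands in for any chat-style MLLM inference call, and all prompt wording is an invented placeholder.

```python
def solve(image, question, reason_model, summary_model, generate):
    """Two-agent inference sketch: reason first, then judge and summarize.

    `reason_model` and `summary_model` stand for two separately trained
    MLLMs; `generate(model, image=..., prompt=...)` is a hypothetical
    chat-style inference function supplied by the caller.
    """
    # Step 1: the reasoning agent produces a long, structured
    # chain of thought conditioned on the image and the question.
    reasoning = generate(
        reason_model,
        image=image,
        prompt=f"Question: {question}\n"
               "Think through this step by step in detail.",
    )

    # Step 2: the summary agent sees the question *and* the reasoning
    # trace, judges whether the trace is trustworthy, and produces the
    # final answer from it.
    answer = generate(
        summary_model,
        image=image,
        prompt=f"Question: {question}\n"
               f"Candidate reasoning: {reasoning}\n"
               "Assess this reasoning and give the final answer.",
    )
    return reasoning, answer
```

Plausibly, it is this separation, with a summary agent trained to judge and, when necessary, discount a flawed reasoning trace, that lets the system preserve performance on perception-focused tasks while adding long-chain reasoning.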
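The reasoning agent is further refined with an iterative DPO algorithm. The paper's exact recipe is not reproduced here; the sketch below shows the standard DPO preference loss together with a generic outer loop, where `sample_pairs`, `logp`, `freeze`, and `step` are hypothetical caller-supplied helpers rather than the paper's API.

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO preference loss over summed token log-probs of a
    preferred (chosen) and dispreferred (rejected) reasoning trace."""
    margin = ((pi_chosen_logp - ref_chosen_logp)
              - (pi_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(beta * margin).mean()

def iterative_dpo(policy, rounds, sample_pairs, logp, freeze, step):
    # Hypothetical outer loop: each round samples fresh reasoning
    # traces from the current policy, ranks them into (chosen,
    # rejected) preference pairs, re-anchors a frozen reference
    # model, and optimizes the DPO objective.
    for _ in range(rounds):
        reference = freeze(policy)
        for x, y_chosen, y_rejected in sample_pairs(policy):
            loss = dpo_loss(logp(policy, x, y_chosen),
                            logp(policy, x, y_rejected),
                            logp(reference, x, y_chosen),
                            logp(reference, x, y_rejected))
            step(loss)  # backprop + optimizer update, supplied by caller
    return policy
```

Re-sampling preference pairs from the updated policy each round is what makes the procedure iterative: as the reasoning agent improves, the pairs it is trained on track its current failure modes, which is consistent with the abstract's goal of stabilizing generation quality.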