LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
January 10, 2025
Authors: Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Reasoning is a fundamental capability for solving complex multi-step
problems, particularly in visual contexts where sequential step-wise
understanding is essential. Existing approaches lack a comprehensive framework
for evaluating visual reasoning and do not emphasize step-wise problem-solving.
To this end, we propose a comprehensive framework for advancing step-by-step
visual reasoning in large multimodal models (LMMs) through three key
contributions. First, we introduce a visual reasoning benchmark specifically
designed to evaluate multi-step reasoning tasks. The benchmark presents a
diverse set of challenges with eight different categories ranging from complex
visual perception to scientific reasoning with over 4k reasoning steps in
total, enabling robust evaluation of LMMs' abilities to perform accurate and
interpretable visual reasoning across multiple steps. Second, we propose a
novel metric that assesses visual reasoning quality at the granularity of
individual steps, emphasizing both correctness and logical coherence. The
proposed metric offers deeper insights into reasoning performance compared to
traditional end-task accuracy metrics. Third, we present a new multimodal
visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum
learning approach, where tasks are progressively organized to facilitate
incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is
designed for multi-step reasoning and learns step-by-step through a structured
training paradigm. Extensive experiments show that our LlamaV-o1 outperforms
existing open-source models and performs favorably against closed-source
proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an
average score of 67.3 with an absolute gain of 3.8% across six benchmarks
while being 5 times faster during inference scaling. Our benchmark, model, and
code are publicly available.
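The abstract's second contribution, a metric that grades reasoning at the granularity of individual steps rather than only the final answer, can be illustrated with a minimal sketch. The weighting scheme, field names, and 0/1 scoring below are assumptions for illustration, not the paper's actual metric definition.

```python
def step_level_score(steps, w_correct=0.5, w_coherent=0.5):
    """Average a weighted combination of per-step correctness and coherence.

    steps: list of dicts like {"correct": 0 or 1, "coherent": 0 or 1}
    (hypothetical schema). Returns a float in [0, 1]; an empty trace
    scores 0.0.
    """
    if not steps:
        return 0.0
    total = sum(w_correct * s["correct"] + w_coherent * s["coherent"]
                for s in steps)
    return total / len(steps)

# A 3-step trace whose middle step reaches the right intermediate result
# but with an incoherent justification:
trace = [
    {"correct": 1, "coherent": 1},
    {"correct": 1, "coherent": 0},
    {"correct": 1, "coherent": 1},
]
# End-task accuracy alone would report a perfect score here; the
# step-level view penalizes the incoherent middle step.
print(round(step_level_score(trace), 4))  # 0.8333
```

The point of such a metric, as the abstract argues, is that two models with identical final-answer accuracy can differ sharply in how sound their intermediate reasoning is.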