LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
January 10, 2025
作者: Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges across eight categories, ranging from complex visual perception to scientific reasoning, with over 4k reasoning steps in total, enabling robust evaluation of LMMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance than traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks, while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
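The abstract characterizes the step-level metric only at a high level: each reasoning step is judged on correctness and logical coherence rather than scoring the final answer alone. Below is a minimal, hypothetical Python sketch of what such a metric could look like; the names `step_score`, `correctness_fn`, and `coherence_fn`, the equal weighting, and the positional alignment of predicted to reference steps are all illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a step-level reasoning metric. The abstract only
# states that steps are judged on correctness and logical coherence; the
# placeholder scoring functions and 50/50 weighting are assumptions.
from typing import Callable, List


def step_score(
    predicted_steps: List[str],
    reference_steps: List[str],
    correctness_fn: Callable[[str, str], float],
    coherence_fn: Callable[[str, str], float],
) -> float:
    """Average per-step quality over aligned predicted/reference steps.

    correctness_fn(pred, ref) scores a step against its reference;
    coherence_fn(prev, pred) scores consistency with the preceding step.
    Both are assumed to return values in [0, 1].
    """
    scores = []
    for i, (pred, ref) in enumerate(zip(predicted_steps, reference_steps)):
        correctness = correctness_fn(pred, ref)
        # The first step has no predecessor, so treat it as fully coherent.
        coherence = coherence_fn(predicted_steps[i - 1], pred) if i > 0 else 1.0
        scores.append(0.5 * correctness + 0.5 * coherence)
    return sum(scores) / len(scores) if scores else 0.0


# Toy usage with a crude word-overlap placeholder for both judges.
def overlap(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wb), 1)


print(step_score(
    ["the image shows a cat", "so the animal is a cat"],
    ["the image shows a cat", "therefore the animal is a cat"],
    correctness_fn=overlap, coherence_fn=overlap,
))
```

Scoring every intermediate step, rather than only the end answer, is what lets such a metric separate a reasoning chain that is genuinely sound from one that reaches the right answer by accident.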
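Similarly, the multi-step curriculum learning recipe is described only as tasks "progressively organized to facilitate incremental skill acquisition." The sketch below shows one common way such staging can be realized; using the number of reasoning steps as the difficulty proxy and making stages cumulative are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of curriculum staging. The abstract only says training
# tasks are progressively organized; the step-count difficulty proxy and the
# cumulative stage slices below are assumptions for illustration.
from typing import Dict, Iterator, List


def curriculum_stages(
    samples: List[Dict], num_stages: int = 3
) -> Iterator[List[Dict]]:
    """Yield training subsets of increasing difficulty.

    Samples are ordered by their number of reasoning steps; each stage is a
    cumulative slice, so later stages keep reinforcing earlier skills while
    introducing harder multi-step problems.
    """
    ordered = sorted(samples, key=lambda s: len(s["steps"]))
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(1, num_stages + 1):
        yield ordered if stage == num_stages else ordered[: stage * stage_size]


# Usage: fine-tune on each stage in order (training loop itself omitted).
toy = [{"steps": ["s1"]},
       {"steps": ["s1", "s2", "s3"]},
       {"steps": ["s1", "s2"]}]
for i, stage_data in enumerate(curriculum_stages(toy), start=1):
    print(f"stage {i}: sample step counts {[len(s['steps']) for s in stage_data]}")
```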