LLaVA-o1: Let Vision Language Models Reason Step-by-Step
November 15, 2024
Authors: Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan
cs.AI
Abstract
Large language models have demonstrated substantial advancements in reasoning
capabilities, particularly through inference-time scaling, as illustrated by
models such as OpenAI's o1. However, current Vision-Language Models (VLMs)
often struggle to perform systematic and structured reasoning, especially when
handling complex visual question-answering tasks. In this work, we introduce
LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning.
Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential
stages of summarization, visual interpretation, logical reasoning, and
conclusion generation. This structured approach enables LLaVA-o1 to achieve
marked improvements in precision on reasoning-intensive tasks. To accomplish
this, we compile the LLaVA-o1-100k dataset, integrating samples from various
visual question-answering sources and providing structured reasoning
annotations. We also propose an inference-time stage-level beam search
method, which enables effective inference-time scaling. Remarkably, with only
100k training samples and a simple yet effective inference-time scaling method,
LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of
multimodal reasoning benchmarks, but also surpasses the performance of larger
and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and
Llama-3.2-90B-Vision-Instruct.
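The abstract names a stage-level beam search but does not spell out its mechanics. The following is a minimal sketch of one plausible reading, not the authors' implementation: for each of the four stages, several candidate outputs are sampled, the best one is kept, and generation continues from it. The function name `stage_level_beam_search`, the `generate` and `score` callables, the stage identifiers, the tag format, and the `num_candidates` parameter are all assumptions introduced for illustration.

```python
from typing import Callable, Sequence

# Stage identifiers follow the four stages named in the abstract;
# the exact strings are an assumption made for this sketch.
STAGES = ("summary", "visual_interpretation", "reasoning", "conclusion")


def stage_level_beam_search(
    question: str,
    generate: Callable[[str, str], str],
    score: Callable[[str, str, str], float],
    stages: Sequence[str] = STAGES,
    num_candidates: int = 4,
) -> str:
    """Build a response stage by stage, keeping the best candidate per stage.

    generate(context, stage) -> one sampled candidate for that stage (hypothetical).
    score(context, stage, candidate) -> higher is better (hypothetical).
    """
    context = question
    for stage in stages:
        # Sample several candidates for the current stage only.
        candidates = [generate(context, stage) for _ in range(num_candidates)]
        # Keep the highest-scoring candidate and discard the rest.
        best = max(candidates, key=lambda text: score(context, stage, text))
        # Append the winner so later stages condition on it
        # (the tag format here is illustrative, not taken from the source).
        context = f"{context}\n<{stage}>{best}</{stage}>"
    return context
```

In practice `generate` would wrap a sampled call to the VLM and `score` would wrap whatever selection rule is used. Because selection happens once per stage rather than once per token or once per full response, the cost grows with `num_candidates` times the number of stages.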