LLaVA-o1: Let Vision Language Models Reason Step-by-Step
November 15, 2024
Authors: Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan
cs.AI
Abstract
Large language models have demonstrated substantial advancements in reasoning
capabilities, particularly through inference-time scaling, as illustrated by
models such as OpenAI's o1. However, current Vision-Language Models (VLMs)
often struggle to perform systematic and structured reasoning, especially when
handling complex visual question-answering tasks. In this work, we introduce
LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning.
Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential
stages of summarization, visual interpretation, logical reasoning, and
conclusion generation. This structured approach enables LLaVA-o1 to achieve
marked improvements in precision on reasoning-intensive tasks. To accomplish
this, we compile the LLaVA-o1-100k dataset, integrating samples from various
visual question answering sources and providing structured reasoning
annotations. In addition, we propose an inference-time stage-level beam search
method, which enables effective inference-time scaling. Remarkably, with only
100k training samples and a simple yet effective inference-time scaling method,
LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of
multimodal reasoning benchmarks, but also surpasses the performance of larger
and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and
Llama-3.2-90B-Vision-Instruct.
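
The stage-by-stage search described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how candidates are produced or compared, so everything below is an assumption for illustration: the stage names are paraphrased from the four stages listed above, `generate` and `score` are hypothetical callables standing in for the model and for some selection rule, and only the single best candidate is kept per stage. It is not the authors' implementation.

```python
from typing import Callable, List

# Stage names paraphrased from the abstract; the exact labels are an assumption.
STAGES = ["summary", "visual_interpretation", "reasoning", "conclusion"]


def stage_level_search(
    question: str,
    generate: Callable[[str, str], str],      # hypothetical: (context, stage) -> candidate text
    score: Callable[[str, str, str], float],  # hypothetical: (context, stage, candidate) -> score
    num_candidates: int = 2,
) -> List[str]:
    """Pick one candidate per stage, then condition later stages on it."""
    context = question
    chosen: List[str] = []
    for stage in STAGES:
        # Sample several candidates for this stage only, not for the whole answer.
        candidates = [generate(context, stage) for _ in range(num_candidates)]
        # Keep the highest-scoring candidate (assumed selection rule).
        best = max(candidates, key=lambda c: score(context, stage, c))
        chosen.append(best)
        # The selected stage output becomes part of the context for the next stage.
        context = context + "\n" + best
    return chosen
```

Selecting at stage boundaries, rather than once per full answer or once per token, means a weak early stage can be discarded before it affects later stages. Raising `num_candidates` is the knob that spends more inference-time compute in this sketch.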