LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Abstract

Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question-answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
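The abstract names a stage-level beam search over four sequential stages but does not spell out the procedure. The following is a minimal sketch of one way such a search could look, not the authors' implementation. The callables `generate` and `score`, the parameters `num_candidates` and `beam_width`, and the stage identifiers are all hypothetical names introduced for illustration; how candidates are actually produced and compared is an assumption here.

```python
from typing import Callable, List, Tuple

# Stage names follow the four stages listed in the abstract.
# The identifiers themselves are illustrative, not taken from the paper.
STAGES = ["summarization", "visual_interpretation", "logical_reasoning", "conclusion"]


def stage_level_beam_search(
    question: str,
    generate: Callable[[str, str], str],  # hypothetical: (context, stage) -> candidate text
    score: Callable[[str, str], float],   # hypothetical: (context, candidate) -> quality score
    num_candidates: int = 4,
    beam_width: int = 1,
) -> str:
    """Search over whole stages instead of single tokens.

    For each stage, every surviving context is extended with
    `num_candidates` sampled stage outputs, and only the `beam_width`
    highest-scoring extensions are carried into the next stage.
    """
    beams: List[str] = [question]
    for stage in STAGES:
        expanded: List[Tuple[float, str]] = []
        for context in beams:
            for _ in range(num_candidates):
                candidate = generate(context, stage)
                expanded.append((score(context, candidate), context + "\n" + candidate))
        # Keep the best extensions; sorting is stable, so ties keep sampling order.
        expanded.sort(key=lambda pair: pair[0], reverse=True)
        beams = [text for _, text in expanded[:beam_width]]
    return beams[0]
```

In this sketch, compute grows with `num_candidates` and `beam_width`, which is the sense in which such a search scales at inference time: more candidates per stage means more chances to discard a weak summary or caption before later stages build on it.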
