Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

January 3, 2025
Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
cs.AI

Abstract

Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
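The recipe the abstract describes is, at its core, standard supervised fine-tuning in which the training targets happen to be long reasoning traces written in plain text. The sketch below illustrates that shape using TRL's SFTTrainer; the model id, toy example, and hyperparameters are placeholders of our own (a small text-only model stands in for the capable MLLM the paper actually tunes), so treat it as an outline of the setup under those assumptions rather than the authors' released training code.

```python
# Minimal sketch: supervised fine-tuning on text-only long-form thought data.
# All names below are illustrative assumptions, not the paper's configuration;
# the authors' actual resources are at https://github.com/RUCAIBox/Virgo.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy "long thought" sample: the assistant turn carries the full reasoning
# trace (including self-verification) before the final answer, o1-style.
train_data = Dataset.from_list([
    {
        "messages": [
            {"role": "user", "content": "What is 17 * 24?"},
            {
                "role": "assistant",
                "content": (
                    "Let me think step by step. 17 * 24 = 17 * 20 + 17 * 4 "
                    "= 340 + 68 = 408. Check: 408 / 24 = 17, so that holds. "
                    "The answer is 408."
                ),
            },
        ]
    }
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # stand-in; Virgo tunes a capable MLLM
    train_dataset=train_data,            # rendered via the model's chat template
    args=SFTConfig(
        output_dir="virgo-sft-sketch",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        learning_rate=1e-5,
    ),
)
trainer.train()
```

The notable design choice is what this sketch leaves out: no images or visual encoders appear anywhere in the training data, yet the paper's claim is that tuning the language-model component on such text-only traces is enough to elicit slow-thinking behavior on multimodal inputs.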
