Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
January 3, 2025
Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
cs.AI
Abstract
Recently, slow-thinking reasoning systems, built upon large language models
(LLMs), have garnered widespread attention by scaling the thinking time during
inference. There is also growing interest in adapting this capability to
multimodal large language models (MLLMs). Given that MLLMs handle more complex
data semantics across different modalities, it is intuitively more challenging
to implement multimodal slow-thinking systems.
To address this issue, in this paper, we explore a straightforward approach
by fine-tuning a capable MLLM with a small amount of textual long-form thought
data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning
with long thought). We find that these long-form reasoning processes, expressed
in natural language, can be effectively transferred to MLLMs. Moreover, it
seems that such textual reasoning data can be even more effective than visual
reasoning data in eliciting the slow-thinking capacities of MLLMs. While this
work is preliminary, it demonstrates that slow-thinking capacities are
fundamentally associated with the language model component, which can be
transferred across modalities or domains. This finding can be leveraged to
guide the development of more powerful slow-thinking reasoning systems. We
release our resources at https://github.com/RUCAIBox/Virgo.