Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
January 3, 2025
Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
cs.AI
Abstract
Recently, slow-thinking reasoning systems, built upon large language models
(LLMs), have garnered widespread attention by scaling the thinking time during
inference. There is also growing interest in adapting this capability to
multimodal large language models (MLLMs). Given that MLLMs handle more complex
data semantics across different modalities, it is intuitively more challenging
to implement multimodal slow-thinking systems.
To address this issue, in this paper, we explore a straightforward approach
by fine-tuning a capable MLLM with a small amount of textual long-form thought
data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning
with long thought). We find that these long-form reasoning processes, expressed
in natural language, can be effectively transferred to MLLMs. Moreover, it
seems that such textual reasoning data can be even more effective than visual
reasoning data in eliciting the slow-thinking capacities of MLLMs. While this
work is preliminary, it demonstrates that slow-thinking capacities are
fundamentally associated with the language model component, which can be
transferred across modalities or domains. This finding can be leveraged to
guide the development of more powerful slow-thinking reasoning systems. We
release our resources at https://github.com/RUCAIBox/Virgo.