Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

Abstract

Recently, slow-thinking reasoning systems built upon large language models (LLMs) have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Because MLLMs handle more complex data semantics across different modalities, implementing multimodal slow-thinking systems is intuitively more challenging. To address this issue, we explore a straightforward approach: fine-tuning a capable MLLM with a small amount of textual long-form thought data, which yields a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, such textual reasoning data appears to be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component and can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
