Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

November 15, 2024
Authors: Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai
cs.AI

Abstract

Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning ability, particularly Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly on multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and matching the performance of the 10x larger InternVL2-76B. We hope this study will inspire further advancements in MLLMs. Code, data, and models will be publicly released.
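
To make the idea of a "mixed" preference objective concrete, below is a minimal PyTorch sketch that combines a DPO-style preference loss with an SFT-style generation loss on the chosen response. The function name, the loss weights (`w_pref`, `w_gen`, `beta`), and the exact set of terms are illustrative assumptions rather than the paper's released implementation; the full MPO objective may include additional components beyond what is shown here.

```python
# Hedged sketch of a mixed preference objective: a DPO-style preference term
# plus a generation (negative log-likelihood) term on the chosen response.
# All names and weights below are assumptions for illustration only.
import torch
import torch.nn.functional as F

def mixed_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt), summed over tokens
    ref_chosen_logps: torch.Tensor,       # same quantities under a frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_token_logps: torch.Tensor,     # per-token log-probs of the chosen response
    beta: float = 0.1,
    w_pref: float = 0.8,
    w_gen: float = 0.2,
) -> torch.Tensor:
    # Preference term (DPO): prefer the chosen response over the rejected one,
    # measured relative to the reference model.
    margin = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    pref_loss = -F.logsigmoid(margin).mean()

    # Generation term (SFT): plain negative log-likelihood of the chosen
    # response, which keeps the policy anchored to fluent CoT outputs.
    gen_loss = -chosen_token_logps.mean()

    return w_pref * pref_loss + w_gen * gen_loss
```

In this reading, the preference term teaches the model which of two reasoning traces is better, while the generation term preserves the supervised-fine-tuning signal on the preferred trace; the weights trade off the two objectives.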
