Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

November 15, 2024
Authors: Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai
cs.AI

Abstract

Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study will inspire further advancements in MLLMs. Code, data, and models will be publicly released.
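As an illustrative sketch only (the abstract does not specify the exact form of the MPO objective), a "mixed" preference objective could combine a DPO-style preference term, computed against a frozen reference model, with a supervised generation term on the chosen response. All function and parameter names below (mpo_style_loss, beta, w_pref, w_gen) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mpo_style_loss(policy_chosen_logps: torch.Tensor,
                   policy_rejected_logps: torch.Tensor,
                   ref_chosen_logps: torch.Tensor,
                   ref_rejected_logps: torch.Tensor,
                   beta: float = 0.1,
                   w_pref: float = 1.0,
                   w_gen: float = 1.0) -> torch.Tensor:
    """Hypothetical mixed preference objective: DPO-style preference loss
    plus a standard generation (SFT) loss on the chosen response.

    Each *_logps tensor holds per-example summed log-probabilities of the
    chosen/rejected responses under the policy or the frozen reference model.
    """
    # Preference term: push the policy to rank the chosen response above
    # the rejected one, measured relative to the reference model (as in DPO).
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    pref_loss = -F.logsigmoid(margin).mean()

    # Generation term: negative log-likelihood of the chosen response,
    # which keeps the model anchored to supervised (CoT) behavior.
    gen_loss = -policy_chosen_logps.mean()

    return w_pref * pref_loss + w_gen * gen_loss
```

In practice the per-response log-probabilities would come from a forward pass of the policy and reference MLLMs over MMPR-style chosen/rejected response pairs; the weights trade off preference alignment against preserving supervised generation quality.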
