MM-IFEngine: Towards Multimodal Instruction Following
April 10, 2025
作者: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
The Instruction Following (IF) ability measures how accurately Multi-modal Large Language Models (MLLMs) understand user instructions and execute them correctly. Existing multimodal instruction-following training data is scarce, existing benchmarks contain only simple, atomic instructions, and current evaluation strategies are imprecise for tasks that demand exact output constraints. To address this, we present MM-IFEngine, an effective pipeline for generating high-quality image-instruction pairs. The MM-IFEngine pipeline yields MM-IFInstruct-23k, a large-scale, diverse, and high-quality training dataset suitable for Supervised Fine-Tuning (SFT), which we extend into MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multimodal instruction-following benchmark that includes (1) both compose-level constraints on output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline combining rule-based assessment with a judge model. Our SFT and DPO experiments demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). The full data and evaluation code will be released at https://github.com/SYuan03/MM-IFEngine.
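The abstract describes an evaluation pipeline that combines rule-based assessment with a judge model. The sketch below illustrates that general idea in minimal form; the constraint schema, function names (`check_word_limit`, `judge_model_verdict`, `evaluate`), and fraction-of-constraints scoring are illustrative assumptions, not the paper's actual implementation.

```python
def check_word_limit(response: str, max_words: int) -> bool:
    """Rule-based check: the response must stay within a word budget."""
    return len(response.split()) <= max_words

def check_required_keyword(response: str, keyword: str) -> bool:
    """Rule-based check: the response must mention a required keyword."""
    return keyword.lower() in response.lower()

def judge_model_verdict(response: str, constraint: str) -> bool:
    """Placeholder for a judge-model call (e.g., an LLM prompted to answer
    yes/no on whether the response satisfies the constraint). A real
    implementation would query a model API; here it is stubbed out."""
    raise NotImplementedError("plug in a judge model here")

# Constraints that can be verified objectively are routed to cheap rules.
RULE_CHECKS = {
    "max_words": check_word_limit,
    "must_mention": check_required_keyword,
}

def evaluate(response: str, constraints: list[dict]) -> float:
    """Score a response as the fraction of satisfied constraints, routing
    each constraint to a rule check or, failing that, to the judge model."""
    passed = 0
    for c in constraints:
        checker = RULE_CHECKS.get(c["type"])
        if checker is not None:                 # objectively verifiable
            ok = checker(response, c["arg"])
        else:                                   # open-ended: ask the judge
            ok = judge_model_verdict(response, c["description"])
        passed += int(ok)
    return passed / len(constraints)

# Example: two rule-checkable constraints on one response.
score = evaluate(
    "The image shows a red bus near a station.",
    [{"type": "max_words", "arg": 20},
     {"type": "must_mention", "arg": "bus"}],
)
print(score)  # 1.0 -- both rule-based constraints are satisfied
```

The split reflects the trade-off the abstract points at: rule checks verify exact output constraints (length, format, required tokens) deterministically and cheaply, while a judge model handles perception-level or open-ended constraints that rules cannot decide.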