Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
December 26, 2024
作者: Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang
cs.AI
Abstract
Current multimodal large language models (MLLMs) struggle with fine-grained
or precise understanding of visuals, although they provide comprehensive
perception and reasoning across a spectrum of vision applications. Recent
studies either develop tool use or unify specific visual tasks into the
autoregressive framework, often at the expense of overall multimodal
performance. To address
this issue and enhance MLLMs with visual tasks in a scalable fashion, we
propose Task Preference Optimization (TPO), a novel method that utilizes
differentiable task preferences derived from typical fine-grained visual tasks.
TPO introduces learnable task tokens that establish connections between
multiple task-specific heads and the MLLM. By leveraging rich visual labels
during training, TPO significantly enhances the MLLM's multimodal capabilities
and task-specific performance. Through multi-task co-training within TPO, we
observe synergistic benefits that elevate individual task performance beyond
what is achievable through single-task training methodologies. Our
instantiation of this approach with VideoChat and LLaVA demonstrates an overall
14.6% improvement in multimodal performance compared to baseline models.
Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across
various tasks, performing comparably to state-of-the-art supervised models. The
code will be released at https://github.com/OpenGVLab/TPO.
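As a rough illustration of the mechanism the abstract describes (learnable task tokens connecting task-specific heads to the MLLM), the sketch below shows one plausible way such a module could be wired up. It is not the authors' implementation; the class name `TaskPreferenceHeads`, the linear stand-in heads, and the attention-pooling step are all assumptions made for illustration.

```python
# Hypothetical sketch of the idea in the abstract: learnable task tokens that
# connect an MLLM's hidden states to differentiable task-specific heads.
# All names and design choices here are illustrative, not from the paper.
import torch
import torch.nn as nn

class TaskPreferenceHeads(nn.Module):
    def __init__(self, hidden_dim: int, num_tasks: int):
        super().__init__()
        # One learnable token per fine-grained vision task (e.g., grounding, tracking).
        self.task_tokens = nn.Parameter(torch.randn(num_tasks, hidden_dim) * 0.02)
        # One head per task; a linear layer stands in for real task decoders.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_tasks)]
        )

    def forward(self, mllm_hidden: torch.Tensor) -> list[torch.Tensor]:
        # mllm_hidden: (batch, seq_len, hidden_dim) hidden states from the MLLM.
        # Each task token attends over the sequence to pool task-relevant features.
        attn = torch.einsum("td,bsd->bts", self.task_tokens, mllm_hidden).softmax(dim=-1)
        pooled = torch.einsum("bts,bsd->btd", attn, mllm_hidden)
        # Task heads produce differentiable outputs trained on dense visual labels,
        # so task supervision can back-propagate into the MLLM via the task tokens.
        return [head(pooled[:, i]) for i, head in enumerate(self.heads)]
```

Under this reading, multi-task co-training would simply sum the per-head losses with the MLLM's language-modeling loss, which is one way the reported cross-task synergy could arise.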