Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
December 26, 2024
作者: Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang
cs.AI
Abstract
Current multimodal large language models (MLLMs) struggle with fine-grained
or precise understanding of visuals, although they provide comprehensive
perception and reasoning across a spectrum of vision applications. Recent
studies either develop tool use or unify specific visual tasks into the
autoregressive framework, often at the expense of overall multimodal
performance. To address
this issue and enhance MLLMs with visual tasks in a scalable fashion, we
propose Task Preference Optimization (TPO), a novel method that utilizes
differentiable task preferences derived from typical fine-grained visual tasks.
TPO introduces learnable task tokens that establish connections between
multiple task-specific heads and the MLLM. By leveraging rich visual labels
during training, TPO significantly enhances the MLLM's multimodal capabilities
and task-specific performance. Through multi-task co-training within TPO, we
observe synergistic benefits that elevate individual task performance beyond
what is achievable through single-task training methodologies. Our
instantiation of this approach with VideoChat and LLaVA demonstrates an overall
14.6% improvement in multimodal performance compared to baseline models.
Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across
various tasks, performing comparably to state-of-the-art supervised models. The
code will be released at https://github.com/OpenGVLab/TPO.
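As a rough illustration of the mechanism the abstract describes (learnable task tokens connecting task-specific heads to the MLLM), the sketch below shows one plausible way such a module could be wired up. It is not the authors' implementation; the class name `TaskPreferenceHeads`, the linear stand-in heads, and the attention-pooling step are all assumptions made for illustration.

```python
# Hypothetical sketch of the idea in the abstract: learnable task tokens that
# connect an MLLM's hidden states to differentiable task-specific heads.
# All names and design choices here are illustrative, not from the paper.
import torch
import torch.nn as nn

class TaskPreferenceHeads(nn.Module):
    def __init__(self, hidden_dim: int, num_tasks: int):
        super().__init__()
        # One learnable token per fine-grained vision task (e.g., grounding, tracking).
        self.task_tokens = nn.Parameter(torch.randn(num_tasks, hidden_dim) * 0.02)
        # One head per task; a linear layer stands in for real task decoders.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_tasks)]
        )

    def forward(self, mllm_hidden: torch.Tensor) -> list[torch.Tensor]:
        # mllm_hidden: (batch, seq_len, hidden_dim) hidden states from the MLLM.
        # Each task token attends over the sequence to pool task-relevant features.
        attn = torch.einsum("td,bsd->bts", self.task_tokens, mllm_hidden).softmax(dim=-1)
        pooled = torch.einsum("bts,bsd->btd", attn, mllm_hidden)
        # Task heads produce differentiable outputs trained on dense visual labels,
        # so task supervision can back-propagate into the MLLM via the task tokens.
        return [head(pooled[:, i]) for i, head in enumerate(self.heads)]
```

Under this reading, multi-task co-training would simply sum the per-head losses with the MLLM's language-modeling loss, which is one way the reported cross-task synergy could arise.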