Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

December 26, 2024
Authors: Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang
cs.AI

Abstract

Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals, even though they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either develop tool use or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training methodologies. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO.
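The abstract's core mechanism, learnable task tokens whose hidden states feed differentiable task-specific heads trained jointly with the MLLM, can be illustrated with a minimal sketch. The sketch below is not the released TPO implementation; the class, head, and parameter names (TaskTokenAdapter, hidden_dim, the toy linear heads and MSE losses) are illustrative assumptions only.

```python
# Hypothetical sketch: learnable task tokens appended to the multimodal sequence,
# with task-specific heads reading the hidden states at those positions. Their
# differentiable losses would be added to the usual language-modeling loss.
import torch
import torch.nn as nn

class TaskTokenAdapter(nn.Module):
    def __init__(self, hidden_dim: int, task_heads: dict):
        super().__init__()
        self.task_names = list(task_heads.keys())
        # One learnable token embedding per fine-grained vision task.
        self.task_tokens = nn.Parameter(torch.randn(len(self.task_names), hidden_dim) * 0.02)
        self.task_heads = nn.ModuleDict(task_heads)

    def append_tokens(self, seq_embeds: torch.Tensor) -> torch.Tensor:
        # seq_embeds: (batch, seq_len, hidden_dim) multimodal input embeddings.
        tokens = self.task_tokens.unsqueeze(0).expand(seq_embeds.size(0), -1, -1)
        return torch.cat([seq_embeds, tokens], dim=1)

    def task_losses(self, hidden_states: torch.Tensor, targets: dict) -> dict:
        # hidden_states: (batch, seq_len + num_tasks, hidden_dim) from the MLLM.
        task_states = hidden_states[:, -len(self.task_names):, :]
        losses = {}
        for i, name in enumerate(self.task_names):
            if name in targets:  # supervise only tasks labeled in this batch
                pred = self.task_heads[name](task_states[:, i, :])
                losses[name] = nn.functional.mse_loss(pred, targets[name])
        return losses

# Toy usage with two invented heads (e.g., temporal span -> 2 values, box -> 4 values).
adapter = TaskTokenAdapter(
    hidden_dim=64,
    task_heads={"temporal": nn.Linear(64, 2), "tracking": nn.Linear(64, 4)},
)
embeds = torch.randn(2, 10, 64)            # stand-in for multimodal embeddings
hidden = adapter.append_tokens(embeds)     # would normally pass through the MLLM first
losses = adapter.task_losses(hidden, {"temporal": torch.rand(2, 2)})
total_task_loss = sum(losses.values())     # combined with the language-modeling loss
```

In this reading, multi-task co-training amounts to summing whichever task losses are labeled in a batch with the language-modeling objective, so the task preferences stay differentiable end to end.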
