Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Abstract

Current multimodal large language models (MLLMs) struggle with fine-grained or precise visual understanding, even though they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either develop tool use or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that uses differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that connect multiple task-specific heads to the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that lift individual task performance beyond what single-task training achieves. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO.
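
The abstract does not spell out how task tokens, task-specific heads, and the training objective fit together, so the following is only an illustrative sketch and not the released TPO implementation. It assumes that the hidden state at each task token's position is passed to that task's head, and that the resulting head losses are added to the language modeling loss. The task names (`region`, `temporal`), the hidden size, the L1 loss, and the `weight` parameter are all hypothetical. The sketch uses plain Python lists for readability; a real implementation would rely on an autodiff framework so that gradients from the head losses flow back through the task tokens into the MLLM.

```python
import random

random.seed(0)
HIDDEN = 8  # hypothetical hidden size of the MLLM


def init_head(out_dim, in_dim):
    """Create a small random weight matrix standing in for a task head."""
    return [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def l1_loss(pred, target):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical task heads: one per fine-grained visual task.
TASK_HEADS = {
    "region": init_head(4, HIDDEN),    # e.g. a box (x1, y1, x2, y2)
    "temporal": init_head(2, HIDDEN),  # e.g. a (start, end) span
}


def combined_loss(hidden_states, task_positions, task_labels, lm_loss, weight=1.0):
    """Add per-task head losses, read at task-token positions, to the LM loss."""
    total = lm_loss
    per_task = {}
    for task, pos in task_positions.items():
        pred = linear(TASK_HEADS[task], hidden_states[pos])
        per_task[task] = l1_loss(pred, task_labels[task])
        total += weight * per_task[task]
    return total, per_task


# Usage: six token positions; positions 2 and 5 hold the assumed task tokens.
hidden_states = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(6)]
total, per_task = combined_loss(
    hidden_states,
    task_positions={"region": 2, "temporal": 5},
    task_labels={"region": [0.1, 0.2, 0.6, 0.8], "temporal": [0.25, 0.75]},
    lm_loss=1.5,  # placeholder value for the language modeling loss
)
print(sorted(per_task), total >= 1.5)
```

Summing the head losses into one objective is one simple way to picture multi-task co-training: every task's visual labels contribute a training signal to the same shared model.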
