REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

March 10, 2025
作者: Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Meanwhile, current MLLMs that use latent embeddings for visual task decoding generally show limited adaptability to multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions of visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as points, boxes, scribbles, and masks, and generates outputs composed of text and visual units such as boxes, keypoints, depth, and masks. The combination of different visual prompts and visual units yields a wide variety of task types, significantly expanding the applicability of REF-VLM. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at https://github.com/MacavityT/REF-VLM.

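As an illustration of how a triplet-structured response with symbolic delimiters could be consumed downstream, here is a minimal sketch in Python. The delimiter tokens (`<TRP>`, `<CONCEPT>`, `<TYPE>`, `<TARGET>`) and the `[UNIT_k]` placeholders are assumptions made for this example only; the actual delimiter vocabulary and output grammar of REF-VLM are defined in the paper and the repository linked above.

```python
import re
from dataclasses import dataclass

# Hypothetical delimiter tokens for illustration only; the real symbolic
# delimiters used by REF-VLM are defined in the paper and repository.
TRIPLET_PATTERN = re.compile(
    r"<TRP>\s*<CONCEPT>(?P<concept>.*?)</CONCEPT>\s*"
    r"<TYPE>(?P<dtype>.*?)</TYPE>\s*"
    r"<TARGET>(?P<target>.*?)</TARGET>\s*</TRP>",
    re.DOTALL,
)

@dataclass
class Triplet:
    concept: str        # what is being referred to, e.g. "dog on the left"
    decoding_type: str  # which visual decoder to invoke, e.g. "mask", "box", "keypoint"
    target: str         # placeholder(s) the decoder resolves into visual units

def parse_triplets(model_output: str) -> list[Triplet]:
    """Extract (concept, decoding type, target) triplets from a model response."""
    return [
        Triplet(m["concept"].strip(), m["dtype"].strip(), m["target"].strip())
        for m in TRIPLET_PATTERN.finditer(model_output)
    ]

if __name__ == "__main__":
    # Toy response in the assumed format.
    response = (
        "The image shows two dogs. "
        "<TRP><CONCEPT>dog on the left</CONCEPT><TYPE>mask</TYPE>"
        "<TARGET>[UNIT_0]</TARGET></TRP> and "
        "<TRP><CONCEPT>dog on the right</CONCEPT><TYPE>box</TYPE>"
        "<TARGET>[UNIT_1]</TARGET></TRP>."
    )
    for t in parse_triplets(response):
        print(t)
```

The point of this sketch is the design idea the abstract describes: by separating the concept, the decoding type, and the target with explicit delimiters, the model's text output becomes mechanically parsable and can be routed to the appropriate visual decoder.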
