REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
March 10, 2025
Authors: Yan Tai, Luhao Zhu, Zhiqiang Chen, Yuan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot
capabilities across diverse vision-language tasks after training on mega-scale
datasets. However, dense prediction tasks, such as semantic segmentation and
keypoint detection, pose significant challenges for MLLMs when represented
solely as text outputs. Simultaneously, current MLLMs utilizing latent
embeddings for visual task decoding generally demonstrate limited adaptability
to both multi-task learning and multi-granularity scenarios. In this work, we
present REF-VLM, an end-to-end framework for unified training of various visual
decoding tasks. To address complex visual decoding scenarios, we introduce the
Triplet-Based Referring Paradigm (TRP), which explicitly decouples three
critical dimensions in visual decoding tasks through a triplet structure:
concepts, decoding types, and targets. TRP employs symbolic delimiters to
enforce structured representation learning, enhancing the parsability and
interpretability of model outputs. Additionally, we construct the Visual-Task
Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset
containing over 100 million multimodal dialogue samples across 25 task types.
Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts
such as point, box, scribble, and mask, and generates outputs composed of text
and visual units like box, keypoint, depth, and mask. The combination of
different visual prompts and visual units generates a wide variety of task
types, expanding the applicability of REF-VLM significantly. Both qualitative
and quantitative experiments demonstrate that our REF-VLM outperforms other
MLLMs across a variety of standard benchmarks. The code, dataset, and demo are
available at https://github.com/MacavityT/REF-VLM.
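To make the triplet structure described in the abstract more concrete, the following is a minimal Python sketch of how a TRP-style model response with symbolic delimiters could be parsed into (concept, decoding type, target) triplets. The delimiter tokens `<TRP>` and `<SEP>`, the `[UNIT_*]` placeholders, and the overall output format are illustrative assumptions rather than the tokens actually used by REF-VLM.

```python
# Hedged sketch: parse a hypothetical TRP-style structured response into triplets.
# The tokens <TRP>, <SEP>, and [UNIT_*] are assumed for illustration only; the
# paper's actual symbolic delimiters may differ.
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    concept: str        # the phrase naming what is referred to, e.g. "a dog"
    decoding_type: str  # which visual unit to decode, e.g. "mask", "box", "keypoint"
    targets: List[str]  # placeholder references that a task-specific decoder would consume


def parse_trp(output: str) -> List[Triplet]:
    """Extract (concept, decoding type, targets) triplets from a model response."""
    triplets = []
    for block in re.findall(r"<TRP>(.*?)</TRP>", output, flags=re.S):
        concept, dec_type, target_str = [part.strip() for part in block.split("<SEP>")]
        triplets.append(Triplet(concept, dec_type, target_str.split()))
    return triplets


if __name__ == "__main__":
    # Hypothetical response for "Segment the dog and locate its keypoints."
    response = (
        "The image shows a dog on the grass. "
        "<TRP>a dog<SEP>mask<SEP>[UNIT_0]</TRP> "
        "<TRP>a dog<SEP>keypoint<SEP>[UNIT_1] [UNIT_2]</TRP>"
    )
    for triplet in parse_trp(response):
        print(triplet)
```

The point of the sketch is the parsability claim in the abstract: because concepts, decoding types, and targets are separated by explicit delimiters, downstream visual decoders can be dispatched per triplet without relying on free-form text interpretation.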