Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

April 22, 2025
作者: Jingchao Wang, Hong Wang, Wenlong Zhang, Kunhua Ji, Dingjiang Huang, Yefeng Zheng
cs.AI

Abstract

Multi-task visual grounding (MTVG) comprises two sub-tasks, i.e., Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Existing representative approaches generally follow a pipeline with three core procedures: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Despite achieving remarkable performance, this research line has two limitations: 1) the linguistic content is not fully injected into the visual backbone to enable more effective visual feature extraction, and an extra cross-modal interaction module is required; 2) the relationship between the REC and RES tasks is not effectively exploited for collaborative prediction and more accurate output. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, PLVL fully introduces language guidance without needing an additional cross-modal fusion module. Furthermore, we observe that the localization center predicted for REC can, to some extent, help identify the object region to be segmented for RES. Inspired by this observation, we design a multi-task head that accomplishes collaborative prediction for these two sub-tasks. Extensive experiments on several benchmark datasets comprehensively substantiate that PLVL clearly outperforms representative methods on both the REC and RES tasks. https://github.com/jcwang0602/PLVL
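
The two core ideas, injecting language guidance inside the visual backbone rather than in a separate fusion stage, and letting the REC localization center prime the RES mask prediction, can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the authors' implementation (see the linked repository for that): the class names, dimensions, the choice of cross-attention for the per-block injection, and the Gaussian center prior are all assumptions made for exposition.

```python
# Illustrative sketch only; all design choices here are assumptions,
# not the released PLVL code.
import torch
import torch.nn as nn

class LanguageGuidedBlock(nn.Module):
    """One visual backbone block with language injected via cross-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, vis, lang):
        v = self.n1(vis)
        vis = vis + self.self_attn(v, v, v)[0]
        # Progressive injection: visual tokens attend to language tokens
        # inside the backbone, instead of in a separate fusion module.
        vis = vis + self.cross_attn(self.n2(vis), lang, lang)[0]
        return vis + self.mlp(self.n3(vis))

class MultiTaskHead(nn.Module):
    """Predicts a REC box and reuses its center as a spatial prior for RES."""
    def __init__(self, dim=256, grid=16):
        super().__init__()
        self.grid = grid
        self.box_head = nn.Linear(dim, 4)         # (cx, cy, w, h) in [0, 1]
        self.mask_head = nn.Conv2d(dim + 1, 1, kernel_size=1)

    def forward(self, vis):                       # vis: (B, grid*grid, dim)
        B, N, C = vis.shape
        box = self.box_head(vis.mean(dim=1)).sigmoid()           # (B, 4)
        feat = vis.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        # Gaussian prior centered at the predicted REC box center: the
        # localization center hints at the region RES should segment.
        ys = torch.linspace(0, 1, self.grid, device=vis.device)
        xs = torch.linspace(0, 1, self.grid, device=vis.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        cx, cy = box[:, 0], box[:, 1]
        prior = torch.exp(-((gx - cx.view(B, 1, 1)) ** 2 +
                            (gy - cy.view(B, 1, 1)) ** 2) / 0.05)
        mask_logits = self.mask_head(torch.cat([feat, prior.unsqueeze(1)], 1))
        return box, mask_logits

# Usage: language guidance is applied at every stage, progressively,
# and one head produces both the REC box and the RES mask logits.
blocks = nn.ModuleList([LanguageGuidedBlock() for _ in range(4)])
head = MultiTaskHead()
vis = torch.randn(2, 16 * 16, 256)    # visual tokens from a patch embedding
lang = torch.randn(2, 12, 256)        # encoded referring-expression tokens
for blk in blocks:
    vis = blk(vis, lang)
box, mask_logits = head(vis)          # (2, 4) box and (2, 1, 16, 16) mask
```
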
