Progressive Language-guided Visual Learning for Multi-Task Visual Grounding
April 22, 2025
Authors: Jingchao Wang, Hong Wang, Wenlong Zhang, Kunhua Ji, Dingjiang Huang, Yefeng Zheng
cs.AI
Abstract
Multi-task visual grounding (MTVG) comprises two sub-tasks, i.e., Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Existing representative approaches generally follow a research pipeline that consists of three core procedures: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and separate prediction heads for the different sub-tasks. Despite achieving remarkable performance, this line of research has two limitations: 1) the linguistic content is not fully injected into the entire visual backbone to boost more effective visual feature extraction, and an extra cross-modal interaction module is needed; 2) the relationship between the REC and RES tasks is not effectively exploited for collaborative prediction and more accurate outputs. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature representation of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, our PLVL needs no additional cross-modal fusion module while fully introducing language guidance. Furthermore, we observe that the localization center predicted for REC helps, to some extent, to identify the object region to be segmented for RES. Inspired by this observation, we design a multi-task head to accomplish collaborative prediction for the two sub-tasks. Extensive experiments on several benchmark datasets comprehensively substantiate that our PLVL clearly outperforms representative methods on both the REC and RES tasks.
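To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: all module names, shapes, and the Gaussian-prior coupling between the REC box center and the RES mask logits are illustrative assumptions. It shows (a) a backbone block that cross-attends to language tokens so guidance is injected progressively at every stage rather than in a separate fusion module, and (b) a multi-task head in which the predicted box center biases the segmentation logits.

```python
# Illustrative sketch only; names, shapes, and the center-prior coupling are assumptions.
import torch
import torch.nn as nn

class LanguageGuidedBlock(nn.Module):
    """One visual backbone block that also attends to language tokens,
    so linguistic guidance is injected progressively at every stage."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        v = self.norm1(vis)
        vis = vis + self.self_attn(v, v, v)[0]
        # Language tokens act as keys/values: the visual stream is steered
        # toward expression-relevant regions without a separate fusion module.
        vis = vis + self.cross_attn(self.norm2(vis), lang, lang)[0]
        return vis + self.mlp(self.norm3(vis))

class MultiTaskHead(nn.Module):
    """Joint head: the REC box prediction supplies a center prior that
    modulates the RES mask logits (one plausible form of the coupling)."""
    def __init__(self, dim: int):
        super().__init__()
        self.box_head = nn.Linear(dim, 4)      # (cx, cy, w, h), normalized to [0, 1]
        self.mask_proj = nn.Conv2d(dim, 1, 1)  # per-pixel mask logits

    def forward(self, vis: torch.Tensor, hw: tuple):
        b, n, d = vis.shape
        h, w = hw  # n == h * w for ViT-style patch tokens
        box = self.box_head(vis.mean(dim=1)).sigmoid()        # REC output
        feat = vis.transpose(1, 2).reshape(b, d, h, w)
        # Gaussian prior centered at the predicted box center guides RES
        # (bandwidth 0.1 is an arbitrary illustrative choice).
        ys = torch.linspace(0, 1, h, device=vis.device).view(1, h, 1)
        xs = torch.linspace(0, 1, w, device=vis.device).view(1, 1, w)
        cx, cy = box[:, 0].view(b, 1, 1), box[:, 1].view(b, 1, 1)
        prior = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / 0.1)
        mask = self.mask_proj(feat).squeeze(1) + prior.log().clamp(min=-10.0)
        return box, mask                                      # RES output

# Usage with dummy tokens: vis = torch.randn(2, 196, 256), lang = torch.randn(2, 20, 256)
# vis = LanguageGuidedBlock(256)(vis, lang); box, mask = MultiTaskHead(256)(vis, (14, 14))
```

Adding the log of the center prior to the mask logits is just one way the REC localization could steer RES; the paper's actual head design may differ.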
https://github.com/jcwang0602/PLVL