Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Abstract

Multi-task visual grounding (MTVG) comprises two sub-tasks: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Existing representative approaches generally follow a pipeline with three core components: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the individual sub-tasks. Although this line of research achieves remarkable performance, it has two limitations: 1) the linguistic content is not fully injected into the entire visual backbone to support more effective visual feature extraction, and an extra cross-modal interaction module is needed; 2) the relationship between the REC and RES tasks is not effectively exploited to support collaborative prediction and more accurate output. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. As a result, PLVL needs no additional cross-modal fusion module while still fully introducing language guidance. Furthermore, we observe that the localization center for REC helps, to some extent, to identify the to-be-segmented object region for RES. Inspired by this observation, we design a multi-task head that makes collaborative predictions for the two sub-tasks. Extensive experiments on several benchmark datasets substantiate that PLVL clearly outperforms representative methods on both the REC and RES tasks.
https://github.com/jcwang0602/PLVL
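
The observation that the REC localization center can help identify the region to be segmented for RES can be illustrated with a toy sketch. The code below is not the PLVL multi-task head, whose design is not described in this abstract. It assumes a hypothetical Gaussian prior around a predicted box center that shifts per-pixel segmentation logits, and the names `center_prior`, `center_guided_mask`, `sigma`, and `weight` are illustrative choices rather than anything taken from the paper or its repository.

```python
import math


def center_prior(height, width, center, sigma):
    """Gaussian bump peaking at the (row, col) center; values lie in (0, 1]."""
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def center_guided_mask(seg_logits, center, sigma=1.5, weight=4.0):
    """Shift each pixel logit by the centered prior, then threshold at 0.

    ASSUMPTION: this additive reweighting is only an illustration of how a
    localization center could guide segmentation; it is not the authors' head.
    """
    height, width = len(seg_logits), len(seg_logits[0])
    prior = center_prior(height, width, center, sigma)
    return [
        [1 if seg_logits[y][x] + weight * (prior[y][x] - 0.5) > 0 else 0
         for x in range(width)]
        for y in range(height)
    ]


# Two equally confident blobs: the referred object and a distractor.
logits = [[-3.0] * 6 for _ in range(6)]
logits[1][1] = 0.5  # object near the (hypothetical) predicted box center
logits[4][4] = 0.5  # distractor far from that center

mask = center_guided_mask(logits, center=(1, 1))
kept = [(y, x) for y in range(6) for x in range(6) if mask[y][x]]
print(kept)  # prints [(1, 1)]
```

With `weight=0.0` the prior is switched off and both blobs survive the threshold, which is the ambiguity a shared localization cue is meant to reduce in this toy setting.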
