
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

December 18, 2024
Authors: Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
cs.AI

Abstract

Training Large Multimodal Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs or construct them from internet images or by human annotation. We propose to leverage off-the-shelf visual specialists, originally trained on annotated images for tasks other than image captioning, to enhance image captions. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into a descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks, as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can easily be incorporated into the pipeline. The complete source code of the DCE pipeline and datasets will be available at https://github.com/syp2ysy/DCE.

