Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
December 18, 2024
Authors: Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
cs.AI
Abstract
Training Large Multimodality Models (LMMs) relies on descriptive image
captions that connect images and language. Existing methods either distill
captions from LMMs or construct them from internet images or by human
annotation. We propose leveraging off-the-shelf visual specialists, which
were originally trained on annotated images for tasks other than image
captioning, to enhance image captions.
Our approach, named DCE, explores objects' low-level and fine-grained
attributes (e.g., depth, emotion, and fine-grained categories) and object
relations (e.g., relative location and human-object interaction (HOI)), and
combines these attributes into the descriptive caption. Experiments
demonstrate that such visual specialists improve performance on visual
understanding tasks, as well as on reasoning that benefits from more
accurate visual understanding. We will release the source code and the
pipeline so that other visual specialists can easily be incorporated. The
complete source code of the DCE pipeline and datasets will be available at
https://github.com/syp2ysy/DCE.
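To make the idea concrete, here is a minimal sketch of a DCE-style enhancement step. All names, the attribute schema, and the merging logic are assumptions for illustration, not the paper's actual implementation: each visual specialist is assumed to contribute one attribute per detected object (depth, emotion, fine-grained category, bounding box), and the pipeline folds those attributes plus pairwise spatial relations into a single descriptive caption.

```python
# Hypothetical sketch of combining visual-specialist outputs into a caption.
# The specialist outputs and merging strategy here are illustrative assumptions.

def relative_location(box_a, box_b):
    """Coarse spatial relation between two boxes given as (x1, y1, x2, y2)."""
    center_a = (box_a[0] + box_a[2]) / 2
    center_b = (box_b[0] + box_b[2]) / 2
    return "left of" if center_a < center_b else "right of"

def enhance_caption(base_caption, objects):
    """Fold per-object specialist attributes and pairwise relations
    into one descriptive caption string."""
    parts = [base_caption]
    for obj in objects:
        attrs = [obj["category"]]               # fine-grained category specialist
        if "depth" in obj:
            attrs.append(f"in the {obj['depth']}")   # depth specialist
        if "emotion" in obj:
            attrs.append(f"looking {obj['emotion']}")  # emotion specialist
        parts.append("a " + ", ".join(attrs))
    # pairwise relative-location relations from detected boxes
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            rel = relative_location(objects[i]["box"], objects[j]["box"])
            parts.append(
                f"the {objects[i]['category']} is {rel} the {objects[j]['category']}"
            )
    return "; ".join(parts) + "."

objects = [
    {"category": "dog", "box": (10, 40, 60, 90),
     "depth": "foreground", "emotion": "happy"},
    {"category": "bench", "box": (70, 30, 120, 80),
     "depth": "background"},
]
caption = enhance_caption("A park scene", objects)
print(caption)
```

In this sketch the relation specialist is reduced to a left/right heuristic over box centers; the real pipeline plugs in dedicated models (e.g., an HOI detector) at that step, which is why the paper emphasizes that new specialists can be incorporated without changing the overall merging structure.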