Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
December 18, 2024
Authors: Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
cs.AI
Abstract
Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMM models or construct them from internet images or by human annotators. We propose to leverage off-the-shelf visual specialists, which were trained on annotated images originally not for image captioning, to enhance the image caption.
Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists are able to improve performance on visual understanding tasks, as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily combined into the pipeline. The complete source code of the DCE pipeline and datasets will be available at https://github.com/syp2ysy/DCE.
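The abstract describes the pipeline only at a high level. As a rough illustration of the idea, the sketch below shows how outputs from several visual specialists (fine-grained recognition, depth estimation, emotion recognition, relative location, and HOI detection) could be merged into a single descriptive caption. All names here (ObjectAnnotation, describe_object, compose_caption) are hypothetical and are not the actual DCE API; consult the released code at https://github.com/syp2ysy/DCE for the real implementation.

```python
# Hypothetical sketch of a DCE-style caption-enhancement step.
# Each attribute comes from a dedicated visual specialist rather than
# being distilled from another LMM; the merging here is simple
# templating, purely for illustration.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectAnnotation:
    category: str                   # fine-grained category from a recognition specialist
    depth: float                    # mean depth (meters) from a monocular depth estimator
    location: str                   # relative location, e.g. "left foreground"
    emotion: Optional[str] = None   # emotion label (people only), from an emotion specialist


def describe_object(obj: ObjectAnnotation) -> str:
    """Render one object's specialist outputs as a caption fragment."""
    fragment = f"a {obj.category} in the {obj.location}"
    if obj.emotion:
        fragment += f", appearing {obj.emotion}"
    fragment += ", close to the camera" if obj.depth < 2.0 else ", in the distance"
    return fragment


def compose_caption(objects: List[ObjectAnnotation], interactions: List[str]) -> str:
    """Combine per-object attributes and object relations into one descriptive caption."""
    clauses = [describe_object(o) for o in objects] + interactions
    return "The image shows " + "; ".join(clauses) + "."


if __name__ == "__main__":
    objects = [
        ObjectAnnotation("golden retriever", depth=1.5, location="left foreground"),
        ObjectAnnotation("young woman", depth=1.8, location="center", emotion="happy"),
    ]
    # An HOI specialist would supply relations such as this one.
    interactions = ["the woman is petting the dog"]
    print(compose_caption(objects, interactions))
```

The template-based merging above is only a stand-in; the point it illustrates is that every attribute appearing in the final caption is grounded in the output of a specialist model trained on annotated data.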