Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
December 18, 2024
Authors: Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
cs.AI
Abstract
Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMM models or construct them from internet images or by human annotators. We propose to leverage off-the-shelf visual specialists, which were trained on annotated images originally not for image captioning, to enhance the image caption.
Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists are able to improve performance on visual understanding tasks, as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily combined into the pipeline. The complete source code of the DCE pipeline and datasets will be available at https://github.com/syp2ysy/DCE.
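The abstract describes the pipeline only at a high level. As a rough illustration of the idea, the sketch below shows how outputs from several visual specialists (fine-grained recognition, depth estimation, emotion recognition, relative location, and HOI detection) could be merged into a single descriptive caption. All names here (ObjectAnnotation, describe_object, compose_caption) are hypothetical and are not the actual DCE API; consult the released code at https://github.com/syp2ysy/DCE for the real implementation.

```python
# Hypothetical sketch of a DCE-style caption-enhancement step.
# Each attribute comes from a dedicated visual specialist rather than
# being distilled from another LMM; the merging here is simple
# templating, purely for illustration.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectAnnotation:
    category: str                   # fine-grained category from a recognition specialist
    depth: float                    # mean depth (meters) from a monocular depth estimator
    location: str                   # relative location, e.g. "left foreground"
    emotion: Optional[str] = None   # emotion label (people only), from an emotion specialist


def describe_object(obj: ObjectAnnotation) -> str:
    """Render one object's specialist outputs as a caption fragment."""
    fragment = f"a {obj.category} in the {obj.location}"
    if obj.emotion:
        fragment += f", appearing {obj.emotion}"
    fragment += ", close to the camera" if obj.depth < 2.0 else ", in the distance"
    return fragment


def compose_caption(objects: List[ObjectAnnotation], interactions: List[str]) -> str:
    """Combine per-object attributes and object relations into one descriptive caption."""
    clauses = [describe_object(o) for o in objects] + interactions
    return "The image shows " + "; ".join(clauses) + "."


if __name__ == "__main__":
    objects = [
        ObjectAnnotation("golden retriever", depth=1.5, location="left foreground"),
        ObjectAnnotation("young woman", depth=1.8, location="center", emotion="happy"),
    ]
    # An HOI specialist would supply relations such as this one.
    interactions = ["the woman is petting the dog"]
    print(compose_caption(objects, interactions))
```

The template-based merging above is only a stand-in; the point it illustrates is that every attribute appearing in the final caption is grounded in the output of a specialist model trained on annotated data.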