Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Abstract

Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs, construct them from internet images, or rely on human annotation. We propose to enhance image captions by leveraging off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily combined into the pipeline. The complete source code of the DCE pipeline and datasets will be available at https://github.com/syp2ysy/DCE.
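To make the idea of combining specialist outputs into a caption concrete, the following is a minimal sketch and not the DCE implementation. It assumes the specialists have already produced per-object attributes and pairwise relations. The names `ObjectAttributes`, `Relation`, `describe_object`, and `enhance_caption` are hypothetical, and the template-based merging is an assumption made purely for illustration; the abstract does not specify how the attributes are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectAttributes:
    """Hypothetical container for one object's specialist outputs."""
    name: str                      # fine-grained category, e.g. "golden retriever"
    depth: Optional[str] = None    # coarse depth label, e.g. "foreground"
    emotion: Optional[str] = None  # only meaningful for people


@dataclass
class Relation:
    """Hypothetical relation between two objects (relative location or HOI)."""
    subject: str
    predicate: str
    obj: str


def describe_object(o: ObjectAttributes) -> str:
    # Build a short noun phrase from whichever attributes are present.
    # The article is always "a"; a real system would handle "an" as well.
    parts = [f"a {o.name}"]
    if o.emotion:
        parts.append(f"who looks {o.emotion}")
    if o.depth:
        parts.append(f"in the {o.depth}")
    return " ".join(parts)


def enhance_caption(base_caption: str,
                    objects: List[ObjectAttributes],
                    relations: List[Relation]) -> str:
    # Start from the original caption, then append one sentence per object
    # and one sentence per relation (template-based merging, an assumption).
    sentences = [base_caption.rstrip(".") + "."]
    for o in objects:
        sentences.append(f"There is {describe_object(o)}.")
    for r in relations:
        sentences.append(f"The {r.subject} is {r.predicate} the {r.obj}.")
    return " ".join(sentences)


# Illustrative inputs; in practice these would come from the visual specialists.
objects = [
    ObjectAttributes("woman", depth="foreground", emotion="happy"),
    ObjectAttributes("golden retriever", depth="background"),
]
relations = [
    Relation("woman", "holding the leash of", "golden retriever"),
    Relation("golden retriever", "to the left of", "woman"),
]
caption = enhance_caption("A woman walks a dog", objects, relations)
print(caption)
```

Keeping attributes and relations in separate structures means a further specialist could be added by extending one of the two containers, without touching the merging step.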
