Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
January 14, 2025
Authors: Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma
cs.AI
Abstract
We present Omni-RGPT, a multimodal large language model designed to
facilitate region-level comprehension for both images and videos. To achieve
consistent region representation across spatio-temporal dimensions, we
introduce Token Mark, a set of tokens highlighting the target regions within
the visual feature space. These tokens are directly embedded into spatial
regions using region prompts (e.g., boxes or masks) and simultaneously
incorporated into the text prompt to specify the target, establishing a direct
connection between visual and text tokens. To further support robust video
understanding without requiring tracklets, we introduce an auxiliary task that
guides Token Mark by leveraging the consistency of the tokens, enabling stable
region interpretation across the video. Additionally, we introduce a
large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT
achieves state-of-the-art results on image and video-based commonsense
reasoning benchmarks while showing strong performance in captioning and
referring expression comprehension tasks.
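The core idea of Token Mark, as described in the abstract, is to embed a region-identifying token directly into the spatial positions of the visual feature map while the same token also appears in the text prompt. The sketch below illustrates one plausible reading of that mechanism; the additive injection operator, the function name, and the tensor shapes are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def embed_token_mark(visual_feats, region_mask, token_mark):
    """Inject a Token Mark into the spatial positions of a target region.

    visual_feats: (H, W, C) visual feature map
    region_mask:  (H, W) boolean mask derived from a region prompt
                  (e.g., a box or segmentation mask)
    token_mark:   (C,) learned embedding identifying this region

    The same mark (e.g., a special token like "<mark_1>") would also be
    placed in the text prompt, linking the region to its textual mention.
    """
    marked = visual_feats.copy()
    # Additive injection at the masked positions (illustrative choice).
    marked[region_mask] += token_mark
    return marked

# Toy usage: a 4x4 feature map with one 2x2 target region.
H, W, C = 4, 4, 8
feats = np.zeros((H, W, C))
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True
mark = np.ones(C)

out = embed_token_mark(feats, mask, mark)
```

Because the mark is tied to the region identity rather than to a per-frame location, the same token can be injected into the corresponding region in every video frame, which is consistent with the abstract's claim of temporally stable region representation without tracklets.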