Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
January 14, 2025
Authors: Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma
cs.AI
Abstract
We present Omni-RGPT, a multimodal large language model designed to
facilitate region-level comprehension for both images and videos. To achieve
consistent region representation across spatio-temporal dimensions, we
introduce Token Mark, a set of tokens highlighting the target regions within
the visual feature space. These tokens are directly embedded into spatial
regions using region prompts (e.g., boxes or masks) and simultaneously
incorporated into the text prompt to specify the target, establishing a direct
connection between visual and text tokens. To further support robust video
understanding without requiring tracklets, we introduce an auxiliary task that
guides Token Mark by leveraging the consistency of the tokens, enabling stable
region interpretation across the video. Additionally, we introduce a
large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT
achieves state-of-the-art results on image and video-based commonsense
reasoning benchmarks while showing strong performance in captioning and
referring expression comprehension tasks.
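The abstract describes Token Mark as a shared set of embeddings that is injected both into the visual feature map at the prompted region and into the text prompt that refers to that region. The following is a minimal sketch of that idea only; the tensor shapes, the additive injection, the placeholder handling, and all names (`token_marks`, `mark_visual_features`, `mark_text_tokens`) are assumptions made for illustration and are not the paper's implementation.

```python
# Illustrative sketch of the Token Mark idea from the abstract (assumed details,
# not the authors' code): one shared embedding per mark is added to the visual
# features inside the region prompt and substituted at the corresponding
# placeholder position in the text prompt.
import torch

D = 1024        # embedding dimension (assumed)
N_MARKS = 16    # size of the Token Mark set (assumed)

# A set of learnable mark embeddings shared by the visual and text sides.
token_marks = torch.nn.Parameter(torch.randn(N_MARKS, D) * 0.02)

def mark_visual_features(vis_feats, region_mask, mark_id):
    """Embed one Token Mark into the visual features of a target region.

    vis_feats:   (T, H, W, D) per-frame features from the vision encoder
    region_mask: (T, H, W) binary mask derived from a box/mask region prompt
    mark_id:     index of the Token Mark assigned to this region
    """
    marked = vis_feats.clone()
    sel = region_mask.bool()
    marked[sel] = marked[sel] + token_marks[mark_id]
    return marked

def mark_text_tokens(text_embeds, placeholder_positions, mark_id):
    """Place the same Token Mark into the text prompt at the positions of the
    region placeholder tokens (e.g., a token such as <region1>)."""
    marked = text_embeds.clone()
    marked[placeholder_positions] = token_marks[mark_id]
    return marked
```

Because the same `token_marks[mark_id]` vector appears in both modalities, the language model can associate the region mentioned in the text with its location in the visual features without needing per-frame tracklets, which is the connection the abstract attributes to Token Mark.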