Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
April 7, 2025
作者: Yunlong Tang, Jing Bi, Chao Huang, Susan Liang, Daiki Shimada, Hang Hua, Yunzhong Xiao, Yizhi Song, Pinxin Liu, Mingqian Feng, Junjia Guo, Zhuo Liu, Luchuan Song, Ali Vosoughi, Jinxi He, Liu He, Zeliang Zhang, Jiebo Luo, Chenliang Xu
cs.AI
Abstract
We present CAT-V (Caption AnyThing in Video), a training-free framework for
fine-grained object-centric video captioning that enables detailed descriptions
of user-selected objects through time. CAT-V integrates three key components: a
Segmenter based on SAMURAI for precise object segmentation across frames, a
Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection
and temporal analysis, and a Captioner using InternVL-2.5 for generating
detailed object-centric descriptions. Through spatiotemporal visual prompts and
chain-of-thought reasoning, our framework generates detailed, temporally-aware
descriptions of objects' attributes, actions, statuses, interactions, and
environmental contexts without requiring additional training data. CAT-V
supports flexible user interactions through various visual prompts (points,
bounding boxes, and irregular regions) and maintains temporal sensitivity by
tracking object states and interactions across different time segments. Our
approach addresses limitations of existing video captioning methods, which
either produce overly abstract descriptions or lack object-level precision,
enabling fine-grained, object-specific descriptions while maintaining temporal
coherence and spatial accuracy. The GitHub repository for this project is
available at https://github.com/yunlong10/CAT-V.
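To make the three-stage pipeline concrete, below is a minimal, self-contained Python sketch of the flow the abstract describes (Segmenter, then Temporal Analyzer, then Captioner). Every class name, method signature, and return value here is a hypothetical placeholder assumed for illustration; none of it is the actual CAT-V, SAMURAI, TRACE-Uni, or InternVL-2.5 API.

```python
# Structural sketch of the CAT-V pipeline described in the abstract.
# All classes and methods below are hypothetical stand-ins for the real
# components (SAMURAI segmenter, TRACE-Uni temporal analyzer, InternVL-2.5
# captioner); they are NOT the project's actual interfaces.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EventSegment:
    start_frame: int
    end_frame: int


class Segmenter:
    """Placeholder for the SAMURAI-based tracker: propagates a user visual
    prompt (point / box / irregular region) into per-frame object masks."""
    def track(self, frames: List, prompt: Dict) -> List:
        return [prompt for _ in frames]  # stub: one "mask" per frame


class TemporalAnalyzer:
    """Placeholder for TRACE-Uni: detects event boundaries and returns
    the resulting time segments."""
    def split(self, frames: List) -> List[EventSegment]:
        return [EventSegment(0, len(frames))]  # stub: a single segment


class Captioner:
    """Placeholder for the InternVL-2.5 captioner: describes the selected
    object within one segment, given its frames and masks."""
    def describe(self, frames: List, masks: List) -> str:
        return f"object description over {len(frames)} frames"  # stub


def caption_anything(frames: List, prompt: Dict) -> List[str]:
    """Training-free pipeline: Segmenter -> Temporal Analyzer -> Captioner,
    yielding one object-centric caption per detected event segment."""
    masks = Segmenter().track(frames, prompt)
    captions = []
    for seg in TemporalAnalyzer().split(frames):
        seg_frames = frames[seg.start_frame:seg.end_frame]
        seg_masks = masks[seg.start_frame:seg.end_frame]
        captions.append(Captioner().describe(seg_frames, seg_masks))
    return captions


if __name__ == "__main__":
    dummy_frames = [object()] * 8          # stand-in for decoded video frames
    user_prompt = {"point": (120, 200)}    # a click on the target object
    print(caption_anything(dummy_frames, user_prompt))
```

The structure mirrors the abstract's claim that no stage is trained: the segmenter localizes the user-selected object across frames, the temporal analyzer supplies event boundaries so descriptions stay time-sensitive, and the captioner turns each segment's prompted frames into a detailed object-centric description.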