Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
April 7, 2025
Authors: Yunlong Tang, Jing Bi, Chao Huang, Susan Liang, Daiki Shimada, Hang Hua, Yunzhong Xiao, Yizhi Song, Pinxin Liu, Mingqian Feng, Junjia Guo, Zhuo Liu, Luchuan Song, Ali Vosoughi, Jinxi He, Liu He, Zeliang Zhang, Jiebo Luo, Chenliang Xu
cs.AI
Abstract
We present CAT-V (Caption AnyThing in Video), a training-free framework for
fine-grained object-centric video captioning that enables detailed descriptions
of user-selected objects through time. CAT-V integrates three key components: a
Segmenter based on SAMURAI for precise object segmentation across frames, a
Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection
and temporal analysis, and a Captioner using InternVL-2.5 for generating
detailed object-centric descriptions. Through spatiotemporal visual prompts and
chain-of-thought reasoning, our framework generates detailed, temporally aware
descriptions of objects' attributes, actions, statuses, interactions, and
environmental contexts without requiring additional training data. CAT-V
supports flexible user interactions through various visual prompts (points,
bounding boxes, and irregular regions) and maintains temporal sensitivity by
tracking object states and interactions across different time segments. Our
approach addresses limitations of existing video captioning methods, which
either produce overly abstract descriptions or lack object-level precision,
enabling fine-grained, object-specific descriptions while maintaining temporal
coherence and spatial accuracy. The GitHub repository for this project is
available at https://github.com/yunlong10/CAT-V.
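
The three-component design described in the abstract can be pictured as a simple pipeline: segment the selected object in every frame, split the video into event segments, then caption the object within each segment. The sketch below is illustrative only and is not the CAT-V code. Every name in it (`caption_object_in_video`, `ObjectCaption`, the `fake_*` stand-ins, the prompt dictionary layout) is hypothetical, and the stand-ins replace the real SAMURAI, TRACE-Uni, and InternVL-2.5 models, whose interfaces are not given here. It also assumes the temporal analyzer needs only the frames.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = str                 # placeholder for an image array
Mask = str                  # placeholder for a per-frame object mask
Segment = Tuple[int, int]   # [start, end) frame indices of one event


@dataclass
class ObjectCaption:
    start: int
    end: int
    text: str


def caption_object_in_video(
    frames: Sequence[Frame],
    prompt: dict,
    segmenter: Callable[[Sequence[Frame], dict], List[Mask]],
    temporal_analyzer: Callable[[Sequence[Frame]], List[Segment]],
    captioner: Callable[[Sequence[Frame], Sequence[Mask]], str],
) -> List[ObjectCaption]:
    """Hypothetical glue code for a segment -> analyze -> caption pipeline."""
    # Stage 1: one mask per frame for the user-selected object.
    masks = segmenter(frames, prompt)
    if len(masks) != len(frames):
        raise ValueError("segmenter must return one mask per frame")

    # Stage 2: event boundaries as half-open frame ranges.
    segments = temporal_analyzer(frames)

    # Stage 3: one object-centric caption per event segment.
    captions = []
    for start, end in segments:
        text = captioner(frames[start:end], masks[start:end])
        captions.append(ObjectCaption(start, end, text))
    return captions


# Stand-ins for the real models (assumptions, for demonstration only).
def fake_segmenter(frames, prompt):
    return [f"mask({f})" for f in frames]


def fake_temporal_analyzer(frames):
    mid = len(frames) // 2
    return [(0, mid), (mid, len(frames))]


def fake_captioner(frames, masks):
    return f"object visible in {len(frames)} frame(s)"


result = caption_object_in_video(
    ["f0", "f1", "f2", "f3"],
    {"type": "point", "xy": (10, 20)},  # hypothetical visual prompt format
    fake_segmenter,
    fake_temporal_analyzer,
    fake_captioner,
)
```

Passing the three stages in as callables keeps the sketch independent of any particular model, which mirrors the training-free, modular framing in the abstract without claiming anything about how the actual repository is organized.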