Video Action Differencing

March 10, 2025
Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
cs.AI

Abstract

How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
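
The three-stage workflow described in the abstract can be pictured as a small pipeline. The sketch below is only an illustration of that structure, not the authors' implementation: the function `vid_diff`, the `Difference` record, the three callable interfaces, and the toy stand-ins are all hypothetical names introduced here, and the abstract does not specify which models or verdict format each stage uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = str  # placeholder frame type (assumption); a real system would use image arrays


@dataclass
class Difference:
    description: str          # candidate difference, e.g. "deeper knee bend"
    verdict: str = "unknown"  # hypothetical label: which video shows it more


def vid_diff(
    action: str,
    video_a: Sequence[Frame],
    video_b: Sequence[Frame],
    propose: Callable[[str], List[str]],
    localize: Callable[[str, Sequence[Frame]], int],
    compare: Callable[[str, Frame, Frame], str],
) -> List[Difference]:
    """Run the three stages named in the abstract with pluggable components."""
    results = []
    # Stage 1: action difference proposal (candidate differences for this action).
    for description in propose(action):
        # Stage 2: keyframe localization, done independently in each video.
        idx_a = localize(description, video_a)
        idx_b = localize(description, video_b)
        # Stage 3: frame differencing on the two localized keyframes.
        verdict = compare(description, video_a[idx_a], video_b[idx_b])
        results.append(Difference(description, verdict))
    return results


# Toy stand-ins so the sketch runs end to end; these are NOT real models.
def toy_propose(action: str) -> List[str]:
    return [f"{action}: deeper knee bend"]


def toy_localize(description: str, video: Sequence[Frame]) -> int:
    return len(video) // 2  # always pick the middle frame


def toy_compare(description: str, frame_a: Frame, frame_b: Frame) -> str:
    return "A" if frame_a > frame_b else "B"  # arbitrary string comparison


diffs = vid_diff(
    "squat",
    ["a0", "a1", "a2"],
    ["b0", "b1", "b2"],
    toy_propose,
    toy_localize,
    toy_compare,
)
```

Passing each stage in as a callable mirrors the abstract's point that every stage can use its own specialized foundation model; swapping a component does not change the surrounding control flow.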
