Segment Any Motion in Videos

Abstract

Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often yields imperfect predictions because of challenges such as partial motion, complex deformations, motion blur, and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, particularly in challenging scenarios and in fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
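To make the step from sparse trajectories to mask prompts more concrete, the following is a minimal sketch of how per-trajectory motion predictions might be turned into point prompts for a promptable segmenter such as SAM2. It is an illustration under stated assumptions, not the authors' implementation: the `Track` container, the `select_point_prompts` helper, the 0.5 threshold, and the limit of three points are all hypothetical, and no SAM2 call is shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Track:
    """One long-range point trajectory (hypothetical container, not from the paper)."""
    points: List[Tuple[float, float]]  # (x, y) position in each frame
    visible: List[bool]                # whether the point is visible in each frame
    motion_prob: float                 # assumed per-track probability of being in motion


def select_point_prompts(tracks, frame, threshold=0.5, max_points=3):
    """Return up to `max_points` (x, y) prompts for `frame`, most confident first.

    Only tracks that are predicted to move (motion_prob > threshold) and are
    visible in the requested frame are eligible. Both values are assumptions.
    """
    candidates = [
        t for t in tracks
        if t.motion_prob > threshold and t.visible[frame]
    ]
    # Highest motion probability first, so the most reliable points are kept.
    candidates.sort(key=lambda t: t.motion_prob, reverse=True)
    return [t.points[frame] for t in candidates[:max_points]]


# Toy data: two moving tracks (one occluded in frame 1) and one static track.
tracks = [
    Track(points=[(10.0, 12.0), (14.0, 12.5)], visible=[True, True], motion_prob=0.92),
    Track(points=[(40.0, 80.0), (40.0, 80.0)], visible=[True, True], motion_prob=0.08),
    Track(points=[(55.0, 20.0), (60.0, 22.0)], visible=[True, False], motion_prob=0.71),
]
prompts_frame0 = select_point_prompts(tracks, frame=0)
prompts_frame1 = select_point_prompts(tracks, frame=1)
```

In an iterative scheme, prompts like these would be passed to the segmenter, and tracks already covered by a returned mask could be dropped before the next round; that loop is omitted here because its details are not given in the abstract.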
