
TAPNext: Tracking Any Point (TAP) as Next Token Prediction

April 8, 2025
作者: Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
cs.AI

Abstract

Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
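
To make the core idea concrete, here is a minimal conceptual sketch of tracking as causal, per-frame token decoding: a sequence model consumes one frame at a time and decodes each point's position as discrete coordinate tokens, so latency is one frame and no temporal window is accumulated. This is not the authors' architecture; the GRU backbone, coordinate binning, module names, and dimensions are all illustrative assumptions.

```python
# Conceptual sketch (assumptions, not TAPNext's actual implementation):
# a causal recurrent model decodes a tracked point's position one frame
# at a time as quantized coordinate tokens.
import torch
import torch.nn as nn

NUM_BINS = 64    # coordinates quantized into 64 bins per axis (assumption)
FRAME_DIM = 128  # per-frame feature size (assumption)
HIDDEN = 256

class CausalPointDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: frame features concatenated with one-hot previous x/y tokens.
        self.rnn = nn.GRU(FRAME_DIM + 2 * NUM_BINS, HIDDEN, batch_first=True)
        # Separate heads decode the x- and y-coordinate tokens.
        self.head_x = nn.Linear(HIDDEN, NUM_BINS)
        self.head_y = nn.Linear(HIDDEN, NUM_BINS)

    def step(self, frame_feat, prev_xy_tokens, state=None):
        # One online step: fuse the current frame with the previous
        # position tokens and predict the next position tokens.
        prev_onehot = torch.cat(
            [nn.functional.one_hot(t, NUM_BINS).float() for t in prev_xy_tokens],
            dim=-1)
        inp = torch.cat([frame_feat, prev_onehot], dim=-1).unsqueeze(1)
        out, state = self.rnn(inp, state)
        h = out[:, -1]
        x_tok = self.head_x(h).argmax(-1)
        y_tok = self.head_y(h).argmax(-1)
        return (x_tok, y_tok), state

# Online tracking loop over a toy video: the model never looks ahead,
# so it can emit a prediction as soon as each frame arrives.
model = CausalPointDecoder()
xy = (torch.tensor([NUM_BINS // 2]), torch.tensor([NUM_BINS // 2]))  # query point
state = None
for t in range(8):  # 8 fake frames
    frame_feat = torch.randn(1, FRAME_DIM)
    xy, state = model.step(frame_feat, xy, state)
    print(f"frame {t}: token position {(xy[0].item(), xy[1].item())}")
```

In the paper's framing, the decoded tokens come from masked token decoding with end-to-end training; the recurrent stand-in above only illustrates the causal, windowless, per-frame structure the abstract describes.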
