TAPNext: Tracking Any Point (TAP) as Next Token Prediction
April 8, 2025
作者: Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
cs.AI
Abstract
Tracking Any Point (TAP) in a video is a challenging computer vision problem
with many demonstrated applications in robotics, video editing, and 3D
reconstruction. Existing methods for TAP rely heavily on complex
tracking-specific inductive biases and heuristics, limiting their generality
and potential for scaling. To address these challenges, we present TAPNext, a
new approach that casts TAP as sequential masked token decoding. Our model is
causal, tracks in a purely online fashion, and removes tracking-specific
inductive biases. This enables TAPNext to run with minimal latency, and removes
the temporal windowing required by many existing state-of-the-art trackers. Despite
its simplicity, TAPNext achieves a new state-of-the-art tracking performance
among both online and offline trackers. Finally, we present evidence that many
widely used tracking heuristics emerge naturally in TAPNext through end-to-end
training.
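To make the "sequential masked token decoding" framing concrete, below is a minimal sketch of an online, causal tracking loop in this style. Everything here is an illustrative assumption rather than the authors' architecture: the module names (`CausalTracker`, `track_online`), the patch size, the coordinate binning, and the way point state is carried between frames are all hypothetical choices made only to show the shape of the idea, in which per-frame point tokens start as mask tokens and are decoded alongside the current frame.

```python
# Hypothetical sketch of TAP cast as sequential masked token decoding.
# NOT the TAPNext implementation; all shapes and modules are assumptions.
import torch
import torch.nn as nn

PATCH = 8        # assumed image patch size
NUM_BINS = 256   # assumed number of coordinate bins per axis
D = 128          # assumed token embedding width

class CausalTracker(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, D, PATCH, stride=PATCH)  # frame -> patch tokens
        self.query_embed = nn.Linear(2, D)                       # (x, y) query -> token
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D))     # placeholder for unknown positions
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.coord_head = nn.Linear(D, 2 * NUM_BINS)  # decode x/y as coordinate bins
        self.vis_head = nn.Linear(D, 1)               # decode a visibility flag

    @torch.no_grad()
    def track_online(self, frames, queries):
        """frames: (T, 3, H, W); queries: (N, 2) normalized (x, y) at t=0."""
        point_tokens = self.query_embed(queries).unsqueeze(0)  # (1, N, D)
        tracks, visible = [], []
        for frame in frames:  # purely online: one frame at a time, no temporal window
            patches = self.patch_embed(frame.unsqueeze(0))     # (1, D, h, w)
            patches = patches.flatten(2).transpose(1, 2)       # (1, h*w, D)
            # Point tokens for the current frame begin as mask tokens; the
            # model "unmasks" them by attending to the frame and the carried state.
            masked = self.mask_token.expand(-1, point_tokens.shape[1], -1)
            tokens = torch.cat([patches, point_tokens + masked], dim=1)
            out = self.backbone(tokens)[:, patches.shape[1]:]  # updated point tokens
            logits = self.coord_head(out).view(1, -1, 2, NUM_BINS)
            coords = logits.argmax(-1).float() / (NUM_BINS - 1)  # binned coords -> [0, 1]
            tracks.append(coords[0])
            visible.append(self.vis_head(out)[0, :, 0] > 0)
            point_tokens = out  # carry per-point state forward causally
        return torch.stack(tracks), torch.stack(visible)
```

Under these assumptions, calling `track_online` on a `(T, 3, H, W)` clip with `N` query points yields `(T, N, 2)` coordinates and `(T, N)` visibility flags frame by frame, which is what lets a decoder of this form run with minimal latency: each frame is consumed once, and no sliding temporal window is revisited.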