TAPNext: Tracking Any Point (TAP) as Next Token Prediction
April 8, 2025
作者: Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
cs.AI
Abstract
Tracking Any Point (TAP) in a video is a challenging computer vision problem
with many demonstrated applications in robotics, video editing, and 3D
reconstruction. Existing methods for TAP rely heavily on complex
tracking-specific inductive biases and heuristics, limiting their generality
and potential for scaling. To address these challenges, we present TAPNext, a
new approach that casts TAP as sequential masked token decoding. Our model is
causal, tracks in a purely online fashion, and removes tracking-specific
inductive biases. This enables TAPNext to run with minimal latency, and removes
the temporal windowing required by many existing state-of-the-art trackers. Despite
its simplicity, TAPNext achieves a new state-of-the-art tracking performance
among both online and offline trackers. Finally, we present evidence that many
widely used tracking heuristics emerge naturally in TAPNext through end-to-end
training.
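To make the "sequential masked token decoding" framing concrete, below is a minimal sketch of an online, causal tracking loop in this style. Everything here is an illustrative assumption rather than the authors' architecture: the module names (`CausalTracker`, `track_online`), the patch size, the coordinate binning, and the way point state is carried between frames are all hypothetical choices made only to show the shape of the idea, in which per-frame point tokens start as mask tokens and are decoded alongside the current frame.

```python
# Hypothetical sketch of TAP cast as sequential masked token decoding.
# NOT the TAPNext implementation; all shapes and modules are assumptions.
import torch
import torch.nn as nn

PATCH = 8        # assumed image patch size
NUM_BINS = 256   # assumed number of coordinate bins per axis
D = 128          # assumed token embedding width

class CausalTracker(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, D, PATCH, stride=PATCH)  # frame -> patch tokens
        self.query_embed = nn.Linear(2, D)                       # (x, y) query -> token
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D))     # placeholder for unknown positions
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.coord_head = nn.Linear(D, 2 * NUM_BINS)  # decode x/y as coordinate bins
        self.vis_head = nn.Linear(D, 1)               # decode a visibility flag

    @torch.no_grad()
    def track_online(self, frames, queries):
        """frames: (T, 3, H, W); queries: (N, 2) normalized (x, y) at t=0."""
        point_tokens = self.query_embed(queries).unsqueeze(0)  # (1, N, D)
        tracks, visible = [], []
        for frame in frames:  # purely online: one frame at a time, no temporal window
            patches = self.patch_embed(frame.unsqueeze(0))     # (1, D, h, w)
            patches = patches.flatten(2).transpose(1, 2)       # (1, h*w, D)
            # Point tokens for the current frame begin as mask tokens; the
            # model "unmasks" them by attending to the frame and the carried state.
            masked = self.mask_token.expand(-1, point_tokens.shape[1], -1)
            tokens = torch.cat([patches, point_tokens + masked], dim=1)
            out = self.backbone(tokens)[:, patches.shape[1]:]  # updated point tokens
            logits = self.coord_head(out).view(1, -1, 2, NUM_BINS)
            coords = logits.argmax(-1).float() / (NUM_BINS - 1)  # binned coords -> [0, 1]
            tracks.append(coords[0])
            visible.append(self.vis_head(out)[0, :, 0] > 0)
            point_tokens = out  # carry per-point state forward causally
        return torch.stack(tracks), torch.stack(visible)
```

Under these assumptions, calling `track_online` on a `(T, 3, H, W)` clip with `N` query points yields `(T, N, 2)` coordinates and `(T, N)` visibility flags frame by frame, which is what lets a decoder of this form run with minimal latency: each frame is consumed once, and no sliding temporal window is revisited.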