

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

April 8, 2025
作者: Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
cs.AI

Abstract

Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
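To make the "sequential masked token decoding" framing concrete, below is a minimal sketch of the general idea, not the authors' implementation: every name here (`PointTracker`, `coord_vocab`, the 16x16 patch size, and the use of a standard Transformer encoder in place of TAPNext's actual backbone) is an illustrative assumption. Frame patches and per-point slots form one token stream per timestep; point slots after the query frame start as a learned mask token, and a causal attention mask over time keeps decoding purely online.

```python
# Conceptual sketch of tracking-as-token-decoding, inspired by the abstract above.
# NOT the TAPNext architecture; all module names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class PointTracker(nn.Module):
    def __init__(self, d_model=256, n_heads=4, coord_vocab=256, n_layers=2):
        super().__init__()
        self.frame_proj = nn.Linear(3 * 16 * 16, d_model)   # 16x16 RGB patches -> tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learned [MASK] slot
        self.query_proj = nn.Linear(2, d_model)             # (x, y) query point -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        # Decode each point slot into discretized x/y coordinates plus visibility.
        self.coord_head = nn.Linear(d_model, 2 * coord_vocab)
        self.vis_head = nn.Linear(d_model, 1)

    def forward(self, patches, query_xy):
        # patches:  (B, T, N, 3*16*16) flattened frame patches
        # query_xy: (B, P, 2) normalized query coordinates at the first frame
        B, T, N, _ = patches.shape
        P = query_xy.shape[1]
        frame_tok = self.frame_proj(patches)                          # (B, T, N, D)
        point_tok = self.mask_token.expand(B, T * P, -1).clone()      # masked point slots
        point_tok = point_tok.view(B, T, P, -1)
        point_tok[:, 0] = self.query_proj(query_xy)                   # known at query frame
        # One token stream per timestep: frame patches followed by point slots.
        seq = torch.cat([frame_tok, point_tok], dim=2).flatten(1, 2)  # (B, T*(N+P), D)
        # Causal mask over time: frame t may not attend to frames > t (online decoding).
        step = N + P
        t_idx = torch.arange(seq.shape[1], device=seq.device) // step
        causal = t_idx[None, :] > t_idx[:, None]  # True = attention disallowed
        out = self.temporal(seq, mask=causal)
        pts = out.view(B, T, step, -1)[:, :, N:]                      # point slots only
        coords = self.coord_head(pts).view(B, T, P, 2, -1)            # logits over x/y bins
        vis = self.vis_head(pts).squeeze(-1)                          # visibility logits
        return coords, vis
```

Because the attention mask is causal in time, each new frame's point slots can be decoded as they arrive, without temporal windows or revisiting earlier frames; that is the property the abstract credits for TAPNext's minimal latency.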

