TAPNext: Tracking Any Point (TAP) as Next Token Prediction
April 8, 2025
Authors: Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
cs.AI
Abstract
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
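To make the abstract's framing concrete, below is a minimal sketch of what "TAP as sequential masked token decoding" can look like as a causal, online loop: point positions start as masked tokens and are decoded one frame at a time from past observations only. Everything here (the `step` and `track_online` functions, the `MASK` sentinel, the token layout) is a hypothetical illustration of the framing, not the paper's actual model or API.

```python
# Sketch: point tracking as next-token-style decoding, one frame at a time.
# All names and the token layout are illustrative assumptions.
import numpy as np

MASK = -1  # placeholder token for a not-yet-decoded point position

def step(state, frame_tokens, point_tokens):
    """Hypothetical causal model step: consumes one frame's tokens plus the
    current (possibly masked) point tokens, and returns an updated state and
    the decoded point tokens for this frame. A real model would be a learned
    causal sequence model trained end to end; here a stand-in prediction
    keeps the control flow runnable."""
    decoded = np.where(point_tokens == MASK, 0, point_tokens)
    return state, decoded

def track_online(frames, num_points):
    """Purely online tracking loop: one model step per incoming frame, with
    no temporal window, so latency is a single frame."""
    state = None
    tracks = []
    points = np.full(num_points, MASK)  # positions start masked
    for frame_tokens in frames:         # causal: past frames only
        state, points = step(state, frame_tokens, points)
        tracks.append(points.copy())
    return np.stack(tracks)

# Toy usage: 5 frames of dummy tokens, 3 query points.
print(track_online(np.zeros((5, 16)), num_points=3).shape)  # (5, 3)
```

The point of the sketch is the structure the abstract emphasizes: because decoding is causal and per-frame, nothing in the loop requires a sliding temporal window or tracking-specific heuristics.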