

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

April 8, 2025
作者: Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
cs.AI

Abstract

Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
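To make the "sequential masked token decoding" framing concrete, below is a minimal sketch of the general idea, not the authors' implementation: every name here (`PointTracker`, `coord_vocab`, the 16x16 patch size, and the use of a standard Transformer encoder in place of TAPNext's actual backbone) is an illustrative assumption. Frame patches and per-point slots form one token stream per timestep; point slots after the query frame start as a learned mask token, and a causal attention mask over time keeps decoding purely online.

```python
# Conceptual sketch of tracking-as-token-decoding, inspired by the abstract above.
# NOT the TAPNext architecture; all module names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class PointTracker(nn.Module):
    def __init__(self, d_model=256, n_heads=4, coord_vocab=256, n_layers=2):
        super().__init__()
        self.frame_proj = nn.Linear(3 * 16 * 16, d_model)   # 16x16 RGB patches -> tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learned [MASK] slot
        self.query_proj = nn.Linear(2, d_model)             # (x, y) query point -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        # Decode each point slot into discretized x/y coordinates plus visibility.
        self.coord_head = nn.Linear(d_model, 2 * coord_vocab)
        self.vis_head = nn.Linear(d_model, 1)

    def forward(self, patches, query_xy):
        # patches:  (B, T, N, 3*16*16) flattened frame patches
        # query_xy: (B, P, 2) normalized query coordinates at the first frame
        B, T, N, _ = patches.shape
        P = query_xy.shape[1]
        frame_tok = self.frame_proj(patches)                          # (B, T, N, D)
        point_tok = self.mask_token.expand(B, T * P, -1).clone()      # masked point slots
        point_tok = point_tok.view(B, T, P, -1)
        point_tok[:, 0] = self.query_proj(query_xy)                   # known at query frame
        # One token stream per timestep: frame patches followed by point slots.
        seq = torch.cat([frame_tok, point_tok], dim=2).flatten(1, 2)  # (B, T*(N+P), D)
        # Causal mask over time: frame t may not attend to frames > t (online decoding).
        step = N + P
        t_idx = torch.arange(seq.shape[1], device=seq.device) // step
        causal = t_idx[None, :] > t_idx[:, None]  # True = attention disallowed
        out = self.temporal(seq, mask=causal)
        pts = out.view(B, T, step, -1)[:, :, N:]                      # point slots only
        coords = self.coord_head(pts).view(B, T, P, 2, -1)            # logits over x/y bins
        vis = self.vis_head(pts).squeeze(-1)                          # visibility logits
        return coords, vis
```

Because the attention mask is causal in time, each new frame's point slots can be decoded as they arrive, without temporal windows or revisiting earlier frames; that is the property the abstract credits for TAPNext's minimal latency.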

