

Towards Understanding Camera Motions in Any Video

April 21, 2025
Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
cs.AI

Abstract

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
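The abstract's example of novices confusing zoom-in with translating forward reduces to the standard pinhole projection x ~ K[R|t]X: zooming changes the focal length inside the intrinsic matrix K, while dollying forward changes the extrinsic translation t. The following minimal numpy sketch (not from the paper; the point, focal length, and pose values are made up for illustration) shows why the two motions look similar for a single point yet differ across depths:

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X to pixel coordinates
    with the pinhole model: x ~ K (R X + t)."""
    x_cam = R @ X + t            # world -> camera (extrinsics)
    x_img = K @ x_cam            # camera -> image plane (intrinsics)
    return x_img[:2] / x_img[2]  # perspective divide

# A point 10 m in front of the camera, slightly off-axis.
X = np.array([0.5, 0.2, 10.0])
R = np.eye(3)
t = np.zeros(3)
K = np.array([[1000.0,    0.0, 640.0],   # fx, principal point cx
              [   0.0, 1000.0, 360.0],   # fy, principal point cy
              [   0.0,    0.0,   1.0]])

print(project(K, R, t, X))       # baseline view: [690., 380.]

# "Zoom in": double the focal length (intrinsics change only).
K_zoom = K.copy()
K_zoom[0, 0] *= 2
K_zoom[1, 1] *= 2
print(project(K_zoom, R, t, X))  # [740., 400.]

# "Translate forward": move the camera 1 m along its optical axis
# (extrinsics change only); the point ends up 1 m closer in the
# camera frame.
t_fwd = np.array([0.0, 0.0, -1.0])
print(project(K, R, t_fwd, X))   # ~[695.6, 382.2]
```

Both edits push the point away from the principal point, which is why the motions are easy to confuse. The difference is that zooming magnifies all points uniformly regardless of depth, whereas forward translation displaces each point by an amount that depends on its depth; this parallax cue is exactly what tutorial-based training teaches annotators to look for.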
