PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
April 17, 2025
Authors: Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
cs.AI
Abstract
Vision-language models are integral to computer vision research, yet many
high-performing models remain closed-source, obscuring their data, design and
training recipe. The research community has responded by using distillation
from black-box models to label training data, achieving strong benchmark
results, at the cost of measurable scientific progress. However, without
knowing the details of the teacher model and its data sources, scientific
progress remains difficult to measure. In this paper, we study building a
Perception Language Model (PLM) in a fully open and reproducible framework for
transparent research in image and video understanding. We analyze standard
training pipelines without distillation from proprietary models and explore
large-scale synthetic data to identify critical data gaps, particularly in
detailed video understanding. To bridge these gaps, we release 2.8M
human-labeled instances of fine-grained video question-answer pairs and
spatio-temporally grounded video captions. Additionally, we introduce
PLM-VideoBench, a suite for evaluating challenging video understanding tasks
focusing on the ability to reason about the "what", "where", "when", and "how" of a
video. We make our work fully reproducible by providing data, training recipes,
code & models.
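Since the abstract highlights that the data, code, and model checkpoints are openly released, the snippet below is a minimal sketch of how one might query such a checkpoint for visual question answering. It assumes the released weights can be loaded through the Hugging Face transformers library; the model identifier "facebook/Perception-LM-8B", the chat-template prompt format, and the processor interface are all assumptions, not confirmed details from the paper, so the official release should be consulted for the actual loading procedure.

```python
# Hypothetical sketch: loading an openly released PLM checkpoint for image QA.
# The model ID and prompt format below are assumptions; see the official
# PerceptionLM release for the actual identifiers and inference code.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "facebook/Perception-LM-8B"  # placeholder / assumed identifier

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a single-turn prompt with one image and one question.
image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe what is happening in this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```

This is only an illustration of the kind of reproducible usage the open release is meant to enable; the released training recipes and evaluation code (e.g., for PLM-VideoBench) are separate artifacts documented by the authors.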