PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

April 17, 2025
Authors: Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
cs.AI

Abstract

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipes. The research community has responded by distilling from black-box models to label training data, which yields strong benchmark results but comes at the cost of measurable scientific progress: without knowing the details of the teacher model and its data sources, progress is difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite of challenging tasks for evaluating video understanding that focuses on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing our data, training recipes, code, and models.
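To make the two released annotation types concrete, here is a minimal sketch of what one record of each might look like. The schema is an illustrative assumption only; the field names (`video_id`, `start_sec`, `boxes`, and so on) are hypothetical and are not taken from the released dataset files.

```python
# Hypothetical record layout for the human-labeled video annotations.
# All field names and types are illustrative assumptions, not the
# actual released schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedCaption:
    """A spatio-temporally grounded video caption: free-form text tied to
    a time span, plus per-timestamp bounding boxes for the subject."""
    video_id: str
    caption: str
    start_sec: float   # start of the described segment
    end_sec: float     # end of the described segment
    boxes: List[Tuple[float, int, int, int, int]]  # (timestamp, x1, y1, x2, y2)

@dataclass
class VideoQAPair:
    """A fine-grained video question-answer pair."""
    video_id: str
    question: str
    answer: str

# Example instance (values invented for illustration):
cap = GroundedCaption(
    video_id="vid_0001",
    caption="A person picks up a red mug from the counter.",
    start_sec=3.2,
    end_sec=6.8,
    boxes=[(3.2, 410, 220, 520, 360), (5.0, 395, 210, 505, 355)],
)
```

Modeling the grounding as per-timestamp boxes mirrors the abstract's framing: the caption supplies the "what", the boxes the "where", and the time span the "when".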

