
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

April 17, 2025
作者: Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
cs.AI

Abstract

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.
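As a rough illustration of the two annotation types the abstract describes, fine-grained video question-answer pairs and spatio-temporally grounded video captions, here is a minimal Python sketch of what such records could look like. All field names (video_id, time_span, boxes, etc.) are illustrative assumptions, not the actual schema of the released PLM data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoQAPair:
    """Hypothetical layout of one fine-grained video QA annotation."""
    video_id: str   # identifier of the source video clip (assumed field)
    question: str   # detailed question about what happens in the clip
    answer: str     # human-written answer


@dataclass
class GroundedCaptionSegment:
    """Hypothetical layout of one spatio-temporally grounded caption segment."""
    caption: str                     # natural-language description of the event
    time_span: Tuple[float, float]   # (start_sec, end_sec) the caption refers to
    # Per-frame bounding boxes as (frame_index, x1, y1, x2, y2); illustrative only.
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)


@dataclass
class GroundedVideoCaption:
    """Hypothetical container: all grounded caption segments for one video."""
    video_id: str
    segments: List[GroundedCaptionSegment] = field(default_factory=list)


if __name__ == "__main__":
    # Toy example showing how such a record might be populated.
    qa = VideoQAPair(video_id="clip_0001", question="What does the person pick up?", answer="A red mug.")
    cap = GroundedVideoCaption(
        video_id="clip_0001",
        segments=[GroundedCaptionSegment(caption="A person lifts a mug from the table.",
                                         time_span=(2.0, 4.5),
                                         boxes=[(60, 120.0, 80.0, 180.0, 160.0)])],
    )
    print(qa, cap, sep="\n")
```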
