General Object Foundation Model for Images and Videos at Scale
December 14, 2023
Authors: Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
cs.AI
Abstract
In this work we present GLEE, an object-level foundation model for locating
and identifying objects in images and videos. Through a unified framework, GLEE
accomplishes detection, segmentation, tracking, grounding, and identification
of arbitrary objects in the open world scenario for various object perception
tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from
diverse data sources with varying supervision levels to formulate general
object representations, excelling in zero-shot transfer to new data and tasks.
Specifically, we employ an image encoder, text encoder, and visual prompter to
handle multi-modal inputs, enabling it to simultaneously solve various
object-centric downstream tasks while maintaining state-of-the-art performance.
Through extensive training on over five million images from
diverse benchmarks, GLEE exhibits remarkable versatility and improved
generalization performance, efficiently tackling downstream tasks without the
need for task-specific adaptation. By integrating large volumes of
automatically labeled data, we further enhance its zero-shot generalization
capabilities. Additionally, GLEE can be integrated into Large
Language Models, serving as a foundation model that provides universal
object-level information for multi-modal tasks. We hope that the versatility
and universality of our method will mark a significant step in the development
of efficient visual foundation models for AGI systems. The model and code will
be released at https://glee-vision.github.io.
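The abstract describes the architecture only at a high level. The sketch below is a hypothetical, minimal illustration (not the released GLEE code) of how an image encoder, a text encoder, and a visual prompter can feed a shared set of object queries that produce object-level boxes and scores; all module choices, dimensions, and the vocabulary size are assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the released GLEE implementation):
# fuse image features, text tokens, and visual prompts into one context,
# then decode a shared set of object queries into boxes and scores.
import torch
import torch.nn as nn

class ObjectFoundationSketch(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_classes=80):
        super().__init__()
        # Stand-in backbones; a real system would use e.g. a ViT/ResNet
        # image backbone and a CLIP-style text encoder.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.GELU()
        )
        self.text_encoder = nn.Embedding(30522, dim)    # hypothetical vocab size
        self.visual_prompter = nn.Linear(4, dim)        # encodes box prompts (x, y, w, h)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(dim, 4)               # normalized box coordinates
        self.score_head = nn.Linear(dim, num_classes)   # or text-similarity logits

    def forward(self, images, text_ids=None, prompt_boxes=None):
        b = images.shape[0]
        # Image features as a token sequence: (B, HW, C).
        feats = self.image_encoder(images).flatten(2).transpose(1, 2)
        ctx = [feats]
        if text_ids is not None:
            ctx.append(self.text_encoder(text_ids))         # (B, T, C)
        if prompt_boxes is not None:
            ctx.append(self.visual_prompter(prompt_boxes))   # (B, P, C)
        ctx = torch.cat(ctx, dim=1)
        # Object queries attend over the fused multi-modal context.
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        obj, _ = self.cross_attn(q, ctx, ctx)                # object-level representations
        return self.box_head(obj).sigmoid(), self.score_head(obj)

# Usage: a batch of images with optional text tokens and visual prompts.
model = ObjectFoundationSketch()
boxes, scores = model(torch.randn(2, 3, 256, 256),
                      text_ids=torch.randint(0, 30522, (2, 8)),
                      prompt_boxes=torch.rand(2, 3, 4))
```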