General Object Foundation Model for Images and Videos at Scale
December 14, 2023
Authors: Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
cs.AI
Abstract
In this work we present GLEE, an object-level foundation model for locating
and identifying objects in images and videos. Through a unified framework, GLEE
accomplishes detection, segmentation, tracking, grounding, and identification
of arbitrary objects in the open world scenario for various object perception
tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from
diverse data sources with varying supervision levels to formulate general
object representations, excelling in zero-shot transfer to new data and tasks.
Specifically, we employ an image encoder, text encoder, and visual prompter to
handle multi-modal inputs, enabling it to simultaneously solve various
object-centric downstream tasks while maintaining state-of-the-art performance.
Through extensive training on over five million images from
diverse benchmarks, GLEE exhibits remarkable versatility and improved
generalization performance, efficiently tackling downstream tasks without the
need for task-specific adaptation. By integrating large volumes of
automatically labeled data, we further enhance its zero-shot generalization
capabilities. Additionally, GLEE can be integrated into Large
Language Models, serving as a foundation model that provides universal
object-level information for multi-modal tasks. We hope that the versatility
and universality of our method will mark a significant step in the development
of efficient visual foundation models for AGI systems. The model and code will
be released at https://glee-vision.github.io.
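The abstract describes the architecture only at a high level. The sketch below is a hypothetical, minimal illustration (not the released GLEE code) of how an image encoder, a text encoder, and a visual prompter can feed a shared set of object queries that produce object-level boxes and scores; all module choices, dimensions, and the vocabulary size are assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the released GLEE implementation):
# fuse image features, text tokens, and visual prompts into one context,
# then decode a shared set of object queries into boxes and scores.
import torch
import torch.nn as nn

class ObjectFoundationSketch(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_classes=80):
        super().__init__()
        # Stand-in backbones; a real system would use e.g. a ViT/ResNet
        # image backbone and a CLIP-style text encoder.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.GELU()
        )
        self.text_encoder = nn.Embedding(30522, dim)    # hypothetical vocab size
        self.visual_prompter = nn.Linear(4, dim)        # encodes box prompts (x, y, w, h)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(dim, 4)               # normalized box coordinates
        self.score_head = nn.Linear(dim, num_classes)   # or text-similarity logits

    def forward(self, images, text_ids=None, prompt_boxes=None):
        b = images.shape[0]
        # Image features as a token sequence: (B, HW, C).
        feats = self.image_encoder(images).flatten(2).transpose(1, 2)
        ctx = [feats]
        if text_ids is not None:
            ctx.append(self.text_encoder(text_ids))         # (B, T, C)
        if prompt_boxes is not None:
            ctx.append(self.visual_prompter(prompt_boxes))   # (B, P, C)
        ctx = torch.cat(ctx, dim=1)
        # Object queries attend over the fused multi-modal context.
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        obj, _ = self.cross_attn(q, ctx, ctx)                # object-level representations
        return self.box_head(obj).sigmoid(), self.score_head(obj)

# Usage: a batch of images with optional text tokens and visual prompts.
model = ObjectFoundationSketch()
boxes, scores = model(torch.randn(2, 3, 256, 256),
                      text_ids=torch.randint(0, 30522, (2, 8)),
                      prompt_boxes=torch.rand(2, 3, 4))
```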