DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
November 21, 2024
Authors: Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang
cs.AI
Abstract
In this paper, we introduce DINO-X, which is a unified object-centric vision
model developed by IDEA Research with the best open-world object detection
performance to date. DINO-X employs the same Transformer-based encoder-decoder
architecture as Grounding DINO 1.5 to pursue an object-level representation for
open-world object understanding. To make long-tailed object detection easy,
DINO-X extends its input options to support text prompt, visual prompt, and
customized prompt. With such flexible prompt options, we develop a universal
object prompt to support prompt-free open-world detection, making it possible
to detect anything in an image without requiring users to provide any prompt.
To enhance the model's core grounding capability, we have constructed a
large-scale dataset with over 100 million high-quality grounding samples,
referred to as Grounding-100M, for advancing the model's open-vocabulary
detection performance. Pre-training on such a large-scale grounding dataset
leads to a foundational object-level representation, which enables DINO-X to
integrate multiple perception heads to simultaneously support multiple object
perception and understanding tasks, including detection, segmentation, pose
estimation, object captioning, object-based QA, etc. Experimental results
demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro
model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and
LVIS-val zero-shot object detection benchmarks, respectively. Notably, it
scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val
benchmarks, both improving the previous SOTA performance by 5.8 AP. Such a
result underscores its significantly improved capacity for recognizing
long-tailed objects.
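The abstract describes three prompting modes (text prompts, visual prompts, and a learned universal prompt for prompt-free detection) and a set of object-level perception heads that share one representation. The minimal Python sketch below illustrates how such an interface could be organized; `DinoXLike`, `TextPrompt`, `BoxPrompt`, and `PROMPT_FREE` are hypothetical names introduced here for illustration and are not the released DINO-X API.

```python
# Hypothetical sketch only: the classes and head names below are illustrative
# stand-ins for the prompting modes and perception heads described in the abstract.
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class TextPrompt:
    # Category names or phrases to ground in the image, e.g. ["person", "forklift"].
    categories: List[str]


@dataclass
class BoxPrompt:
    # A user-drawn box (x0, y0, x1, y1) used as a visual prompt.
    box: tuple


# Stand-in for the paper's "universal object prompt":
# no user input, the model proposes every object it can find.
PROMPT_FREE = None

Prompt = Optional[Union[TextPrompt, BoxPrompt]]


@dataclass
class Detection:
    box: tuple
    label: str
    score: float
    mask: Optional[object] = None       # filled by a segmentation head
    keypoints: Optional[list] = None    # filled by a pose-estimation head
    caption: Optional[str] = None       # filled by an object-captioning head


class DinoXLike:
    """Toy stand-in for a unified detector with pluggable perception heads."""

    def __init__(self, heads: List[str]):
        # e.g. ["detection", "segmentation", "pose", "caption"]
        self.heads = heads

    def predict(self, image, prompt: Prompt) -> List[Detection]:
        # Conceptually: (1) encode the image and the prompt (text tokens, a box
        # embedding, or the learned universal prompt when prompt is None),
        # (2) decode object queries into boxes and labels, and (3) route each
        # object-level query through every attached head.
        det = Detection(box=(10, 20, 110, 220), label="object", score=0.9)
        if "segmentation" in self.heads:
            det.mask = "<mask placeholder>"
        if "caption" in self.heads:
            det.caption = "an object in the scene"
        return [det]


if __name__ == "__main__":
    model = DinoXLike(heads=["detection", "segmentation", "caption"])
    image = "path/to/image.jpg"  # placeholder input

    # Open-vocabulary detection with a text prompt.
    model.predict(image, TextPrompt(categories=["person", "forklift"]))

    # In-context detection from a visual (box) prompt.
    model.predict(image, BoxPrompt(box=(32, 48, 128, 160)))

    # Prompt-free detection via the universal object prompt.
    model.predict(image, PROMPT_FREE)
```

The point the sketch tries to convey is the abstract's design: a single object-level representation is decoded once per image, and each attached head (detection, segmentation, pose, captioning, QA) reads from it, so the prompting mode and the set of downstream tasks can vary independently.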