DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
November 21, 2024
Authors: Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang
cs.AI
Abstract
In this paper, we introduce DINO-X, which is a unified object-centric vision
model developed by IDEA Research with the best open-world object detection
performance to date. DINO-X employs the same Transformer-based encoder-decoder
architecture as Grounding DINO 1.5 to pursue an object-level representation for
open-world object understanding. To make long-tailed object detection easy,
DINO-X extends its input options to support text prompt, visual prompt, and
customized prompt. With such flexible prompt options, we develop a universal
object prompt to support prompt-free open-world detection, making it possible
to detect anything in an image without requiring users to provide any prompt.
To enhance the model's core grounding capability, we have constructed a
large-scale dataset with over 100 million high-quality grounding samples,
referred to as Grounding-100M, for advancing the model's open-vocabulary
detection performance. Pre-training on such a large-scale grounding dataset
leads to a foundational object-level representation, which enables DINO-X to
integrate multiple perception heads to simultaneously support multiple object
perception and understanding tasks, including detection, segmentation, pose
estimation, object captioning, object-based QA, etc. Experimental results
demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro
model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and
LVIS-val zero-shot object detection benchmarks, respectively. Notably, it
scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val
benchmarks, both improving the previous SOTA performance by 5.8 AP. Such a
result underscores its significantly improved capacity for recognizing
long-tailed objects.
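The abstract describes three prompting modes (text prompts, visual prompts, and a learned universal prompt for prompt-free detection) and a set of object-level perception heads that share one representation. The minimal Python sketch below illustrates how such an interface could be organized; `DinoXLike`, `TextPrompt`, `BoxPrompt`, and `PROMPT_FREE` are hypothetical names introduced here for illustration and are not the released DINO-X API.

```python
# Hypothetical sketch only: the classes and head names below are illustrative
# stand-ins for the prompting modes and perception heads described in the abstract.
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class TextPrompt:
    # Category names or phrases to ground in the image, e.g. ["person", "forklift"].
    categories: List[str]


@dataclass
class BoxPrompt:
    # A user-drawn box (x0, y0, x1, y1) used as a visual prompt.
    box: tuple


# Stand-in for the paper's "universal object prompt":
# no user input, the model proposes every object it can find.
PROMPT_FREE = None

Prompt = Optional[Union[TextPrompt, BoxPrompt]]


@dataclass
class Detection:
    box: tuple
    label: str
    score: float
    mask: Optional[object] = None       # filled by a segmentation head
    keypoints: Optional[list] = None    # filled by a pose-estimation head
    caption: Optional[str] = None       # filled by an object-captioning head


class DinoXLike:
    """Toy stand-in for a unified detector with pluggable perception heads."""

    def __init__(self, heads: List[str]):
        # e.g. ["detection", "segmentation", "pose", "caption"]
        self.heads = heads

    def predict(self, image, prompt: Prompt) -> List[Detection]:
        # Conceptually: (1) encode the image and the prompt (text tokens, a box
        # embedding, or the learned universal prompt when prompt is None),
        # (2) decode object queries into boxes and labels, and (3) route each
        # object-level query through every attached head.
        det = Detection(box=(10, 20, 110, 220), label="object", score=0.9)
        if "segmentation" in self.heads:
            det.mask = "<mask placeholder>"
        if "caption" in self.heads:
            det.caption = "an object in the scene"
        return [det]


if __name__ == "__main__":
    model = DinoXLike(heads=["detection", "segmentation", "caption"])
    image = "path/to/image.jpg"  # placeholder input

    # Open-vocabulary detection with a text prompt.
    model.predict(image, TextPrompt(categories=["person", "forklift"]))

    # In-context detection from a visual (box) prompt.
    model.predict(image, BoxPrompt(box=(32, 48, 128, 160)))

    # Prompt-free detection via the universal object prompt.
    model.predict(image, PROMPT_FREE)
```

The point the sketch tries to convey is the abstract's design: a single object-level representation is decoded once per image, and each attached head (detection, segmentation, pose, captioning, QA) reads from it, so the prompting mode and the set of downstream tasks can vary independently.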