DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

November 21, 2024
Authors: Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang
cs.AI

Abstract

In this paper, we introduce DINO-X, a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompts, visual prompts, and customized prompts. Building on these flexible prompt options, we develop a universal object prompt that supports prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset of over 100 million high-quality grounding samples, referred to as Grounding-100M, to advance the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset yields a foundational object-level representation, which enables DINO-X to integrate multiple perception heads and simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, and object-based QA, among others. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of the LVIS-minival and LVIS-val benchmarks, improving the previous SOTA performance by 5.8 AP on both. These results underscore its significantly improved capacity for recognizing long-tailed objects.
