DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
November 21, 2024
作者: Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang
cs.AI
Abstract
In this paper, we introduce DINO-X, which is a unified object-centric vision
model developed by IDEA Research with the best open-world object detection
performance to date. DINO-X employs the same Transformer-based encoder-decoder
architecture as Grounding DINO 1.5 to pursue an object-level representation for
open-world object understanding. To make long-tailed object detection easy,
DINO-X extends its input options to support text prompt, visual prompt, and
customized prompt. With such flexible prompt options, we develop a universal
object prompt to support prompt-free open-world detection, making it possible
to detect anything in an image without requiring users to provide any prompt.
To enhance the model's core grounding capability, we have constructed a
large-scale dataset with over 100 million high-quality grounding samples,
referred to as Grounding-100M, for advancing the model's open-vocabulary
detection performance. Pre-training on such a large-scale grounding dataset
leads to a foundational object-level representation, which enables DINO-X to
integrate multiple perception heads to simultaneously support multiple object
perception and understanding tasks, including detection, segmentation, pose
estimation, object captioning, object-based QA, etc. Experimental results
demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro
model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and
LVIS-val zero-shot object detection benchmarks, respectively. Notably, it
scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val
benchmarks, both improving the previous SOTA performance by 5.8 AP. Such a
result underscores its significantly improved capacity for recognizing
long-tailed objects.
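The abstract describes three prompting modes (text, visual, and customized) plus a universal object prompt that enables prompt-free detection. A minimal sketch of how such a prompt-dispatch interface might look is below; the class and method names (`DinoXDetector`, `detect`) are illustrative assumptions for this sketch, not the authors' actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of the prompting interface described in the abstract.
# Names here are assumptions; the real model exposes its own interface.

@dataclass
class DinoXDetector:
    """Mock dispatcher over the three prompt modes described in the paper."""

    def detect(self, image, text_prompt=None, visual_prompt=None):
        if text_prompt is not None:
            mode = "text"       # open-vocabulary detection from category names
        elif visual_prompt is not None:
            mode = "visual"     # detect objects resembling user-provided boxes
        else:
            mode = "universal"  # prompt-free: built-in universal object prompt
        # A real model would run the Transformer encoder-decoder here and
        # return object-level predictions; this mock only reports the mode.
        detections = []
        return mode, detections

detector = DinoXDetector()
mode, _ = detector.detect(image=None, text_prompt="person . dog .")
print(mode)  # text
```

The point of the dispatch is that all three modes feed the same object-level representation, which is what lets the perception heads (detection, segmentation, pose, captioning, QA) share one backbone.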