ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
November 27, 2024
Authors: Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
cs.AI
Abstract
Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities; e.g., the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate on the COCO dataset, limiting many tasks that require combining perception and understanding. In this work, we aim to bridge this perception gap from both the model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of perception and understanding in MLLMs. Code is available at https://github.com/IDEA-Research/ChatRex.
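
To make the decoupled perception design concrete, below is a minimal, hypothetical sketch of the retrieval-style formulation the abstract describes: candidate boxes come from a proposal network, are exposed to the LLM as indexed tokens, and the model answers with indices rather than regressing coordinates. The function names, prompt format, and dummy data here are illustrative assumptions, not the actual ChatRex interface.

```python
# Minimal sketch (not the authors' implementation) of the decoupled
# perception idea: the LLM never regresses coordinates; it only refers
# to candidate boxes by index, so detection becomes a retrieval task.
# All names (propose_boxes, format_prompt, parse_indices) are hypothetical.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def propose_boxes(image) -> List[Box]:
    """Stand-in for the universal proposal network: returns candidate boxes.
    Here we simply return fixed dummy proposals."""
    return [
        (10.0, 20.0, 110.0, 220.0),   # e.g., a person
        (150.0, 40.0, 300.0, 180.0),  # e.g., a dog
        (5.0, 5.0, 630.0, 470.0),     # e.g., the whole scene
    ]


def format_prompt(question: str, boxes: List[Box]) -> str:
    """Expose each proposal to the LLM as an indexed token such as <obj0>,
    so the model can answer with indices instead of raw coordinates."""
    box_tokens = " ".join(f"<obj{i}>" for i in range(len(boxes)))
    return f"{question}\nCandidate objects: {box_tokens}"


def parse_indices(llm_output: str, num_boxes: int) -> List[int]:
    """Map the LLM's index tokens in its answer back to proposal indices."""
    return [i for i in range(num_boxes) if f"<obj{i}>" in llm_output]


if __name__ == "__main__":
    image = None  # placeholder; a real pipeline would load pixels here
    boxes = propose_boxes(image)
    prompt = format_prompt("Detect the dog in the image.", boxes)
    print(prompt)

    # Pretend the LLM answered by retrieving the matching index.
    fake_llm_output = "The dog is <obj1>."
    for idx in parse_indices(fake_llm_output, len(boxes)):
        print("Detected box:", boxes[idx])
```

The point of the sketch is the interface, not the models: because the LLM only selects among pre-computed proposals, its output space is a small set of discrete indices, which is the retrieval-style task the abstract contrasts with coordinate regression.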