ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

November 27, 2024
Authors: Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
cs.AI

Abstract

Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities; for example, the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate on the COCO dataset, which limits many tasks that require the combination of perception and understanding. In this work, we aim to bridge this perception gap from both the model-design and data-development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM and let it output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which provides multiple levels of granularity to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities unlocks many attractive applications, demonstrating the complementary roles of perception and understanding in MLLMs. Code is available at https://github.com/IDEA-Research/ChatRex.
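
The decoupled perception design can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration of the retrieval-style interface the abstract describes, not the authors' implementation: all names (propose_boxes, llm_select_indices, detect) are assumptions, and the two stand-in functions return canned values where the real system would run a proposal network and the MLLM.

```python
# Minimal sketch of ChatRex's retrieval-style detection interface.
# Hypothetical names throughout; not the authors' code.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def propose_boxes(image) -> List[Box]:
    """Stand-in for the universal proposal network: returns candidate boxes."""
    # A real system would run a trained, class-agnostic proposal model here.
    return [(10.0, 20.0, 110.0, 220.0), (300.0, 40.0, 420.0, 180.0)]

def llm_select_indices(prompt: str, num_boxes: int) -> List[int]:
    """Stand-in for the MLLM: given the query and indexed box tokens
    (<obj0>, <obj1>, ...), it emits the indices of the matching boxes."""
    # Canned answer; a real MLLM would attend over box embeddings.
    return [0]

def detect(image, query: str) -> List[Box]:
    """Retrieval, not regression: the LLM picks indices, and the
    coordinates are looked up in the proposal list."""
    boxes = propose_boxes(image)
    indices = llm_select_indices(f"Detect: {query}", len(boxes))
    return [boxes[i] for i in indices if 0 <= i < len(boxes)]

if __name__ == "__main__":
    print(detect(image=None, query="person"))  # [(10.0, 20.0, 110.0, 220.0)]
```

The rationale stated in the abstract is that choosing an index from a fixed candidate list is a discrete, token-like prediction, which an autoregressive LLM handles more proficiently than regressing continuous box coordinates.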
