ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Abstract

Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities. For example, the state-of-the-art model Qwen2-VL achieves a recall rate of only 43.9 on the COCO dataset, which limits many tasks that require combining perception and understanding. In this work, we aim to bridge this perception gap from both the model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM and let it output the corresponding box indices to represent its detection results. This turns the regression task into a retrieval-based task, which the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which has multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. Combining these two capabilities unlocks many attractive applications and demonstrates the complementary roles of perception and understanding in MLLMs. Code is available at https://github.com/IDEA-Research/ChatRex.
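The retrieval-style formulation in the abstract can be illustrated with a small sketch. It is not the ChatRex implementation: the `<objN>` index-token format, the function names, and the simplified recall definition (a ground-truth box counts as found if any selected box overlaps it with IoU at or above a threshold) are all assumptions made for illustration. The point it shows is that the language model only has to emit indices into a list of proposal boxes, and the coordinates come from the proposal network.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical index-token format; the real model's tokens may differ.
INDEX_TOKEN = re.compile(r"<obj(\d+)>")


def boxes_from_indices(llm_output: str, proposals: List[Box]) -> List[Box]:
    """Map index tokens in the LLM's text output back to proposal boxes."""
    selected = []
    for match in INDEX_TOKEN.finditer(llm_output):
        idx = int(match.group(1))
        if idx < len(proposals):  # skip indices with no matching proposal
            selected.append(proposals[idx])
    return selected


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall(predicted: List[Box], ground_truth: List[Box], thr: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched by at least one predicted box."""
    if not ground_truth:
        return 0.0  # convention chosen for this sketch
    hits = sum(any(iou(p, g) >= thr for p in predicted) for g in ground_truth)
    return hits / len(ground_truth)


# Proposals would come from the proposal network; these are made-up values.
proposals = [(0, 0, 10, 10), (20, 20, 40, 40), (50, 50, 60, 60)]
selected = boxes_from_indices("dog: <obj1><obj2>", proposals)
print(selected, recall(selected, [(20, 20, 40, 40), (0, 0, 10, 10)]))
```

Because the model picks from existing proposals, a wrong answer is a wrong index, not a slightly misplaced coordinate, which is the sense in which regression becomes retrieval.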
