VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Abstract

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance on both vision and speech tasks remains a significant challenge because of the fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model has both strong visual and speech capabilities, enabling near real-time vision and speech interaction.
