
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

January 3, 2025
Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
cs.AI

Abstract

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance on both vision and speech tasks remains a significant challenge due to the fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.
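
To make the idea of progressive multi-stage training concrete, the sketch below shows one generic way to stage the optimization: freeze the model, then unfreeze only a subset of sub-modules per stage (e.g., vision first, then speech understanding, then speech generation). The module names, stage ordering, and toy data are hypothetical illustrations for exposition only; they are not VITA-1.5's actual architecture or training recipe.

```python
# Minimal conceptual sketch of staged multimodal training (hypothetical, not the paper's recipe).
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, dim=32, vocab=100):
        super().__init__()
        self.vision_encoder = nn.Linear(dim, dim)    # stand-in for a vision encoder
        self.speech_encoder = nn.Linear(dim, dim)    # stand-in for an audio encoder
        self.llm = nn.Linear(dim, dim)               # stand-in for the language model backbone
        self.speech_decoder = nn.Linear(dim, vocab)  # stand-in for a token-based speech/output head

    def forward(self, x, modality):
        h = self.vision_encoder(x) if modality == "vision" else self.speech_encoder(x)
        return self.speech_decoder(self.llm(h))

def run_stage(model, trainable, batches, lr=1e-3):
    """Freeze all parameters, then unfreeze only the sub-modules listed in `trainable`."""
    for p in model.parameters():
        p.requires_grad_(False)
    for name in trainable:
        for p in getattr(model, name).parameters():
            p.requires_grad_(True)
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for x, y, modality in batches:
        opt.zero_grad()
        loss_fn(model(x, modality), y).backward()
        opt.step()

model = ToyMultimodalLM()
dummy = lambda m: [(torch.randn(4, 32), torch.randint(0, 100, (4,)), m) for _ in range(3)]
# Hypothetical schedule: vision-language alignment, then speech understanding, then speech generation.
run_stage(model, ["vision_encoder", "llm"], dummy("vision"))
run_stage(model, ["speech_encoder"], dummy("speech"))
run_stage(model, ["speech_decoder"], dummy("speech"))
```

The design point this illustrates is that each stage adds a modality without disturbing previously learned capabilities, which is the general motivation for training the vision and speech pathways progressively rather than jointly from the start.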
