VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
January 3, 2025
作者: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability but also enables efficient speech-to-speech dialogue without separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts on image, video, and speech benchmarks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.