VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
January 3, 2025
作者: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with strong visual and speech capabilities, enabling near real-time vision and speech interaction.