VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
January 3, 2025
作者: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with strong visual and speech capabilities, enabling near real-time vision and speech interaction.