EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Abstract

GPT-4o, an omni-modal model that supports spoken conversation with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, enabling large language models to perceive and generate images, text, and speech end-to-end using only publicly available data remains a challenge for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models have limited or no visual understanding ability. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips large language models with end-to-end speech capabilities while maintaining leading vision-language performance. Using a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment further improves both vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. In addition, we propose a lightweight style module for flexible control of speech style (e.g., emotion and pitch). EMOVA is the first to achieve state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.

