
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

September 26, 2024
Authors: Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu
cs.AI

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible control of speech style (e.g., emotion and pitch). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.
