OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
January 8, 2025
Authors: Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang
cs.AI
Abstract
Recent advancements in omnimodal learning have been achieved in understanding
and generation across images, text, and speech, though mainly within
proprietary models. Limited omnimodal datasets and the inherent challenges
associated with real-time emotional speech generation have hindered open-source
progress. To address these issues, we propose OpenOmni, a two-stage training
method combining omnimodal alignment and speech generation to develop a
state-of-the-art omnimodal large language model. In the alignment phase, a
pre-trained speech model is further trained on text-image tasks to generalize
from vision to speech in a (near) zero-shot manner, outperforming models
trained on tri-modal datasets. In the speech generation phase, a lightweight
decoder facilitates real-time emotional speech through training on speech tasks
and preference learning. Experiments demonstrate that OpenOmni consistently
improves across omnimodal, vision-language, and speech-language evaluations,
enabling natural, emotion-rich dialogues and real-time emotional speech
generation.
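To make the two-stage recipe in the abstract concrete, below is a minimal, hypothetical sketch of how the training could be structured: stage 1 aligns vision-language components while keeping the pre-trained speech model fixed, and stage 2 trains only a lightweight speech decoder with a speech-task loss plus a preference-learning term. All module names, shapes, and losses are illustrative placeholders and do not reflect OpenOmni's actual code or API.

```python
# Hypothetical sketch of the two-stage training described in the abstract.
# Module names and dimensions are placeholders, not OpenOmni's real components.
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_encoder = nn.Linear(80, 512)    # stand-in for the pre-trained speech model
vision_proj    = nn.Linear(1024, 512)  # maps image features into the LLM space
llm            = nn.Linear(512, 512)   # stand-in for the omnimodal language model
speech_decoder = nn.Linear(512, 256)   # lightweight decoder for real-time speech

# ---- Stage 1: omnimodal alignment on text-image tasks ----
# The speech model stays frozen; only vision-language parameters are updated,
# so vision-to-speech ability is expected to emerge (near) zero-shot.
for p in speech_encoder.parameters():
    p.requires_grad = False
opt1 = torch.optim.AdamW(
    list(vision_proj.parameters()) + list(llm.parameters()), lr=1e-4)

image_feat = torch.randn(4, 1024)      # dummy image features
text_target = torch.randn(4, 512)      # dummy text supervision
loss1 = F.mse_loss(llm(vision_proj(image_feat)), text_target)
loss1.backward(); opt1.step(); opt1.zero_grad()

# ---- Stage 2: real-time emotional speech generation ----
# Only the lightweight decoder is trained, on a speech-task loss plus a
# DPO-style preference term (preferred vs. rejected speech outputs).
opt2 = torch.optim.AdamW(speech_decoder.parameters(), lr=1e-4)
hidden = llm(vision_proj(image_feat)).detach()          # frozen backbone features
speech_loss = F.mse_loss(speech_decoder(hidden), torch.randn(4, 256))
chosen_score   = speech_decoder(hidden).mean()
rejected_score = speech_decoder(hidden + torch.randn_like(hidden)).mean()
pref_loss = -F.logsigmoid(chosen_score - rejected_score)  # preference-learning term
loss2 = speech_loss + pref_loss
loss2.backward(); opt2.step(); opt2.zero_grad()
```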