OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
January 8, 2025
Authors: Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang
cs.AI
Abstract
Recent advances in omnimodal learning have enabled understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges of real-time emotional speech generation have hindered open-source progress. To address these issues, we propose OpenOmni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.
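
The abstract outlines a two-stage recipe: (1) omnimodal alignment, where a speech-pretrained LLM is further trained on text-image data so vision-to-speech ability emerges without tri-modal supervision, and (2) speech generation, where a lightweight decoder is trained on speech tasks and refined with preference learning. The sketch below shows one way such a pipeline could be organized. It is a minimal illustration under stated assumptions: the module names, dataloader formats, and the DPO-style preference loss are our own placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sequence_log_prob(decoder: nn.Module, hidden: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Sum of token log-probabilities the speech decoder assigns to a speech-unit sequence."""
    logits = decoder(hidden)                               # (batch, time, vocab)
    logp = logits.log_softmax(dim=-1)
    return logp.gather(-1, units.unsqueeze(-1)).squeeze(-1).sum(dim=-1)


def stage1_alignment(llm, vision_encoder, speech_decoder, image_text_loader, optimizer):
    """Stage 1 (assumed layout): omnimodal alignment.

    The speech-pretrained LLM is trained only on text-image pairs; the speech
    decoder stays frozen, so vision-to-speech generalization is (near) zero-shot
    rather than learned from tri-modal data."""
    speech_decoder.requires_grad_(False)
    for images, text_ids, labels in image_text_loader:
        visual_tokens = vision_encoder(images)             # project images into the LLM input space
        logits = llm(visual_tokens, text_ids)              # next-token prediction over text
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def stage2_speech(llm, speech_decoder, speech_loader, preference_loader, optimizer, beta: float = 0.1):
    """Stage 2 (assumed layout): speech generation.

    The lightweight decoder is first fit on speech-unit prediction, then tuned
    with a DPO-style preference loss (our assumption; the paper only says
    'preference learning') to favour emotionally appropriate speech."""
    llm.requires_grad_(False)                              # keep the aligned backbone fixed

    # (a) supervised speech-unit prediction from LLM hidden states
    for hidden, speech_units in speech_loader:
        logits = speech_decoder(hidden)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), speech_units.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # (b) preference learning over chosen vs. rejected speech continuations
    for hidden, chosen, rejected, ref_chosen_lp, ref_rejected_lp in preference_loader:
        chosen_lp = sequence_log_prob(speech_decoder, hidden, chosen)
        rejected_lp = sequence_log_prob(speech_decoder, hidden, rejected)
        margin = (chosen_lp - ref_chosen_lp) - (rejected_lp - ref_rejected_lp)
        loss = -F.logsigmoid(beta * margin).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Freezing the speech decoder in stage 1 and the backbone in stage 2 mirrors the abstract's claim that alignment and speech generation are decoupled; the specific freezing choices here are an illustrative assumption.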