
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

January 10, 2025
Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou
cs.AI

Abstract

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
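The latency figures quoted in the abstract compose additively. The minimal sketch below reconstructs that budget using only the numbers stated above (~100ms speech-to-text; ~600ms theoretical and ~800ms practical full-duplex); the split of the remaining budget between turn-taking prediction and the voice decoder's first audio packet, and the practical overhead term, are illustrative assumptions, not figures from the paper.

```python
# Hypothetical full-duplex latency budget, reconstructed from the abstract.
# Only S2T_MS (~100ms), the ~600ms theoretical total, and the ~800ms
# practical total come from the text; the intermediate split is assumed.

S2T_MS = 100           # speech-to-text: user audio -> first text token (from abstract)
TURN_TAKING_MS = 250   # assumed: duplex module deciding the system may start speaking
FIRST_AUDIO_MS = 250   # assumed: voice decoder emitting the first audio packet

theory_ms = S2T_MS + TURN_TAKING_MS + FIRST_AUDIO_MS
practice_ms = theory_ms + 200  # assumed real-world overhead (buffering, scheduling)

assert theory_ms == 600    # matches the abstract's theoretical full-duplex latency
assert practice_ms == 800  # matches the abstract's practical full-duplex latency
print(f"full-duplex latency: ~{theory_ms}ms (theory), ~{practice_ms}ms (practice)")
```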
