MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

January 10, 2025
Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou
cs.AI

Abstract

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
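
The aligned design the abstract describes, a text LLM kept intact with speech aligned to its input side and a lightweight voice decoder on its output side, can be pictured with a short structural sketch. The sketch below is illustrative only, not MinMo's implementation: the module names, hidden sizes, layer counts, and speech-token vocabulary are all assumptions introduced here.

```python
# Minimal structural sketch of an "aligned" speech-text model: a frozen
# text LLM bridged to speech by (1) an input adapter that maps speech
# encoder frames into the LLM embedding space (speech-to-text alignment)
# and (2) an autoregressive voice decoder that predicts discrete speech
# tokens from LLM hidden states (text-to-speech alignment). All sizes
# and names here are illustrative assumptions, not MinMo's actual design.
import torch
import torch.nn as nn


class SpeechToTextAdapter(nn.Module):
    """Projects speech-encoder frames into the LLM's embedding space."""

    def __init__(self, speech_dim: int = 1024, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_frames: torch.Tensor) -> torch.Tensor:
        # (batch, frames, speech_dim) -> (batch, frames, llm_dim)
        return self.proj(speech_frames)


class VoiceDecoder(nn.Module):
    """Autoregressive decoder over discrete speech tokens, conditioned on
    LLM hidden states; a separate vocoder would synthesize the waveform."""

    def __init__(self, llm_dim: int = 3584, n_speech_tokens: int = 4096):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=llm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.tok_emb = nn.Embedding(n_speech_tokens, llm_dim)
        self.head = nn.Linear(llm_dim, n_speech_tokens)

    def forward(self, prev_speech_tokens: torch.Tensor,
                llm_hidden: torch.Tensor) -> torch.Tensor:
        tgt = self.tok_emb(prev_speech_tokens)
        out = self.decoder(tgt=tgt, memory=llm_hidden)
        return self.head(out)  # logits over the next speech tokens


if __name__ == "__main__":
    adapter, voice_dec = SpeechToTextAdapter(), VoiceDecoder()
    speech_frames = torch.randn(1, 50, 1024)       # speech encoder output
    llm_inputs = adapter(speech_frames)            # fed to the frozen LLM
    llm_hidden = torch.randn(1, 10, 3584)          # stand-in for LLM states
    prev_tokens = torch.randint(0, 4096, (1, 10))  # teacher-forced tokens
    print(voice_dec(prev_tokens, llm_hidden).shape)  # [1, 10, 4096]
```

Training only such bridging modules while largely preserving the LLM is what lets an aligned model keep its text abilities, at the cost of needing large paired speech corpora, which the 1.4 million hours of training data cited above addresses.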
