MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Abstract

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interaction fall into two categories: native and aligned. Native models integrate speech and text processing in one framework but struggle with issues such as differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction, which addresses the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs. It also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, as well as mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100 ms, and the full-duplex latency is approximately 600 ms in theory and 800 ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
