LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
March 6, 2025
Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
cs.AI
Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for
multimodal interactions, yet they remain hindered by fine-tuning requirements,
high computational overhead, and text-speech misalignment. Existing
speech-enabled LLMs often degrade conversational quality by modifying the LLM,
thereby compromising its linguistic capabilities. In contrast, we propose
LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS
system that generates high-quality speech with low latency, while fully
preserving the capabilities of the base LLM. Our approach achieves a
significantly lower Word Error Rate compared to speech-enabled LLMs, while
operating at comparable latency and UTMOS score. By decoupling speech synthesis
from LLM processing via a multi-queue token streaming system, LLMVoX supports
seamless, infinite-length dialogues. Its plug-and-play design also facilitates
extension to various tasks with different backbones. Furthermore, LLMVoX
generalizes to new languages with only dataset adaptation, attaining a low
Character Error Rate on an Arabic speech task. Additionally, we have integrated
LLMVoX with a Vision-Language Model to create an omni-model with speech, text,
and vision capabilities, without requiring additional multimodal training. Our
codebase and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
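The core architectural idea above — decoupling speech synthesis from LLM processing via queued token streaming — can be illustrated with an ordinary producer/consumer pattern. This is a minimal sketch, not the paper's implementation: the function names and the placeholder "synthesis" step are hypothetical, and the real system runs the TTS model on streamed text chunks.

```python
import queue
import threading

def llm_producer(text_queue, tokens):
    # Hypothetical stand-in for streaming LLM decoding:
    # each generated text token is pushed to the queue immediately,
    # so the LLM never waits for speech synthesis.
    for tok in tokens:
        text_queue.put(tok)
    text_queue.put(None)  # end-of-stream sentinel

def tts_consumer(text_queue, audio_out):
    # Consumes text tokens as they arrive and emits speech
    # chunk-by-chunk, independent of LLM generation latency.
    while True:
        tok = text_queue.get()
        if tok is None:
            break
        audio_out.append(f"<audio:{tok}>")  # placeholder for synthesis

text_q = queue.Queue()
audio = []
producer = threading.Thread(target=llm_producer,
                            args=(text_q, ["Hello", "world"]))
consumer = threading.Thread(target=tts_consumer, args=(text_q, audio))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(audio)  # audio chunks appear in token order
```

Because the two stages share only a FIFO queue, either side can be swapped out (a different LLM backbone, a different TTS model) without retraining — the plug-and-play property the abstract describes.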