LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

March 6, 2025
Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
cs.AI

Abstract

Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight, 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate than speech-enabled LLMs while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
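
The decoupling idea mentioned in the abstract can be illustrated with a generic producer/consumer sketch. The abstract does not describe the actual queue layout, so everything below is an assumption made for illustration: the sentence-level splitting, the round-robin routing across two queues, and the hypothetical helpers `fake_llm_tokens`, `fake_tts`, and `run_pipeline` are not taken from the LLMVoX code base.

```python
import queue
import threading

SENTINEL = None  # marks the end of a stream on every queue


def fake_llm_tokens():
    """Hypothetical stand-in for a streaming LLM: yields text tokens."""
    for tok in "Hello there . How are you ? Fine .".split():
        yield tok


def fake_tts(text):
    """Hypothetical stand-in for a TTS model: returns a placeholder string."""
    return f"<audio:{text}>"


def producer(token_stream, queues):
    """Cut the token stream at sentence ends and alternate sentences
    across the input queues (an assumed routing policy)."""
    idx, sentence, target = 0, [], 0
    for tok in token_stream:
        sentence.append(tok)
        if tok in {".", "?", "!"}:
            queues[target].put((idx, " ".join(sentence)))
            idx += 1
            sentence = []
            target = (target + 1) % len(queues)
    if sentence:  # flush a trailing fragment without end punctuation
        queues[target].put((idx, " ".join(sentence)))
    for q in queues:
        q.put(SENTINEL)


def worker(in_q, out_q):
    """Drain one input queue, synthesize each chunk, forward the result."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        idx, text = item
        out_q.put((idx, fake_tts(text)))


def run_pipeline(token_stream, n_workers=2):
    """Run text generation and synthesis in separate threads and return
    the synthesized chunks in their original order."""
    in_queues = [queue.Queue() for _ in range(n_workers)]
    out_q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, out_q)) for q in in_queues]
    threads.append(threading.Thread(target=producer, args=(token_stream, in_queues)))
    for t in threads:
        t.start()
    results, finished = {}, 0
    while finished < n_workers:
        item = out_q.get()
        if item is SENTINEL:
            finished += 1
        else:
            results[item[0]] = item[1]
    for t in threads:
        t.join()
    return [results[i] for i in sorted(results)]


if __name__ == "__main__":
    for chunk in run_pipeline(fake_llm_tokens()):
        print(chunk)
```

In this sketch the text generator never waits for synthesis to finish, which is the property the abstract attributes to the decoupled design. For simplicity the sketch collects all chunks before returning them; a real streaming system would play each chunk as soon as it is ready.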
