LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
March 6, 2025
Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
cs.AI
Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for
multimodal interactions, yet they remain hindered by fine-tuning requirements,
high computational overhead, and text-speech misalignment. Existing
speech-enabled LLMs often degrade conversational quality by modifying the LLM,
thereby compromising its linguistic capabilities. In contrast, we propose
LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS
system that generates high-quality speech with low latency, while fully
preserving the capabilities of the base LLM. Our approach achieves a
significantly lower Word Error Rate compared to speech-enabled LLMs, while
operating at comparable latency and UTMOS score. By decoupling speech synthesis
from LLM processing via a multi-queue token streaming system, LLMVoX supports
seamless, infinite-length dialogues. Its plug-and-play design also facilitates
extension to various tasks with different backbones. Furthermore, LLMVoX
generalizes to new languages with only dataset adaptation, attaining a low
Character Error Rate on an Arabic speech task. Additionally, we have integrated
LLMVoX with a Vision-Language Model to create an omni-model with speech, text,
and vision capabilities, without requiring additional multimodal training. Our
codebase and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
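The core architectural idea above — decoupling speech synthesis from LLM processing via queued token streaming — can be illustrated with an ordinary producer/consumer pattern. This is a minimal sketch, not the paper's implementation: the function names and the placeholder "synthesis" step are hypothetical, and the real system runs the TTS model on streamed text chunks.

```python
import queue
import threading

def llm_producer(text_queue, tokens):
    # Hypothetical stand-in for streaming LLM decoding:
    # each generated text token is pushed to the queue immediately,
    # so the LLM never waits for speech synthesis.
    for tok in tokens:
        text_queue.put(tok)
    text_queue.put(None)  # end-of-stream sentinel

def tts_consumer(text_queue, audio_out):
    # Consumes text tokens as they arrive and emits speech
    # chunk-by-chunk, independent of LLM generation latency.
    while True:
        tok = text_queue.get()
        if tok is None:
            break
        audio_out.append(f"<audio:{tok}>")  # placeholder for synthesis

text_q = queue.Queue()
audio = []
producer = threading.Thread(target=llm_producer,
                            args=(text_q, ["Hello", "world"]))
consumer = threading.Thread(target=tts_consumer, args=(text_q, audio))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(audio)  # audio chunks appear in token order
```

Because the two stages share only a FIFO queue, either side can be swapped out (a different LLM backbone, a different TTS model) without retraining — the plug-and-play property the abstract describes.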