

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

February 6, 2025
作者: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue
cs.AI

Abstract

Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both train-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.
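The inference-time scaling described above is, at its core, verifier-guided search: sample multiple candidate utterances from the speech LM and keep the one a speech understanding model prefers. A minimal best-of-N sketch follows; all function names, the fake token sampler, and the scoring rule are illustrative placeholders, not the paper's actual models.

```python
import random

# Placeholder for the speech LM: samples a candidate sequence of
# codec tokens for the input text (here, random fake tokens).
def sample_speech_tokens(text: str, seed: int) -> list[int]:
    rng = random.Random((hash(text) ^ seed) & 0xFFFFFFFF)
    return [rng.randrange(1024) for _ in range(8)]

# Placeholder verifier: in the paper this would be a speech
# understanding model scoring, e.g., content accuracy or emotion.
def verifier_score(text: str, tokens: list[int]) -> float:
    return sum(tokens) / (1024.0 * len(tokens))

def best_of_n(text: str, n: int) -> list[int]:
    """Scale inference-time compute: draw n candidates and return
    the one the verifier scores highest (best-of-N search)."""
    candidates = [sample_speech_tokens(text, seed) for seed in range(n)]
    return max(candidates, key=lambda c: verifier_score(text, c))

tokens = best_of_n("Hello world", n=16)
print(len(tokens))  # 8 fake codec tokens
```

Increasing `n` spends more inference-time compute and biases the output distribution toward the verifier's preference, which is the mechanism the abstract credits for improved expressiveness and accuracy.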

