Continuous Speech Synthesis using per-token Latent Diffusion

October 21, 2024
作者: Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel
cs.AI

Abstract

The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation and extends it to generate variable-length outputs. Our approach utilizes semantic tokens to provide contextual information and determine the stopping condition. We propose three continuous variants of our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competitive, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
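The core idea of a per-token diffusion head can be sketched as follows: instead of predicting a discrete token, a small denoiser conditioned on the transformer's hidden state is trained to predict the noise added to a continuous per-token latent. This is a minimal, hedged sketch of that training objective; the dimensions, the linear stand-in for the MLP denoiser, and the noise schedule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_LATENT = 8   # continuous per-token latent dimension (assumed)
D_HIDDEN = 16  # transformer hidden-state dimension (assumed)
T = 1000       # number of diffusion timesteps (assumed)

# Linear noise schedule: alpha_bar[t] is the cumulative product of (1 - beta_t),
# as in standard DDPM training.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# A linear stand-in for the diffusion-head MLP: it predicts the noise from
# the noisy latent concatenated with the conditioning hidden state.
W = rng.normal(scale=0.1, size=(D_LATENT + D_HIDDEN, D_LATENT))

def diffusion_head_loss(x0, h, t):
    """One training step of the per-token diffusion head.

    x0: clean continuous latent for one token, shape (D_LATENT,)
    h:  transformer hidden state conditioning the head, shape (D_HIDDEN,)
    t:  diffusion timestep index in [0, T)
    """
    eps = rng.normal(size=x0.shape)                             # target noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_pred = np.concatenate([xt, h]) @ W                      # denoiser forward pass
    return float(np.mean((eps_pred - eps) ** 2))                # DDPM noise-prediction MSE

x0 = rng.normal(size=D_LATENT)   # e.g. one frame of a continuous VAE latent
h = rng.normal(size=D_HIDDEN)    # hidden state from the autoregressive backbone
loss = diffusion_head_loss(x0, h, t=500)
```

At inference, such a head would instead run the reverse diffusion chain per token, conditioned on each hidden state, while the semantic-token stream (per the abstract) supplies context and the stopping condition.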

