Continuous Speech Synthesis using per-token Latent Diffusion

October 21, 2024
Authors: Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel
cs.AI

Abstract
The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation and extends it to generate variable-length outputs. Our approach utilizes semantic tokens to provide contextual information and determine the stopping condition. We propose three continuous variants of our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competitive, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with ground-truth audio.
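
The abstract outlines the core mechanism: an autoregressive transformer emits a hidden state per step, a small diffusion head denoises one continuous latent token conditioned on that state, and semantic tokens supply contextual information and the stopping condition. The following is a minimal, hypothetical sketch of such a per-token diffusion sampling loop; the module names, dimensions, and the plain DDPM-style update are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of per-token latent diffusion decoding, in the spirit of
# SALAD as described in the abstract. All names, sizes, and the generic
# DDPM-style reverse step below are assumptions for illustration only.
import torch
import torch.nn as nn

LATENT_DIM, HIDDEN_DIM, T_STEPS = 8, 64, 50  # toy sizes (assumed)

class DiffusionHead(nn.Module):
    """Small MLP that predicts the noise in one continuous latent token,
    conditioned on the transformer's hidden state for that step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + HIDDEN_DIM + 1, HIDDEN_DIM),
            nn.SiLU(),
            nn.Linear(HIDDEN_DIM, LATENT_DIM),
        )

    def forward(self, z_t, cond, t_frac):
        # t_frac is a scalar timestep feature broadcast over the batch
        t_feat = t_frac.expand(z_t.size(0), 1)
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample_token(head, cond):
    """Denoise a single latent token from Gaussian noise (plain DDPM loop)."""
    betas = torch.linspace(1e-4, 0.02, T_STEPS)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(cond.size(0), LATENT_DIM)
    for t in reversed(range(T_STEPS)):
        t_frac = torch.tensor([[t / T_STEPS]])
        eps = head(z, cond, t_frac)
        # posterior mean of the DDPM reverse step
        z = (z - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z

# Autoregressive decoding: each step emits one continuous latent token.
head = DiffusionHead()
cond = torch.zeros(1, HIDDEN_DIM)  # stand-in for a transformer hidden state
latents = [sample_token(head, cond) for _ in range(3)]
print(torch.stack(latents).shape)  # (3, 1, LATENT_DIM)
```

In the paper's setting, `cond` would come from the transformer's hidden state at each autoregressive step, and decoding would stop when the predicted semantic token indicates end of speech; here the loop length is fixed purely for demonstration.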
