Continuous Speech Synthesis using per-token Latent Diffusion

Abstract

The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation and extends it to generate variable-length outputs. Our approach uses semantic tokens to provide contextual information and to determine the stopping condition. We propose three continuous variants of our method, each extending a popular discrete speech synthesis technique. Additionally, we implement a discrete baseline for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
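
To make the phrase "per-token latent diffusion" more concrete, the following is a minimal sketch of a noise-prediction loss applied independently to each token's continuous latent vector. It is not SALAD's implementation: the abstract does not specify the noise schedule, the head architecture, or the training objective, so the cosine schedule, the epsilon-prediction loss, and the names `cosine_alpha_bar`, `noisy_latent`, `per_token_diffusion_loss`, `head`, and `z` are all assumptions made for illustration.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    """Cumulative signal level at step t for a cosine schedule (assumed choice)."""
    return np.cos(0.5 * np.pi * t / T) ** 2

def noisy_latent(x0, eps, t, T):
    """Forward diffusion: mix the clean latent x0 with Gaussian noise eps at step t."""
    a = cosine_alpha_bar(t, T)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def per_token_diffusion_loss(head, x0, z, rng, T=1000):
    """Mean squared error between true and predicted noise, one random step per token.

    x0:   (num_tokens, latent_dim) continuous latents, one vector per token.
    z:    (num_tokens, cond_dim) conditioning vectors, assumed to come from
          an autoregressive backbone (hypothetical here).
    head: callable (x_t, t_normalized, z) -> predicted noise shaped like x_t.
    """
    n = x0.shape[0]
    t = rng.integers(1, T, size=(n, 1))   # independent diffusion step per token
    eps = rng.standard_normal(x0.shape)   # the noise the head should recover
    x_t = noisy_latent(x0, eps, t, T)
    pred = head(x_t, t / T, z)
    return float(np.mean((pred - eps) ** 2))
```

A trivial head that always predicts zero noise, `lambda x_t, t, z: np.zeros_like(x_t)`, is enough to exercise the function. A trained head would instead be a small network conditioned on `z`, and the variable-length stopping rule mentioned in the abstract would live in the outer generation loop, which this sketch does not cover.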
