Zero-shot Cross-lingual Voice Transfer for TTS

Abstract

In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multilingual text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker encoder that processes reference speech, a bottleneck layer, and residual adapters connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and speaker similarity across languages. Using a single English reference speech sample per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss, impacting one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available, a valuable utility for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples and videos demonstrating voice restoration for dysarthric speakers are available at google.github.io/tacotron/publications/zero_shot_voice_transfer.
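The abstract names three components: a speaker encoder, a bottleneck layer, and residual adapters attached to existing TTS layers. The sketch below is only an illustration of how a bottlenecked speaker code could condition a residual adapter. It is not the paper's implementation: every function name, dimension, and nonlinearity here (linear projection with tanh, ReLU adapter, concatenation of hidden state and speaker code) is a hypothetical choice made for the example, and the speaker embedding is a random stand-in for the output of a real speaker encoder.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def bottleneck(speaker_embedding, w_bottleneck):
    """Hypothetical bottleneck: linear projection to a low-dimensional code, then tanh."""
    return [math.tanh(v) for v in matvec(w_bottleneck, speaker_embedding)]


def residual_adapter(hidden, speaker_code, w_down, w_up):
    """Hypothetical adapter: hidden + up(relu(down([hidden; speaker_code])))."""
    joined = hidden + speaker_code  # list concatenation, not addition
    inner = [max(0.0, v) for v in matvec(w_down, joined)]
    delta = matvec(w_up, inner)
    return [h + d for h, d in zip(hidden, delta)]


# Assumed toy dimensions, chosen only for illustration.
EMB_DIM, CODE_DIM, HIDDEN_DIM, ADAPTER_DIM = 8, 2, 6, 3
rng = random.Random(0)

# Stand-ins for a speaker-encoder output and one TTS layer activation.
speaker_embedding = [rng.uniform(-1, 1) for _ in range(EMB_DIM)]
hidden = [rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)]

w_bottleneck = rand_matrix(CODE_DIM, EMB_DIM, rng)
w_down = rand_matrix(ADAPTER_DIM, HIDDEN_DIM + CODE_DIM, rng)
w_up = rand_matrix(HIDDEN_DIM, ADAPTER_DIM, rng)
w_up_zero = [[0.0] * ADAPTER_DIM for _ in range(HIDDEN_DIM)]

speaker_code = bottleneck(speaker_embedding, w_bottleneck)
adapted = residual_adapter(hidden, speaker_code, w_down, w_up)
unchanged = residual_adapter(hidden, speaker_code, w_down, w_up_zero)
```

With a zero up-projection the adapter reduces to the identity, so the underlying TTS layer output passes through unchanged. This is one reason residual adapters are a convenient way to add conditioning to a preexisting model, although the abstract does not state how the adapters in this work are initialized.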
