Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

September 18, 2024
Authors: EverestAI, Sijin Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jingjing Yin, Jianhao Ye, Jixun Yao, Quanlei Yan, Yuguang Yang
cs.AI

Abstract

With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, designed specifically for audiobook production. These models are capable of zero-shot speech generation, producing high-quality speech that is nearly indistinguishable from real human speech and enabling individuals to customize speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model built upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot manner. For Takin VC, we adopt an effective joint content and timbre modeling approach to improve speaker similarity, together with a conditional flow matching based decoder to further enhance naturalness and expressiveness. Finally, we propose the Takin Morphing system with highly decoupled, advanced timbre and prosody modeling, which enables individuals to customize speech generation with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of our Takin AudioLLM series of models. For detailed demos, please refer to https://takinaudiollm.github.io.
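The abstract notes that the Takin VC decoder is based on conditional flow matching. As a rough illustration of that general technique only, and not the authors' implementation, the following PyTorch-style sketch trains a toy velocity network on a straight-line path from noise to target acoustic frames; the module names, feature dimensions, and the content/timbre conditioning tensor are all illustrative assumptions.

```python
# Minimal conditional flow matching (CFM) training step (illustrative sketch).
# All names, shapes, and conditioning inputs are assumptions, not Takin VC's code.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity-field predictor: maps (x_t, t, condition) -> velocity."""
    def __init__(self, feat_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the scalar time over frames and concatenate with the inputs.
        t_feat = t[:, None, None].expand(x_t.size(0), x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def cfm_loss(model, x1, cond):
    """Regress the straight-line velocity from noise x0 to data x1."""
    x0 = torch.randn_like(x1)                                   # prior (noise) sample
    t = torch.rand(x1.size(0), device=x1.device)                # random time in [0, 1]
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1   # linear interpolation path
    target_v = x1 - x0                                          # constant target velocity
    pred_v = model(x_t, t, cond)
    return nn.functional.mse_loss(pred_v, target_v)

# Usage with dummy mel-like frames and a hypothetical content+timbre condition.
model = VelocityNet(feat_dim=80, cond_dim=192)
x1 = torch.randn(4, 100, 80)      # batch of target acoustic frames
cond = torch.randn(4, 100, 192)   # joint content/timbre conditioning (assumed shape)
loss = cfm_loss(model, x1, cond)
loss.backward()
```

At inference, such a decoder would start from noise and integrate the learned velocity field (e.g., with a few Euler steps) while conditioning on the content and timbre representations, which is what gives flow-matching decoders their fast, high-quality sampling.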
