EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control
October 1, 2024
Authors: Haozhe Chen, Run Chen, Julia Hirschberg
cs.AI
Abstract
While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control its intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Building on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control to emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses the emotional expressiveness of commercial TTS services.
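The abstract does not spell out the control mechanism, but since control operates in the speaker representation space of a foundation voice cloning model using few-shot emotional samples, one natural reading is vector arithmetic on speaker embeddings. The sketch below illustrates that reading only; the function names, the pairing of emotional and neutral recordings, and the scalar knob `alpha` are assumptions for illustration, not the paper's confirmed implementation.

```python
import numpy as np

def emotion_direction(emotional_embs: np.ndarray, neutral_embs: np.ndarray) -> np.ndarray:
    """Estimate an emotion direction from few-shot demonstrative samples.

    Assumed inputs: (n, d) arrays of speaker embeddings extracted by a
    foundation voice cloning model from emotional and neutral utterances
    (ideally from the same speakers, so speaker identity cancels out).
    """
    delta = emotional_embs.mean(axis=0) - neutral_embs.mean(axis=0)
    return delta / np.linalg.norm(delta)  # unit-length emotion direction

def apply_emotion(speaker_emb: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a target speaker embedding along the emotion direction.

    alpha acts as the user-facing "knob": 0 reproduces the cloned voice
    unchanged; larger values increase emotion intensity.
    """
    return speaker_emb + alpha * direction

# Usage (hypothetical API): condition synthesis on the shifted embedding.
# audio = tts.synthesize(text, speaker_embedding=apply_emotion(emb, d, alpha=0.7))
```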
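The abstract likewise leaves open how an open-ended text description of an emotion is turned into control. One plausible realization, consistent with "building on the few-shot capability," is to retrieve matching demonstrative samples from an expressive speech corpus whose utterances carry emotion captions. The sketch below assumes such a captioned corpus with precomputed caption embeddings and uses the `sentence-transformers` library; none of these specifics are stated in the abstract.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical setup: an expressive speech corpus where each utterance has
# a free-text emotion caption, embedded offline into caption_embs (n, d).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_emotion_samples(description: str, caption_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank corpus utterances by cosine similarity between the open-ended
    emotion description and each utterance's emotion caption."""
    q = encoder.encode([description])[0]
    sims = caption_embs @ q / (
        np.linalg.norm(caption_embs, axis=1) * np.linalg.norm(q) + 1e-8
    )
    return np.argsort(-sims)[:k]  # indices of the k best-matching utterances

# The retrieved emotional utterances (paired with neutral recordings from
# the same speakers, where available) would then serve as the few-shot
# samples fed to emotion_direction() in the sketch above.
```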