Improving speaker verification robustness with synthetic emotional utterances
November 30, 2024
Authors: Nikhil Kumar Koditala, Chelsea Jui-Ting Ju, Ruirui Li, Minho Jin, Aman Chadha, Andreas Stolcke
cs.AI
Abstract
A speaker verification (SV) system offers an authentication service designed
to confirm whether a given speech sample originates from a specific speaker.
This technology has paved the way for various personalized applications that
cater to individual preferences. A noteworthy challenge faced by SV systems is
their ability to perform consistently across a range of emotional spectra. Most
existing models exhibit high error rates when dealing with emotional utterances
compared to neutral ones. Consequently, this phenomenon often leads to missing
out on speech of interest. This issue primarily stems from the limited
availability of labeled emotional speech data, impeding the development of
robust speaker representations that encompass diverse emotional states.
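To make the verification decision concrete: SV systems typically map an utterance to a fixed-dimensional speaker embedding and accept a trial when the enrollment and test embeddings are similar enough. The sketch below illustrates this with cosine scoring; the embeddings and the threshold value are hypothetical placeholders, not details taken from the paper.

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrolled speaker embedding and a test embedding."""
    return float(np.dot(enroll_emb, test_emb) /
                 (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)))

def verify(enroll_emb: np.ndarray, test_emb: np.ndarray,
           threshold: float = 0.7) -> bool:
    # The 0.7 threshold is illustrative only; deployed systems calibrate
    # it on held-out trials to hit a target operating point.
    return cosine_score(enroll_emb, test_emb) >= threshold
```

Emotional speech shifts these embeddings away from a speaker's neutral enrollment data, which is why scores, and hence decisions, degrade on emotional utterances.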
To address this concern, we propose a novel approach employing the CycleGAN
framework to serve as a data augmentation method. This technique synthesizes
emotional speech segments for each specific speaker while preserving the unique
vocal identity. Our experimental findings underscore the effectiveness of
incorporating synthetic emotional data into the training process. The models
trained using this augmented dataset consistently outperform the baseline
models on the task of verifying speakers in emotional speech scenarios,
reducing equal error rate by as much as 3.64% relative.
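For reference, the equal error rate (EER) is the operating point at which the false-acceptance rate equals the false-rejection rate, and "3.64% relative" means the EER shrinks by that fraction of its baseline value (e.g., a hypothetical 10.0% baseline would drop to about 9.64%). A minimal sketch of computing EER from genuine and impostor trial scores, assuming NumPy; this is illustrative, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Sweep candidate thresholds and return the error rate at the point
    where false acceptances and false rejections (nearly) coincide."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = float(np.mean(impostor >= t))  # impostors wrongly accepted
        frr = float(np.mean(genuine < t))    # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# A 3.64% relative reduction from a hypothetical 10.0% baseline EER:
baseline_eer = 0.100
improved_eer = baseline_eer * (1 - 0.0364)  # ~0.0964, i.e. about 9.64%
```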