Improving speaker verification robustness with synthetic emotional utterances
November 30, 2024
Authors: Nikhil Kumar Koditala, Chelsea Jui-Ting Ju, Ruirui Li, Minho Jin, Aman Chadha, Andreas Stolcke
cs.AI
Abstract
A speaker verification (SV) system offers an authentication service designed
to confirm whether a given speech sample originates from a specific speaker.
This technology has paved the way for various personalized applications that
cater to individual preferences. A noteworthy challenge for SV systems is performing consistently across a range of emotional states. Most existing models exhibit higher error rates on emotional utterances than on neutral ones, so speech of interest is often missed. This issue primarily stems from the limited
availability of labeled emotional speech data, impeding the development of
robust speaker representations that encompass diverse emotional states.
To address this concern, we propose a novel approach that employs the CycleGAN framework as a data augmentation method. This technique synthesizes emotional speech segments for each speaker while preserving that speaker's unique vocal identity. Our experimental findings underscore the effectiveness of
incorporating synthetic emotional data into the training process. The models
trained using this augmented dataset consistently outperform the baseline
models on the task of verifying speakers in emotional speech scenarios,
reducing the equal error rate (EER) by as much as 3.64% relative.
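The abstract does not detail the conversion model, but the core CycleGAN idea behind the augmentation can be sketched: two generators map between neutral and emotional speech, and a cycle-consistency loss forces a round trip to reconstruct the input, which is what keeps the speaker's vocal identity intact. The sketch below is a minimal illustration under assumed choices (PyTorch, small MLP generators over fixed-size feature frames, and the names G_n2e, G_e2n, D_emo, FEAT_DIM, LAMBDA_CYC, LAMBDA_ID are all illustrative), not the paper's implementation.

```python
import torch
import torch.nn as nn

FEAT_DIM = 80  # assumed per-frame acoustic feature size (e.g., mel filterbanks)

def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    """Small two-layer network standing in for the real generator/discriminator."""
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

G_n2e = mlp(FEAT_DIM, FEAT_DIM)                         # generator: neutral -> emotional
G_e2n = mlp(FEAT_DIM, FEAT_DIM)                         # generator: emotional -> neutral
D_emo = nn.Sequential(mlp(FEAT_DIM, 1), nn.Sigmoid())   # discriminator for the emotional domain

bce, l1 = nn.BCELoss(), nn.L1Loss()
LAMBDA_CYC, LAMBDA_ID = 10.0, 5.0  # assumed loss weights, not from the paper

def generator_loss(x_neutral: torch.Tensor, x_emotional: torch.Tensor) -> torch.Tensor:
    """Adversarial + cycle-consistency + identity losses for the neutral->emotional generator."""
    fake_emo = G_n2e(x_neutral)                   # synthetic emotional features
    real_tgt = torch.ones(fake_emo.size(0), 1)
    adv = bce(D_emo(fake_emo), real_tgt)          # try to fool the emotional-domain discriminator
    cyc = l1(G_e2n(fake_emo), x_neutral)          # round trip must recover the input; this is
                                                  # what preserves speaker identity and content
    idt = l1(G_n2e(x_emotional), x_emotional)     # emotional input should pass through unchanged
    return adv + LAMBDA_CYC * cyc + LAMBDA_ID * idt

# Toy usage on random "feature frames"
x_n, x_e = torch.randn(8, FEAT_DIM), torch.randn(8, FEAT_DIM)
loss = generator_loss(x_n, x_e)
loss.backward()  # gradients flow into both generators
```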
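The reported metric, equal error rate, is the operating point where the false-accept and false-reject rates coincide, and "3.64% relative" denotes a proportional reduction against the baseline EER rather than an absolute drop in percentage points. Below is a standard way to compute EER from verification trial scores, assuming numpy and scikit-learn; this is not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the threshold at which false-accept and false-reject rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = same speaker, 0 = different
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # point where the two error curves cross
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy trial list: genuine (same-speaker) trials score higher on average
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
scores = np.concatenate([rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)])
print(f"EER: {compute_eer(labels, scores):.4f}")

# Relative reduction: (EER_baseline - EER_augmented) / EER_baseline.
# A 3.64% relative reduction would take, e.g., a 10.00% baseline EER to about 9.64%.
```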