HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
January 17, 2025
Authors: Shengkui Zhao, Kun Zhou, Zexu Pan, Yukun Ma, Chong Zhang, Bin Ma
cs.AI
Abstract
The application of generative adversarial networks (GANs) has recently
advanced speech super-resolution (SR) based on intermediate representations
like mel-spectrograms. However, existing SR methods typically rely on
independently trained, cascaded networks, which may lead to inconsistent
representations and poor speech quality, especially in out-of-domain scenarios.
In this work, we propose HiFi-SR, a unified network that leverages end-to-end
adversarial training to achieve high-fidelity speech super-resolution. Our
model features a unified transformer-convolutional generator designed to
seamlessly handle both the prediction of latent representations and their
conversion into time-domain waveforms. The transformer network serves as a
powerful encoder, converting low-resolution mel-spectrograms into latent space
representations, while the convolutional network upscales these representations
into high-resolution waveforms. To enhance high-frequency fidelity, we
incorporate a multi-band, multi-scale time-frequency discriminator, along with
a multi-scale mel-reconstruction loss in the adversarial training process.
HiFi-SR is versatile, capable of upscaling any input speech signal between 4
kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that
HiFi-SR significantly outperforms existing speech SR methods across both
objective metrics and ABX preference tests, for both in-domain and
out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
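
The abstract names a multi-scale mel-reconstruction loss but does not spell out its form. The sketch below shows one common way such a loss can be written: log-mel spectrograms are computed at several STFT resolutions and the mean absolute difference is averaged across them. It is not taken from the authors' code. The three scale settings in `SCALES`, the L1 distance on log-mel values, and all function names are assumptions made for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(wave, sr, n_fft, hop, n_mels):
    """Log-mel spectrogram of a 1-D waveform, shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1))
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-5)

# Hypothetical (n_fft, hop, n_mels) settings; the abstract does not give them.
SCALES = [(512, 128, 40), (1024, 256, 80), (2048, 512, 160)]

def multi_scale_mel_loss(pred, target, sr=48000, scales=SCALES):
    """Mean L1 distance between log-mel spectrograms, averaged over scales."""
    losses = []
    for n_fft, hop, n_mels in scales:
        a = log_mel(pred, sr, n_fft, hop, n_mels)
        b = log_mel(target, sr, n_fft, hop, n_mels)
        losses.append(np.mean(np.abs(a - b)))
    return float(np.mean(losses))
```

Under these assumptions, calling `multi_scale_mel_loss(generated, reference)` on two equal-length 48 kHz waveforms of at least 2048 samples returns a single non-negative number that is zero when the two waveforms are identical. Short windows emphasize timing, while long windows resolve fine spectral detail, which is the usual motivation for combining several scales.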