FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
February 6, 2025
Authors: Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli
cs.AI
Abstract
Large language models have revolutionized natural language processing through
self-supervised pretraining on massive datasets. Inspired by this success,
researchers have explored adapting these methods to speech by discretizing
continuous audio into tokens using neural audio codecs. However, existing
approaches face limitations, including high bitrates, the loss of either
semantic or acoustic information, and the reliance on multi-codebook designs
when trying to capture both, which increases architectural complexity for
downstream tasks. To address these challenges, we introduce FocalCodec, an
efficient low-bitrate codec based on focal modulation that utilizes a single
binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec
delivers competitive performance in speech resynthesis and voice conversion at
lower bitrates than the current state-of-the-art, while effectively handling
multilingual speech and noisy environments. Evaluation on downstream tasks
shows that FocalCodec successfully preserves sufficient semantic and acoustic
information, while also being well-suited for generative modeling. Demo
samples, code and checkpoints are available at
https://lucadellalib.github.io/focalcodec-web/.
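As a rough illustration of how a single-codebook codec arrives at bitrates in this range, the bitrate is simply the token rate multiplied by the number of bits per token. The sketch below assumes 13-bit binary codes and token rates of 50, 25 and 12.5 tokens per second; neither value is stated in the abstract, and the function name is hypothetical, so treat this as arithmetic under stated assumptions rather than the authors' configuration.

```python
def bitrate_kbps(tokens_per_second: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec in kbps: tokens/s times bits/token."""
    return tokens_per_second * bits_per_token / 1000.0


# Assumed values (not given in the abstract): 13 bits per token and
# three token rates chosen to span the reported bitrate range.
BITS_PER_TOKEN = 13
TOKEN_RATES = (50.0, 25.0, 12.5)

for rate in TOKEN_RATES:
    kbps = bitrate_kbps(rate, BITS_PER_TOKEN)
    print(f"{rate:>5} tokens/s x {BITS_PER_TOKEN} bits -> {kbps:.4f} kbps")
```

Under these assumptions the endpoints come out to 0.65 kbps and roughly 0.16 kbps, which is consistent with the range quoted in the abstract. With a multi-codebook design the same formula would be multiplied by the number of codebooks, which is one reason a single codebook helps keep the bitrate low.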