FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Abstract

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations: high bitrates, the loss of either semantic or acoustic information, and, when trying to capture both, a reliance on multi-codebook designs that increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that uses a single binary codebook to compress speech to between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than current state-of-the-art codecs, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec preserves sufficient semantic and acoustic information and is well suited to generative modeling. Demo samples, code, and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
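
To make the quoted bitrate range concrete: for a single-codebook codec, the bitrate is the token rate multiplied by the number of bits per token. The sketch below is only an illustration. The 13 bits per token and the token rates of 50, 25, and 12.5 tokens per second are assumptions that are not stated in the abstract; they were chosen because they reproduce the 0.16 and 0.65 kbps endpoints.

```python
def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec: tokens per second times bits per token."""
    return token_rate_hz * bits_per_token / 1000.0


# Hypothetical settings (assumptions for illustration, not given in the abstract).
BITS_PER_TOKEN = 13                 # a binary code of 13 bits, i.e. 2**13 entries
TOKEN_RATES_HZ = (50.0, 25.0, 12.5)

rates = {hz: bitrate_kbps(hz, BITS_PER_TOKEN) for hz in TOKEN_RATES_HZ}

for hz, kbps in rates.items():
    print(f"{hz:5.1f} tokens/s x {BITS_PER_TOKEN} bits = {kbps:.4f} kbps")
```

Under these assumptions, halving the token rate halves the bitrate, while the codebook itself stays fixed.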
