DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Abstract

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show that DM-Codec significantly outperforms state-of-the-art speech tokenization models on the LibriSpeech benchmark dataset, reducing WER by up to 13.46% and WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85%. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
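For readers unfamiliar with the two transcription metrics reported above, the following is a minimal sketch of their standard definitions, not the authors' evaluation code. WER is (S + D + I) / N_ref and WIL is 1 - H^2 / (N_ref * N_hyp), where H, S, D, and I are the hits, substitutions, deletions, and insertions of a minimum edit-distance word alignment. The function names (`align_counts`, `wer`, `wil`) and the example sentences are illustrative assumptions; the ASR system and text normalization used in the paper are not specified here.

```python
def align_counts(ref_words, hyp_words):
    """Return (hits, substitutions, deletions, insertions) for a
    minimum edit-distance alignment of two word lists.
    Ties between equal-cost alignments are broken in favour of more hits."""
    rows, cols = len(ref_words) + 1, len(hyp_words) + 1
    # dp[i][j] = (edit cost, hits, substitutions, deletions, insertions)
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, 0, i, 0)  # delete every reference word so far
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, 0, j)  # insert every hypothesis word so far
    for i in range(1, rows):
        for j in range(1, cols):
            c, h, s, d, n = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diag = (c, h + 1, s, d, n)      # hit
            else:
                diag = (c + 1, h, s + 1, d, n)  # substitution
            c, h, s, d, n = dp[i - 1][j]
            delete = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = dp[i][j - 1]
            insert = (c + 1, h, s, d, n + 1)
            dp[i][j] = min(diag, delete, insert, key=lambda t: (t[0], -t[1]))
    return dp[-1][-1][1:]


def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    _, s, d, i = align_counts(ref, hyp)
    return (s + d + i) / len(ref)


def wil(reference, hypothesis):
    """Word Information Lost: 1 - H^2 / (N_ref * N_hyp)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    if not hyp:
        return 1.0  # nothing was transcribed, so all information is lost
    h, _, _, _ = align_counts(ref, hyp)
    return 1.0 - (h * h) / (len(ref) * len(hyp))


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER = {wer(reference, hypothesis):.4f}")  # 2 errors / 6 words
    print(f"WIL = {wil(reference, hypothesis):.4f}")  # 1 - 16 / (6 * 5)
```

In this example the alignment has four hits, one substitution ("sat" to "sit"), and one deletion ("the"), so WER is 2/6 and WIL is 1 - 16/30. Lower values are better for both metrics, which is why the abstract reports reductions.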
