DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Abstract

Recent advances in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. Precise speech representations require acoustic, semantic, and contextual information. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments on the LibriSpeech benchmark dataset show that DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46% and WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85%. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
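
For readers unfamiliar with the two transcription metrics named above, the following is a minimal sketch of how WER and WIL are commonly computed from a reference and a hypothesis transcript. It is not taken from the DM-Codec repository: the function names `align_counts` and `wer_and_wil` are hypothetical, the word-level Levenshtein alignment is the standard one, and breaking ties between equally cheap alignments in favor of more matching words is an assumption made here for determinism.

```python
def align_counts(reference, hypothesis):
    """Word-level Levenshtein alignment.

    Returns (edit_distance, hits, n_ref, n_hyp). Among alignments with the
    same minimal edit distance, the one with the most hits is chosen
    (an assumption of this sketch, not a property of the metric).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # Each cell stores (cost, -hits); tuple comparison minimizes cost first,
    # then maximizes hits.
    prev = [(j, 0) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0)]
        for j in range(1, len(hyp) + 1):
            match = ref[i - 1] == hyp[j - 1]
            diag_cost, diag_neg_hits = prev[j - 1]
            # Diagonal move: a hit (cost 0) or a substitution (cost 1).
            diag = (diag_cost + (0 if match else 1),
                    diag_neg_hits - (1 if match else 0))
            deletion = (prev[j][0] + 1, prev[j][1])
            insertion = (curr[j - 1][0] + 1, curr[j - 1][1])
            curr.append(min(diag, deletion, insertion))
        prev = curr
    cost, neg_hits = prev[-1]
    return cost, -neg_hits, len(ref), len(hyp)


def wer_and_wil(reference, hypothesis):
    """Return (WER, WIL) for one reference/hypothesis pair."""
    cost, hits, n_ref, n_hyp = align_counts(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    # WER: substitutions + deletions + insertions, per reference word.
    wer = cost / n_ref
    # WIL: 1 - (hits / n_ref) * (hits / n_hyp); an empty hypothesis loses everything.
    wil = 1.0 if n_hyp == 0 else 1.0 - (hits * hits) / (n_ref * n_hyp)
    return wer, wil


if __name__ == "__main__":
    wer, wil = wer_and_wil("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.4f}, WIL = {wil:.4f}")
```

Lower values are better for both metrics. WER can exceed 1 when the hypothesis contains many insertions, whereas WIL stays between 0 and 1.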
