Whisper-GPT: A Hybrid Representation Audio Large Language Model

December 16, 2024
Author: Prateek Verma
cs.AI

Abstract

We propose WHISPER-GPT: a generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms such as ENCODEC. However, one major drawback of this approach is handling the context length: it blows up for high-fidelity generative architectures if one must account for all of the audio content at various frequencies when predicting the next token. By combining a continuous audio representation, such as the spectrogram, with discrete acoustic tokens, we retain the best of both worlds: a single token carries all the information needed from the audio at a specific time instant, yet the LLM can still predict future tokens, retaining sampling and the other benefits a discrete space provides. We show how our architecture improves perplexity and negative log-likelihood scores for next-token prediction compared to token-based LLMs for speech and music.
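
The abstract describes the mechanism only at a high level. The sketch below is a minimal, hypothetical PyTorch rendering of that idea: each time step's input fuses a continuous spectrogram frame with the previous discrete acoustic token, and the model is trained with a next-token cross-entropy loss. The class name `HybridAudioLM`, all dimensions, the additive fusion, and the decoder configuration are illustrative assumptions, not the paper's implementation.

```python
# Minimal, hypothetical sketch (NOT the authors' code): a decoder-style LLM
# whose per-step input fuses a continuous spectrogram frame with the previous
# discrete acoustic token, and whose output is a distribution over the next
# token. All names, sizes, and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class HybridAudioLM(nn.Module):
    def __init__(self, vocab_size=1024, n_mels=80, d_model=512,
                 n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)  # discrete acoustic tokens (ENCODEC-style codes)
        self.spec_proj = nn.Linear(n_mels, d_model)         # continuous spectrogram frame -> model dim
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # run with a causal mask, decoder-only style
        self.head = nn.Linear(d_model, vocab_size)          # next-token logits

    def forward(self, tokens, spec_frames):
        # tokens:      (batch, T)          previous discrete acoustic tokens
        # spec_frames: (batch, T, n_mels)  time-aligned spectrogram frames
        T = tokens.shape[1]
        pos = torch.arange(T, device=tokens.device)
        # Fusion: one vector per time step carries both the full continuous
        # view of the audio at that instant and the discrete token identity.
        x = self.token_emb(tokens) + self.spec_proj(spec_frames) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.backbone(x, mask=causal)
        return self.head(h)  # (batch, T, vocab_size): step t predicts token t+1

# Next-token objective: cross-entropy equals the negative log-likelihood (NLL),
# and perplexity is exp(NLL), the two metrics the abstract reports.
model = HybridAudioLM()
tokens = torch.randint(0, 1024, (2, 16))   # toy batch of token sequences
spec = torch.randn(2, 16, 80)              # matching toy spectrogram frames
logits = model(tokens[:, :-1], spec[:, :-1])
nll = nn.functional.cross_entropy(logits.reshape(-1, 1024), tokens[:, 1:].reshape(-1))
ppl = torch.exp(nll)
```

Because cross-entropy is exactly the negative log-likelihood of the target tokens and perplexity is its exponential, the two reported metrics move together; the sketch's final lines show that relationship directly.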

