Whisper-GPT: A Hybrid Representation Audio Large Language Model
December 16, 2024
Author: Prateek Verma
cs.AI
Abstract
We propose WHISPER-GPT: a generative large language model (LLM) for speech
and music that allows us to work with continuous audio representations and
discrete tokens simultaneously as part of a single architecture. There has been
a huge surge in generative audio, speech, and music models that utilize
discrete audio tokens derived from neural compression algorithms, e.g., ENCODEC.
However, a major drawback of this approach is handling the context length: it
blows up for high-fidelity generative architectures if one has to account for
the audio content at all frequencies when predicting the next token. By
combining a continuous audio representation, such as the spectrogram, with
discrete acoustic tokens, we retain the best of both worlds: a single token
carries all the information needed from the audio at a specific time instant,
yet the LLM still predicts future tokens, allowing for sampling and the other
benefits a discrete space provides. We show how our architecture improves the
perplexity and negative log-likelihood scores for next-token prediction
compared to a token-based LLM, for both speech and music.
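As a rough illustration of the hybrid input described in the abstract, the sketch below fuses a continuous spectrogram frame with the embedding of the previous discrete acoustic token at each timestep, runs the fused sequence through a causal Transformer, and evaluates next-token prediction with negative log-likelihood and perplexity. This is a minimal sketch under assumed details, not WHISPER-GPT's actual implementation: the class name HybridAudioLM, the additive fusion, and all dimensions are hypothetical.

```python
# Hypothetical sketch of a hybrid-representation audio LM; the abstract does
# not specify WHISPER-GPT's layers, fusion scheme, or hyperparameters.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAudioLM(nn.Module):
    def __init__(self, vocab_size=1024, n_mels=80, d_model=512,
                 n_layers=6, n_heads=8, max_len=2048):
        super().__init__()
        # Continuous branch: project each spectrogram frame to the model width.
        self.spec_proj = nn.Linear(n_mels, d_model)
        # Discrete branch: embed the previous acoustic token (e.g., an ENCODEC code).
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, spec, prev_tokens):
        # spec: (B, T, n_mels) continuous frames; prev_tokens: (B, T) token ids.
        T = spec.size(1)
        # Additive fusion: each position sees both views of the same time instant.
        x = self.spec_proj(spec) + self.tok_emb(prev_tokens) + self.pos_emb[:, :T]
        # Causal mask so position t attends only to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(x, mask=mask)
        return self.lm_head(h)  # (B, T, vocab_size) next-token logits

# Toy evaluation: mean NLL over target tokens, and perplexity = exp(NLL).
model = HybridAudioLM()
spec = torch.randn(2, 100, 80)            # dummy spectrogram frames
prev = torch.randint(0, 1024, (2, 100))   # dummy previous tokens
target = torch.randint(0, 1024, (2, 100)) # dummy next-token targets
logits = model(spec, prev)
nll = F.cross_entropy(logits.reshape(-1, 1024), target.reshape(-1))
print(f"NLL = {nll.item():.3f}, perplexity = {math.exp(nll.item()):.1f}")
```

In this sketch the perplexity is simply the exponential of the mean cross-entropy, which is the relationship between the two metrics the abstract reports.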