Whisper-GPT: A Hybrid Representation Audio Large Language Model

Abstract

We propose WHISPER-GPT, a generative large language model (LLM) for speech and music that works with continuous audio representations and discrete tokens simultaneously within a single architecture. There has been a surge in generative audio, speech, and music models that use discrete audio tokens derived from neural compression algorithms such as ENCODEC. A major drawback of this approach is context length: it blows up for high-fidelity generative architectures when the model must account for all the audio content at various frequencies for next-token prediction. By combining a continuous audio representation, such as the spectrogram, with discrete acoustic tokens, we retain the best of both worlds: all the information needed from the audio at a specific time instance is held in a single token, while the LLM still predicts the future token, which allows sampling and the other benefits a discrete space provides. We show how our architecture improves perplexity and negative log-likelihood scores for next-token prediction compared to a token-based LLM for speech and music.
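As a rough illustration of the two quantities the abstract refers to, the sketch below shows how perplexity follows from the mean negative log-likelihood of the correct next tokens, and how the context length grows when every codebook level of every frame becomes its own token. The function names, probabilities, frame rate, and codebook count are hypothetical values chosen for illustration; they are not taken from the paper and do not reproduce its evaluation.

```python
import math


def mean_nll(target_probs):
    """Average negative log-likelihood (in nats) of the true next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def perplexity(target_probs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll(target_probs))


def flattened_context_length(seconds, frames_per_second, codebooks):
    """Token count when each codebook level of each frame is a separate token."""
    return int(seconds * frames_per_second * codebooks)


# Hypothetical probabilities a model assigned to the correct next token.
probs = [0.5, 0.25, 0.125, 0.125]
print("mean NLL:", mean_nll(probs))
print("perplexity:", perplexity(probs))

# Illustrative assumption only: 10 s of audio, 75 frames/s, 8 codebooks.
print("flattened tokens:", flattened_context_length(10, 75, 8))
```

A lower mean negative log-likelihood always gives a lower perplexity, so the two scores move together. The token count scales linearly with the number of codebook levels, which is the growth in context length that the abstract describes.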
