Roadmap towards Superhuman Speech Understanding using Large Language Models

Abstract

The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, with the aim of creating general foundation models that can process both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential of end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, the SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in the use of abstract acoustic knowledge and in the completeness of capabilities. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a roadmap for advancing speech LLMs, introduces a benchmark for evaluation, and provides key insights into their current limitations and potential.
