Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Abstract

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demand for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech and have neglected its integration with other modalities. We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long-speech samples, enabling Lyra to handle complex long-speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks while using fewer computational resources and less training data.
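Strategy (1) builds on LoRA to keep training cheap. The abstract does not describe how the proposed multi-modality LoRA is constructed, so the sketch below illustrates only the standard LoRA idea it extends: a frozen weight matrix W is augmented with a trainable low-rank update scaled by alpha / r. The class name `LoRALinear`, the helper `matmul`, and all sizes are hypothetical choices for illustration, not Lyra's implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as sequences of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Generic LoRA only; how Lyra adapts this per modality is not stated
    in the abstract and is not modeled here.
    """

    def __init__(self, d_in, d_out, rank, alpha, seed=0):
        rng = random.Random(seed)
        # Stand-in for a pretrained, frozen weight of shape (d_out, d_in).
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # A gets a small random init, shape (rank, d_in).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, shape (d_out, rank), so the adapter is initially a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        """x has shape (batch, d_in); returns shape (batch, d_out)."""
        base = matmul(x, list(zip(*self.W)))        # x @ W^T
        hidden = matmul(x, list(zip(*self.A)))      # x @ A^T, shape (batch, rank)
        low = matmul(hidden, list(zip(*self.B)))    # (x @ A^T) @ B^T
        return [
            [b + self.scale * l for b, l in zip(b_row, l_row)]
            for b_row, l_row in zip(base, low)
        ]

    def trainable_params(self):
        # Only A and B are updated during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=8, d_out=4, rank=2, alpha=4)
out = layer.forward([[1.0] * 8])
print(len(out), len(out[0]), layer.trainable_params(), layer.frozen_params())
```

With a small rank, the trainable parameter count grows as r * (d_in + d_out) rather than d_in * d_out, which is the general reason LoRA-style adapters lower training cost for large layers.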

