Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

December 12, 2024
Authors: Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia
cs.AI

Abstract

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
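
The abstract names a multi-modality LoRA but gives no implementation details. The sketch below illustrates one plausible form of the idea: a frozen pretrained projection augmented with a separate trainable low-rank adapter per modality, so only the small adapters are updated during training. All identifiers (`MultiModalityLoRA`, `rank`, `alpha`) and the per-modality routing are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a per-modality LoRA adapter over a frozen linear layer.
# This illustrates the general LoRA idea the abstract refers to; the class
# name, hyperparameters, and per-modality routing are assumptions.
import torch
import torch.nn as nn


class MultiModalityLoRA(nn.Module):
    def __init__(self, base: nn.Linear, modalities=("speech", "vision", "text"),
                 rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the pretrained weights frozen
            p.requires_grad = False
        self.scale = alpha / rank
        # One low-rank (A, B) pair per modality; only these are trained.
        self.lora_a = nn.ModuleDict({
            m: nn.Linear(base.in_features, rank, bias=False) for m in modalities
        })
        self.lora_b = nn.ModuleDict({
            m: nn.Linear(rank, base.out_features, bias=False) for m in modalities
        })
        for m in modalities:
            nn.init.zeros_(self.lora_b[m].weight)  # start as a zero update

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Frozen base projection plus the modality-specific low-rank update.
        lora = self.lora_b[modality](self.lora_a[modality](x))
        return self.base(x) + self.scale * lora
```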
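Likewise, the latent multi-modality regularizer and extractor are only named in the abstract. One common way to "strengthen the relationship between speech and other modalities" in latent space is a symmetric InfoNCE-style alignment loss over paired embeddings; the function below is a sketch under that assumption, not the paper's stated loss.

```python
# Hypothetical sketch of a latent cross-modality regularizer: a symmetric
# InfoNCE-style contrastive loss pulling paired speech/other-modality latents
# together. The concrete loss form here is an assumed illustration.
import torch
import torch.nn.functional as F


def latent_alignment_loss(speech_z: torch.Tensor, other_z: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """speech_z, other_z: (batch, dim) latents from paired inputs."""
    s = F.normalize(speech_z, dim=-1)
    o = F.normalize(other_z, dim=-1)
    logits = s @ o.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)
    # Each speech latent should match its paired latent, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```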
