Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
December 12, 2024
Authors: Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia
cs.AI
Abstract
As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond
single-domain capabilities is essential to meet the demands for more versatile
and efficient AI. However, previous omni-models have insufficiently explored
speech, neglecting its integration with multi-modality. We introduce Lyra, an
efficient MLLM that enhances multimodal abilities, including advanced
long-speech comprehension, sound understanding, cross-modality efficiency, and
seamless speech interaction. To achieve efficiency and speech-centric
capabilities, Lyra employs three strategies: (1) leveraging existing
open-source large models and a proposed multi-modality LoRA to reduce training
costs and data requirements; (2) using a latent multi-modality regularizer and
extractor to strengthen the relationship between speech and other modalities,
thereby enhancing model performance; and (3) constructing a high-quality,
extensive dataset that includes 1.5M multi-modal (language, vision, audio) data
samples and 12K long speech samples, enabling Lyra to handle complex long
speech inputs and achieve more robust omni-cognition. Compared to other
omni-methods, Lyra achieves state-of-the-art performance on various
vision-language, vision-speech, and speech-language benchmarks, while also
using fewer computational resources and less training data.