Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

December 12, 2024
Authors: Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia
cs.AI

Abstract

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
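The first strategy above attaches a multi-modality LoRA to existing open-source models so that only small adapters need to be trained. The abstract gives no implementation details, so the sketch below only illustrates the general idea of per-modality low-rank adapters on a frozen linear layer in PyTorch; the class name `MultiModalityLoRA`, the `modality` routing argument, and all hyperparameters are illustrative assumptions, not Lyra's actual code.

```python
import torch
import torch.nn as nn


class MultiModalityLoRA(nn.Module):
    """Minimal sketch (assumed, not Lyra's implementation): a frozen base
    linear layer plus one low-rank adapter per modality."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8,
                 modalities=("vision", "speech")):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen; only adapters train
        d_in, d_out = base_linear.in_features, base_linear.out_features
        # One (down, up) low-rank pair per modality.
        self.down = nn.ModuleDict(
            {m: nn.Linear(d_in, rank, bias=False) for m in modalities})
        self.up = nn.ModuleDict(
            {m: nn.Linear(rank, d_out, bias=False) for m in modalities})
        for m in modalities:
            nn.init.zeros_(self.up[m].weight)  # adapters start as a no-op

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Frozen base output plus the low-rank update of the active modality.
        return self.base(x) + self.up[modality](self.down[modality](x))


# Usage: wrap a projection layer and route speech tokens through its adapter.
layer = MultiModalityLoRA(nn.Linear(4096, 4096), rank=8)
speech_tokens = torch.randn(1, 20, 4096)
out = layer(speech_tokens, modality="speech")  # shape: (1, 20, 4096)
```

Because the base weights are frozen and the up-projections start at zero, training updates only the small per-modality adapters, which is consistent with the abstract's claim of lower training cost and data requirements.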
