Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
February 6, 2025
Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
cs.AI
Abstract
Recent advances in large language models, particularly following GPT-4o, have
sparked increasing interest in developing omni-modal models capable of
understanding more modalities. While some open-source alternatives have
emerged, they still lag notably behind specialized single-modality models in
performance. In this paper, we present Ola, an omni-modal language model
that achieves competitive performance across image, video, and audio
understanding compared to specialized counterparts. The core design of Ola lies
in its progressive modality alignment strategy, which extends the modalities
supported by the language model step by step. Our training pipeline begins with
the most distinct modalities: image and text, then gradually expands the skill
sets of the model using speech data that connects language and audio knowledge,
and video data that connects all modalities. The progressive learning pipeline
also enables us to keep the cross-modal alignment data relatively small, making
it simple and inexpensive to develop an omni-modal model from existing
vision-language models.
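As a concrete illustration of this staged recipe, below is a minimal, self-contained Python sketch of a progressive modality-alignment schedule. The stage names, dataset labels, and module names are hypothetical stand-ins for exposition, not Ola's actual training code.

```python
# A minimal, self-contained sketch of a progressive modality-alignment
# schedule. All names below (stages, datasets, modules) are illustrative
# assumptions, not Ola's actual code.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    datasets: list       # alignment data mixed in at this stage
    new_modules: list    # modules unfrozen in addition to the LLM

# Begin with the most distinct pair (image/text), then bridge in speech,
# and finally video, which connects vision, audio, and language.
SCHEDULE = [
    Stage("image-text", ["image_caption", "vqa"],
          ["vision_encoder", "vision_projector"]),
    Stage("speech", ["asr", "audio_caption"],
          ["audio_encoder", "audio_projector"]),
    Stage("omni-video", ["video_qa", "video_audio_alignment"], []),
]

def run_schedule(train_stage):
    """Run the stages in order; each stage starts from the previous
    checkpoint, which is why the cross-modal alignment data can stay
    relatively small."""
    for stage in SCHEDULE:
        train_stage(stage)

if __name__ == "__main__":
    run_schedule(lambda s: print(
        f"stage {s.name}: data={s.datasets}, unfreeze={s.new_modules}"))
```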
Moreover, to unlock an advanced interactive experience like GPT-4o, we further
design a sentence-wise decoding solution for streaming speech generation.
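Below is a minimal sketch of how sentence-wise streaming decoding can work in principle: text tokens are buffered until a sentence boundary, and each complete sentence is dispatched to speech synthesis immediately, so audio playback can begin before the full response is generated. The token stream, the boundary regex, and the synthesize() placeholder are assumptions for illustration, not Ola's interfaces.

```python
# Sentence-wise streaming decoding sketch: buffer incremental text tokens
# until a sentence boundary, then hand each complete sentence to a speech
# synthesizer right away. All interfaces here are illustrative placeholders.
import re
from typing import Iterable, Iterator

SENTENCE_END = re.compile(r"([.!?])")

def sentence_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Group an incremental token stream into complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        # re.split with a capture group keeps the delimiters, so parts
        # alternates [text, delimiter, text, delimiter, ..., tail].
        parts = SENTENCE_END.split(buffer)
        while len(parts) >= 2:       # flush every complete sentence
            yield (parts[0] + parts[1]).strip()
            parts = parts[2:]
        buffer = parts[0] if parts else ""
    if buffer.strip():               # trailing text without a delimiter
        yield buffer.strip()

def synthesize(sentence: str) -> bytes:
    """Placeholder TTS call; a real system would return audio samples."""
    return f"<audio:{sentence}>".encode()

if __name__ == "__main__":
    llm_tokens = ["Hel", "lo!", " Speech", " streams", " sentence",
                  " by", " sentence."]
    for sentence in sentence_stream(llm_tokens):
        audio = synthesize(sentence)  # playback can begin here
        print(audio)
```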
Extensive experiments demonstrate that Ola surpasses existing open omni-modal
LLMs across all modalities while achieving
highly competitive performance compared to state-of-the-art specialized models
of similar sizes. We aim to make Ola a fully open omni-modal understanding
solution to advance future research in this emerging field. Model weights,
code, and data are open-sourced at https://github.com/Ola-Omni/Ola.