
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

October 15, 2024
Authors: Zhifei Xie, Changqiao Wu
cs.AI

Abstract

GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, with a similar form of functionality, and we hope it can offer valuable insights for subsequent research.
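
The abstract describes composing pretrained visual and auditory encoders with a single language model that produces both text and audio output. The sketch below illustrates one way such a composition could be wired up in PyTorch; all module choices, dimensions, and the projection scheme are illustrative assumptions for exposition, not the released Mini-Omni2 implementation.

```python
# Hypothetical sketch of the vision + audio -> language model composition
# described in the abstract. Module names and sizes are assumptions.
import torch
import torch.nn as nn

class MiniOmni2Sketch(nn.Module):
    def __init__(self, d_model=512, vocab_text=32000, vocab_audio=4096):
        super().__init__()
        # Stand-ins for pretrained encoders (e.g. a ViT-style vision encoder
        # and a Whisper-style audio encoder), projected into the LM's space.
        self.vision_proj = nn.Linear(768, d_model)   # placeholder projection
        self.audio_proj = nn.Linear(128, d_model)    # placeholder projection
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Separate heads: text tokens and discrete audio (codec) tokens,
        # so the model can answer in both modalities.
        self.text_head = nn.Linear(d_model, vocab_text)
        self.audio_head = nn.Linear(d_model, vocab_audio)

    def forward(self, vision_feats, audio_feats):
        # Project each modality into the shared embedding space and
        # concatenate along the sequence dimension before the LM.
        v = self.vision_proj(vision_feats)
        a = self.audio_proj(audio_feats)
        hidden = self.language_model(torch.cat([v, a], dim=1))
        return self.text_head(hidden), self.audio_head(hidden)

# Dummy inputs: 16 visual patch features and 50 audio frames for one query.
model = MiniOmni2Sketch()
text_logits, audio_logits = model(torch.randn(1, 16, 768), torch.randn(1, 50, 128))
print(text_logits.shape, audio_logits.shape)
```

In this reading, the three-stage training the abstract mentions would first align each projected modality to the language model before joint multi-modal fine-tuning, which is why the pretrained encoders can retain their single-modality performance.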
