OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

Abstract

Recent advances in omnimodal learning have improved understanding and generation across images, text, and speech, though mainly within proprietary models. Open-source progress has been hindered by the limited availability of omnimodal datasets and by the inherent challenges of real-time emotional speech generation. To address these issues, we propose OpenOmni, a two-stage training method that combines omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks so that it generalizes from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder enables real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.
