MIO: A Foundation Model on Multimodal Tokens
September 26, 2024
Authors: Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang
cs.AI
Abstract
In this paper, we introduce MIO, a novel foundation model built on multimodal
tokens, capable of understanding and generating speech, text, images, and
videos in an end-to-end, autoregressive manner. While the emergence of large
language models (LLMs) and multimodal large language models (MM-LLMs) propels
advancements in artificial general intelligence through their versatile
capabilities, they still lack true any-to-any understanding and generation.
Recently, the release of GPT-4o has showcased the remarkable potential of
any-to-any LLMs for complex real-world tasks, enabling omnidirectional input
and output across images, speech, and text. However, it is closed-source and
does not support the generation of multimodal interleaved sequences. To address
this gap, we present MIO, which is trained on a mixture of discrete tokens
across four modalities using causal multimodal modeling. MIO undergoes a
four-stage training process: (1) alignment pre-training, (2) interleaved
pre-training, (3) speech-enhanced pre-training, and (4) comprehensive
supervised fine-tuning on diverse textual, visual, and speech tasks. Our
experimental results indicate that MIO exhibits competitive, and in some cases
superior, performance compared to previous dual-modal baselines, any-to-any
model baselines, and even modality-specific baselines. Moreover, MIO
demonstrates advanced capabilities inherent to its any-to-any feature, such as
interleaved video-text generation, chain-of-visual-thought reasoning, visual
guideline generation, instructional image editing, etc.
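The "causal multimodal modeling" objective described above amounts to ordinary next-token prediction applied to interleaved sequences of discrete tokens drawn from a shared vocabulary. Below is a minimal sketch of that idea, assuming text, image, speech, and video tokenizers each map into ID ranges of one unified vocabulary; the class name, dimensions, and vocabulary size are hypothetical and do not reflect MIO's actual architecture.

```python
# Minimal sketch of causal multimodal modeling, assuming text, image, speech,
# and video tokens are all discrete IDs in one shared vocabulary (hypothetical
# sizes and names; this is not MIO's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalMultimodalLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # A causal mask lets each position attend only to earlier positions,
        # regardless of which modality a given token originally came from.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.backbone(self.embed(tokens), mask=mask)
        return self.lm_head(hidden)

# An interleaved sequence: e.g. text IDs, then image-codebook IDs, then speech
# IDs, concatenated into one stream and trained with next-token prediction.
vocab_size = 50_000
model = CausalMultimodalLM(vocab_size)
tokens = torch.randint(0, vocab_size, (2, 128))  # batch of interleaved sequences
logits = model(tokens[:, :-1])                   # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
```

A shared vocabulary with modality-specific ID ranges is one common way to realize "multimodal tokens"; the abstract does not specify these details, so the sketch should be read as indicative of the training objective only.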