Multimodal Latent Language Modeling with Next-Token Diffusion
December 11, 2024
作者: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei
cs.AI
Abstract
Multimodal generative models require a unified approach to handle both
discrete data (e.g., text and code) and continuous data (e.g., image, audio,
video). In this work, we propose Latent Language Modeling (LatentLM), which
seamlessly integrates continuous and discrete data using causal Transformers.
Specifically, we employ a variational autoencoder (VAE) to represent continuous
data as latent vectors and introduce next-token diffusion for autoregressive
generation of these vectors. Additionally, we develop sigma-VAE to address
the challenges of variance collapse, which is crucial for autoregressive
modeling. Extensive experiments demonstrate the effectiveness of LatentLM
across various modalities. In image generation, LatentLM surpasses Diffusion
Transformers in both performance and scalability. When integrated into
multimodal large language models, LatentLM provides a general-purpose interface
that unifies multimodal generation and understanding. Experimental results show
that LatentLM achieves favorable performance compared with Transfusion and
vector-quantized models when scaling up training tokens. In
text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2
model in speaker similarity and robustness, while requiring 10x fewer decoding
steps. The results establish LatentLM as a highly effective and scalable
approach to advancing large multimodal models.
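To make the core mechanism concrete, here is a minimal, hypothetical PyTorch sketch of next-token diffusion as the abstract describes it: a causal backbone produces a per-position state, and a small diffusion head is trained with a DDPM-style epsilon-prediction loss to denoise the next VAE latent vector. The module names, MLP head design, and noise schedule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Lightweight head that predicts the noise added to the next latent
    vector, conditioned on the backbone state and the diffusion timestep."""
    def __init__(self, d_model: int, d_latent: int, n_steps: int = 100):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, d_model)
        self.net = nn.Sequential(
            nn.Linear(d_model + d_latent, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_latent),
        )

    def forward(self, h, z_noisy, t):
        cond = h + self.t_embed(t)                       # fuse state and timestep
        return self.net(torch.cat([cond, z_noisy], -1))  # epsilon prediction

def training_step(backbone, head, latents, alphas_bar):
    """One training step: the causal backbone reads latents[:, :-1] and the
    head denoises latents[:, 1:] (DDPM-style epsilon-prediction loss)."""
    B, T, _ = latents.shape
    h = backbone(latents[:, :-1])                        # (B, T-1, d_model)
    target = latents[:, 1:]                              # next-token latents
    t = torch.randint(0, alphas_bar.numel(), (B, T - 1))
    eps = torch.randn_like(target)
    a = alphas_bar[t].unsqueeze(-1)                      # cumulative alpha_bar_t
    z_noisy = a.sqrt() * target + (1 - a).sqrt() * eps   # forward diffusion
    eps_hat = head(h.reshape(B * (T - 1), -1),
                   z_noisy.reshape(B * (T - 1), -1),
                   t.reshape(-1))
    return nn.functional.mse_loss(eps_hat, eps.reshape(B * (T - 1), -1))

# Toy usage; a stand-in linear layer replaces the causal Transformer backbone.
d_model, d_latent, n_steps = 64, 16, 100
backbone = nn.Linear(d_latent, d_model)
head = DiffusionHead(d_model, d_latent, n_steps)
alphas_bar = torch.linspace(0.999, 0.01, n_steps)
loss = training_step(backbone, head, torch.randn(2, 8, d_latent), alphas_bar)
loss.backward()
```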
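The abstract also credits sigma-VAE with preventing variance collapse in the latents. One plausible reading, stated here as an assumption rather than the paper's confirmed design, is a posterior with a fixed standard deviation sigma instead of a learned one, so the latent scale cannot shrink toward zero under autoregressive training. The sketch below illustrates that variant; all names are hypothetical.

```python
import math
import torch
import torch.nn as nn

class SigmaVAEEncoder(nn.Module):
    """VAE encoder with a fixed posterior std (a guess at 'sigma-VAE'):
    only the mean is learned, so the latent variance cannot collapse."""
    def __init__(self, d_in: int, d_latent: int, sigma: float = 0.5):
        super().__init__()
        self.mu = nn.Linear(d_in, d_latent)  # learned posterior mean
        self.sigma = sigma                   # fixed posterior std

    def forward(self, x):
        mu = self.mu(x)
        z = mu + self.sigma * torch.randn_like(mu)  # reparameterization trick
        # KL(N(mu, sigma^2 I) || N(0, I)); with sigma fixed, its terms are
        # constants, so gradients only flow through mu.
        kl = 0.5 * (mu.pow(2) + self.sigma ** 2 - 1
                    - 2 * math.log(self.sigma)).sum(-1).mean()
        return z, kl

enc = SigmaVAEEncoder(d_in=32, d_latent=16)
z, kl = enc(torch.randn(4, 32))  # z feeds the autoregressive model above
```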