Multimodal Latent Language Modeling with Next-Token Diffusion

December 11, 2024
Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei
cs.AI

Abstract

Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop sigma-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.
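To make the next-token diffusion idea in the abstract more concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' released code): a causal Transformer runs over a sequence of VAE latent vectors, and a small denoising MLP head, conditioned on each hidden state, predicts the noise added to the next latent. The class names (`LatentLMSketch`, `NextTokenDiffusionHead`), the MLP head, and the linear noising schedule are illustrative assumptions; the paper's actual head architecture and diffusion schedule may differ.

```python
import torch
import torch.nn as nn


class NextTokenDiffusionHead(nn.Module):
    """Small MLP that denoises the next latent, conditioned on the Transformer state."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent + d_model + 1, 4 * d_model),
            nn.SiLU(),
            nn.Linear(4 * d_model, d_latent),
        )

    def forward(self, noisy_latent, cond, t):
        # noisy_latent: (B, T, d_latent), cond: (B, T, d_model), t: (B, T, 1) in [0, 1]
        return self.net(torch.cat([noisy_latent, cond, t], dim=-1))


class LatentLMSketch(nn.Module):
    """Causal Transformer over VAE latents plus a next-token diffusion head."""

    def __init__(self, d_latent=16, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(d_latent, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = NextTokenDiffusionHead(d_model, d_latent)

    def forward(self, latents, noisy_next, t):
        # latents:    (B, T, d_latent) clean VAE latents of the observed prefix
        # noisy_next: (B, T, d_latent) noised versions of the next-step target latents
        # t:          (B, T, 1)        per-position diffusion timesteps
        T = latents.size(1)
        # Causal (upper-triangular -inf) attention mask so each position only sees its prefix.
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=latents.device), diagonal=1
        )
        h = self.backbone(self.in_proj(latents), mask=mask)
        return self.head(noisy_next, h, t)  # predicted noise for each next latent


# Toy training step: denoising (noise-prediction) loss on the shifted latents.
if __name__ == "__main__":
    B, T, d_latent = 2, 8, 16
    model = LatentLMSketch(d_latent=d_latent)
    latents = torch.randn(B, T, d_latent)            # stand-in for sigma-VAE latents
    target_next = torch.roll(latents, shifts=-1, dims=1)  # toy next-step targets (wrap-around ignored)
    t = torch.rand(B, T, 1)
    noise = torch.randn_like(target_next)
    noisy_next = (1 - t) * target_next + t * noise   # simple linear noising schedule, for illustration
    pred = model(latents, noisy_next, t)
    loss = nn.functional.mse_loss(pred, noise)
    loss.backward()
    print(loss.item())
```

At inference, one would instead run the diffusion head for a few denoising steps per position to sample the next latent vector, then feed it back into the causal Transformer, with the VAE decoder mapping generated latents back to pixels or audio; the abstract's claim of roughly 10x fewer decoding steps than VALL-E 2 reflects this short per-token denoising loop.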
