Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

Abstract

There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and is therefore inefficient on both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o, based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), an established approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve training convergence. Empirically, in text-to-image generation, Show-o Turbo achieves a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at https://github.com/zhijie-group/Show-o-Turbo.
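
To make the "parallel decoding of text tokens" idea more concrete, the following is a minimal sketch of Jacobi-style fixed-point decoding, one common way to decode a block of tokens in parallel. The abstract does not state that Show-o Turbo uses exactly this form, so treat it as an illustrative assumption. The function toy_next_token is a hypothetical stand-in for a greedy language model, and all names here are invented for the example. Each iteration refines every position of the block at once, so the sequence of intermediate guesses can be read as a denoising-like trajectory from an arbitrary initial guess to the final text.

```python
from typing import Callable, List, Optional, Tuple

NextTokenFn = Callable[[List[int]], int]


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical deterministic stand-in for greedy decoding with a real model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def autoregressive_decode(next_token: NextTokenFn, prompt: List[int], n: int) -> List[int]:
    """Baseline: generate n tokens one at a time, each conditioned on all previous ones."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(prompt + out))
    return out


def jacobi_decode(
    next_token: NextTokenFn,
    prompt: List[int],
    n: int,
    init: Optional[List[int]] = None,
) -> Tuple[List[int], int]:
    """Refine a block of n tokens jointly until it stops changing.

    Every position i is re-predicted from the *current guess* of positions
    before i. With a real model, the n predictions of one iteration would come
    from a single batched forward pass; here they are computed in a Python
    loop only for clarity.
    """
    guess = list(init) if init is not None else [0] * n
    for iteration in range(1, n + 1):
        new_guess = [next_token(prompt + guess[:i]) for i in range(n)]
        if new_guess == guess:  # fixed point reached early
            return new_guess, iteration
        guess = new_guess
    # After n iterations the first n positions are guaranteed to match the
    # greedy autoregressive output, so the block has converged.
    return guess, n


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, iterations = jacobi_decode(toy_next_token, prompt, n=8)
    print(tokens, iterations)
    print(tokens == autoregressive_decode(toy_next_token, prompt, n=8))
```

In this sketch the fixed point coincides with the greedy autoregressive output, and at most n iterations are needed. A speedup appears only when the model converges in fewer iterations than there are tokens, which is the behavior that a distillation procedure such as the one described above would aim to encourage.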
