Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

February 8, 2025
Authors: Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng
cs.AI

Abstract

There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. Inference in Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and is therefore inefficient on both sides. This paper introduces Show-o Turbo to close this gap. We first identify a unified denoising perspective for the generation of images and text in Show-o, based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), an established approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve training convergence. Empirically, in text-to-image generation, Show-o Turbo achieves a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at https://github.com/zhijie-group/Show-o-Turbo.
