Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

March 10, 2025
Authors: Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
cs.AI

Abstract

The rapid advancement of diffusion models has catalyzed remarkable progress in image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions. It adeptly handles text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances accuracy and richness in image descriptions. In particular, Seedream integrates a self-developed bilingual large language model as its text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. In addition, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as reflected in its outstanding ELO score. Finally, it can be readily adapted into an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
