LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation

Abstract

In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit limitations in multilingual processing, hindering image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. Subsequently, we incorporate a lightweight adapter and a cross-modal refiner to facilitate efficient feature alignment and interaction between LLMs and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.
