LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation
February 25, 2025
Authors: Pengzhi Li, Pengfei Yu, Zide Liu, Wei He, Xuhao Pan, Xudong Rao, Tao Wei, Wei Chen
cs.AI
Abstract
In this paper, we introduce LDGen, a novel method for integrating large
language models (LLMs) into existing text-to-image diffusion models while
minimizing computational demands. Traditional text encoders, such as CLIP and
T5, exhibit limitations in multilingual processing, hindering image generation
across diverse languages. We address these challenges by leveraging the
advanced capabilities of LLMs. Our approach employs a language representation
strategy that applies hierarchical caption optimization and human instruction
techniques to derive precise semantic information. Subsequently, we
incorporate a lightweight adapter and a cross-modal refiner to facilitate
efficient feature alignment and interaction between LLMs and image features.
LDGen reduces training time and enables zero-shot multilingual image
generation. Experimental results indicate that our method surpasses baseline
models in both prompt adherence and image aesthetic quality, while seamlessly
supporting multiple languages. Project page: https://zrealli.github.io/LDGen.
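The abstract's lightweight adapter and cross-modal refiner can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction, not the paper's implementation: all dimensions (`llm_dim`, `cond_dim`, sequence lengths) and the specific layer choices (a two-layer MLP adapter, a single cross-attention block) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LLMAdapter(nn.Module):
    """Lightweight adapter: projects LLM hidden states into the diffusion
    model's text-conditioning space. Dimensions are illustrative only."""
    def __init__(self, llm_dim=4096, cond_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_states):          # (B, T, llm_dim)
        return self.proj(llm_states)        # (B, T, cond_dim)

class CrossModalRefiner(nn.Module):
    """Generic cross-attention block letting adapted text features attend
    to image features -- a stand-in for the paper's refiner, whose exact
    architecture is not specified in the abstract."""
    def __init__(self, cond_dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(cond_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(cond_dim)

    def forward(self, text_feats, image_feats):
        # Text features query image features; residual + norm.
        attended, _ = self.attn(text_feats, image_feats, image_feats)
        return self.norm(text_feats + attended)

# Toy shapes: batch of 2, 77 text tokens, 256 image patches.
llm_states = torch.randn(2, 77, 4096)
image_feats = torch.randn(2, 256, 1024)
adapter = LLMAdapter()
refiner = CrossModalRefiner()
cond = refiner(adapter(llm_states), image_feats)
print(cond.shape)  # torch.Size([2, 77, 1024])
```

The aligned features `cond` would then serve as the conditioning input to the frozen diffusion model's cross-attention layers; only the small adapter and refiner need training, which is consistent with the abstract's claim of reduced training time.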