X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
December 2, 2024
Authors: Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
In-context generation is a key component of large language models' (LLMs)
open-task generalization capability. By leveraging a few examples as context,
LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in
auto-regressive vision-language models (VLMs) built upon LLMs have showcased
impressive performance in text-to-image generation. However, the potential of
in-context learning for general image generation tasks remains largely
unexplored. To address this, we introduce X-Prompt, a purely auto-regressive
large vision-language model designed to deliver competitive performance across
a wide range of both seen and unseen image generation tasks, all within a
unified in-context learning framework. X-Prompt incorporates a specialized
design that efficiently compresses valuable features from in-context examples,
supporting longer in-context token sequences and improving its ability to
generalize to unseen tasks. A unified training task for both text and image
prediction enables X-Prompt to handle general image generation with enhanced
task awareness from in-context examples. Extensive experiments validate the
model's performance across diverse seen image generation tasks and its capacity
to generalize to previously unseen tasks.