
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

December 2, 2024
Authors: Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
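The abstract gives no implementation details, but the central idea it describes, compressing in-context examples into a compact representation so that more of them fit in the auto-regressive prefix, can be illustrated with a short sketch. Everything below is a hypothetical illustration under stated assumptions, not X-Prompt's actual architecture: the `ContextCompressor` module, the cross-attention pooling mechanism, the token counts, and the dimensions are all invented for this example.

```python
# Minimal sketch of in-context image generation with compressed context.
# Assumptions (not from the paper): images are tokenized into IMG_TOKENS
# discrete codes, and in-context example tokens are pooled into a fixed
# budget of COMPRESSED slots via cross-attention before being prepended
# to the query for auto-regressive decoding.
import torch
import torch.nn as nn

IMG_TOKENS = 256   # tokens per image after a VQ-style tokenizer (assumed)
D_MODEL = 512      # embedding width (assumed)
COMPRESSED = 32    # fixed budget of compressed context tokens (assumed)


class ContextCompressor(nn.Module):
    """Pool variable-length in-context tokens into a fixed set of learned
    query slots via cross-attention (an assumed mechanism, for illustration)."""

    def __init__(self, d_model: int, n_slots: int):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (batch, ctx_len, d_model) -> (batch, n_slots, d_model)
        q = self.slots.unsqueeze(0).expand(ctx.size(0), -1, -1)
        pooled, _ = self.attn(q, ctx, ctx)
        return pooled


def build_prefix(example_pairs, instruction, query, compressor):
    """Concatenate compressed in-context examples, the text instruction,
    and the query image into one auto-regressive prefix."""
    ctx = torch.cat(example_pairs, dim=1)   # all raw example image tokens
    compressed = compressor(ctx)            # fixed-size context summary
    return torch.cat([compressed, instruction, query], dim=1)


if __name__ == "__main__":
    batch = 1
    # Random stand-ins for embedded tokens; a real model would embed
    # VQ image codes and text tokens here.
    pair = torch.randn(batch, 2 * IMG_TOKENS, D_MODEL)   # (source, target) example
    instruction = torch.randn(batch, 16, D_MODEL)        # e.g. "deblur this image"
    query = torch.randn(batch, IMG_TOKENS, D_MODEL)

    compressor = ContextCompressor(D_MODEL, COMPRESSED)
    prefix = build_prefix([pair, pair], instruction, query, compressor)
    print(prefix.shape)  # (1, 32 + 16 + 256, 512) = (1, 304, 512)
```

The point of the fixed slot budget in this sketch is the scaling behavior: K raw in-context pairs would cost K * 2 * IMG_TOKENS prefix tokens, while the pooled summary always costs COMPRESSED tokens, which is one plausible way a model could support longer in-context sequences as the abstract claims.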

