In-Context LoRA for Diffusion Transformers
October 31, 2024
作者: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou
cs.AI
Abstract
Recent research arXiv:2410.15027 has explored the use of diffusion
transformers (DiTs) for task-agnostic image generation by simply concatenating
attention tokens across images. However, despite substantial computational
resources, the fidelity of the generated images remains suboptimal. In this
study, we reevaluate and streamline this framework by hypothesizing that
text-to-image DiTs inherently possess in-context generation capabilities,
requiring only minimal tuning to activate them. Through diverse task
experiments, we qualitatively demonstrate that existing text-to-image DiTs can
effectively perform in-context generation without any tuning. Building on this
insight, we propose a remarkably simple pipeline to leverage the in-context
abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint
captioning of multiple images, and (3) apply task-specific LoRA tuning using
small datasets (e.g., 20 to 100 samples) instead of full-parameter tuning
with large datasets. We name our models In-Context LoRA (IC-LoRA). This
approach requires no modifications to the original DiT models, only changes to
the training data. Remarkably, our pipeline generates high-fidelity image sets
that better adhere to prompts. While task-specific in terms of tuning data, our
framework remains task-agnostic in architecture and pipeline, offering a
powerful tool for the community and providing valuable insights for further
research on product-level task-agnostic generation systems. We release our
code, data, and models at https://github.com/ali-vilab/In-Context-LoRA.
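The sketch below illustrates the data-preparation idea described in the abstract: instead of concatenating attention tokens across images, a small set of related images is stitched into a single panel and captioned jointly, and the resulting pairs are then used for task-specific LoRA tuning of an unmodified text-to-image DiT. This is not the authors' released code; the function name, caption format, and single-row layout are illustrative assumptions.

```python
# Minimal sketch of the "concatenate images, caption jointly" step, assuming a
# simple single-row panel layout and a [IMAGEi] caption convention (both hypothetical).
from PIL import Image


def build_in_context_sample(image_paths, per_image_captions, overall_caption):
    """Concatenate images side by side and merge their captions into one prompt.

    Returns a (panel_image, joint_caption) pair; per the abstract, roughly
    20 to 100 such pairs per task are used for LoRA tuning.
    """
    images = [Image.open(p).convert("RGB") for p in image_paths]

    # Resize every image to a common height so they fit in a single row.
    height = min(img.height for img in images)
    resized = [
        img.resize((max(1, round(img.width * height / img.height)), height))
        for img in images
    ]

    # Paste the resized images left to right onto one canvas.
    panel = Image.new("RGB", (sum(img.width for img in resized), height))
    x_offset = 0
    for img in resized:
        panel.paste(img, (x_offset, 0))
        x_offset += img.width

    # Joint captioning: one overall description followed by per-image descriptions.
    joint_caption = overall_caption + " " + " ".join(
        f"[IMAGE{i + 1}] {caption}" for i, caption in enumerate(per_image_captions)
    )
    return panel, joint_caption
```

Each (panel, joint caption) pair can then be fed to an ordinary text-to-image LoRA fine-tuning loop; no change to the DiT architecture itself is required, only to the training data, which is the point the abstract emphasizes.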