In-Context LoRA for Diffusion Transformers

October 31, 2024
Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou
cs.AI

Abstract

Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., 20–100 samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA.
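The data-side nature of the pipeline can be made concrete with a small sketch of steps (1) and (2): paste several related images into a single panel and give that panel one joint caption with per-image markers. This is a minimal, hypothetical illustration only; the file names, panel height, and caption template below are assumptions and do not reflect the paper's released data format.

```python
# Minimal sketch of IC-LoRA-style data preparation (illustrative assumptions only):
# concatenate the images themselves rather than attention tokens, and write one
# joint caption describing every sub-image in the panel.

from PIL import Image


def make_panel(image_paths, panel_height=1024):
    """Resize each image to a common height and paste them side by side."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    resized = [
        im.resize((int(im.width * panel_height / im.height), panel_height))
        for im in images
    ]
    total_width = sum(im.width for im in resized)
    panel = Image.new("RGB", (total_width, panel_height))
    x = 0
    for im in resized:
        panel.paste(im, (x, 0))
        x += im.width
    return panel


def joint_caption(per_image_captions):
    """Compose a single prompt that describes all sub-images in the panel."""
    parts = [f"[IMAGE {i + 1}] {c}" for i, c in enumerate(per_image_captions)]
    return "A set of related images. " + " ".join(parts)


if __name__ == "__main__":
    # Hypothetical example files and captions for one training sample.
    panel = make_panel(["shot_front.jpg", "shot_side.jpg", "shot_detail.jpg"])
    panel.save("train_0001.jpg")
    print(joint_caption([
        "front view of a ceramic mug on a wooden table",
        "side view of the same mug in soft daylight",
        "close-up of the mug's glazed handle",
    ]))
```

Step (3) then amounts to ordinary task-specific LoRA fine-tuning of an existing text-to-image DiT on these panel-and-joint-caption pairs (on the order of 20 to 100 per task, per the abstract) using a standard trainer, which is why no changes to the model architecture or generation pipeline are required.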
