In-Context LoRA for Diffusion Transformers

Abstract

Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through experiments on diverse tasks, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., 20 to 100 samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA.
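The data-side steps of the pipeline, (1) concatenating images and (2) joint captioning, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: images are stood in by nested lists of pixel values so the example needs no external libraries, and the function names and the "[IMAGE n]" caption markers are hypothetical conventions chosen for this sketch. Step (3), task-specific LoRA tuning on a small set of such samples, is not shown.

```python
from typing import Dict, List, Sequence, Union

# Illustrative stand-in for an image: rows of grayscale pixel values.
Image = List[List[int]]


def concat_images_horizontally(images: Sequence[Image]) -> Image:
    """Place same-height images side by side to form one composite image."""
    if not images:
        raise ValueError("need at least one image")
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must have the same height")
    # Row i of the composite is row i of every panel, joined left to right.
    return [[px for img in images for px in img[i]] for i in range(height)]


def build_joint_caption(overall: str, panel_captions: Sequence[str]) -> str:
    """Merge per-image captions into one prompt describing the composite.

    The "[IMAGE n]" markers are an assumed convention for this sketch.
    """
    parts = [overall.strip()]
    for n, caption in enumerate(panel_captions, start=1):
        parts.append(f"[IMAGE {n}] {caption.strip()}")
    return " ".join(parts)


def make_training_sample(
    images: Sequence[Image],
    overall: str,
    panel_captions: Sequence[str],
) -> Dict[str, Union[Image, str]]:
    """Build one (composite image, joint caption) pair for LoRA tuning."""
    if len(images) != len(panel_captions):
        raise ValueError("need exactly one caption per image")
    return {
        "image": concat_images_horizontally(images),
        "caption": build_joint_caption(overall, panel_captions),
    }


if __name__ == "__main__":
    dark = [[0, 0], [0, 0]]
    bright = [[255, 255], [255, 255]]
    sample = make_training_sample(
        [dark, bright],
        "A two-panel set showing the same object.",
        ["Dark version.", "Bright version."],
    )
    print(sample["caption"])
```

Because each training sample is an ordinary image paired with an ordinary caption, a text-to-image model can consume it unchanged, which matches the abstract's point that only the training data changes and the DiT architecture does not.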
