OmniGen: Unified Image Generation

Abstract

In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen does not require additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks, such as edge detection and human pose recognition, by transforming them into image generation tasks. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly than existing diffusion models: complex tasks can be accomplished through instructions without extra preprocessing steps (e.g., human pose estimation), which significantly simplifies the image generation workflow. 3) Knowledge Transfer: By learning in a unified format, OmniGen effectively transfers knowledge across different tasks, handles unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and several issues remain unresolved. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.
