IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

Abstract

With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across a variety of complex tasks, raising the question of whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks such as semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across these expanding domains. To evaluate them thoroughly, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, in particular the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, and underscores the expanding applications and potential of T2I models as foundational AI tools. This study provides insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
