GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
April 3, 2025
Authors: Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan
cs.AI
Abstract
The recent breakthroughs in OpenAI's GPT-4o model have demonstrated
surprisingly good capabilities in image generation and editing, resulting in
significant excitement in the community. This technical report presents the
first-look evaluation benchmark (named GPT-ImgEval), quantitatively and
qualitatively diagnosing GPT-4o's performance across three critical dimensions:
(1) generation quality, (2) editing proficiency, and (3) world
knowledge-informed semantic synthesis. Across all three tasks, GPT-4o
demonstrates strong performance, significantly surpassing existing methods in
both image generation control and output quality, while also showcasing
exceptional knowledge reasoning capabilities. Furthermore, based on
GPT-4o's generated data, we propose a classification-model-based approach to
investigate the underlying architecture of GPT-4o, where our empirical results
suggest that it pairs an auto-regressive (AR) model with a
diffusion-based head for image decoding, rather than using a VAR-like
architecture. We also provide a full speculative account of GPT-4o's overall
architecture. In addition, we conduct a series of analyses to identify and
visualize GPT-4o's specific limitations and the synthetic artifacts commonly
observed in its image generation. We also present a comparative study of
multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the
safety implications of GPT-4o's outputs, particularly their detectability by
existing image forensic models. We hope that our work can offer valuable
insight and provide a reliable benchmark to guide future research, foster
reproducibility, and accelerate innovation in the field of image generation and
beyond. The code and datasets used for evaluating GPT-4o can be found at
https://github.com/PicoTrex/GPT-ImgEval.