An Empirical Study of GPT-4o Image Generation Capabilities
April 8, 2025
Authors: Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi
cs.AI
Abstract
The landscape of image generation has rapidly evolved, from early GAN-based
approaches to diffusion models and, most recently, to unified generative
architectures that seek to bridge understanding and generation tasks. Although
recent advances, especially GPT-4o, have demonstrated the feasibility of
high-fidelity multimodal generation, their architectural design remains
undisclosed. This prompts the question of whether image and text generation
have already been successfully integrated into a unified framework in these
methods. In this work, we conduct an empirical study of GPT-4o's
image generation capabilities, benchmarking it against leading open-source and
commercial models. Our evaluation covers four main categories (text-to-image,
image-to-image, image-to-3D, and image-to-X generation) spanning more than 20
tasks. Our analysis highlights the strengths and limitations of
GPT-4o under various settings, and situates it within the broader evolution of
generative modeling. Through this investigation, we identify promising
directions for future unified generative models, emphasizing the role of
architectural design and data scaling.