An Empirical Study of GPT-4o Image Generation Capabilities
April 8, 2025
Authors: Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi
cs.AI
Abstract
The landscape of image generation has rapidly evolved, from early GAN-based
approaches to diffusion models and, most recently, to unified generative
architectures that seek to bridge understanding and generation tasks. Recent
advances, especially GPT-4o, have demonstrated the feasibility of
high-fidelity multimodal generation, but their architectural design remains
undisclosed. This prompts the question of whether image and text
generation have already been successfully integrated into a unified framework
in those methods. In this work, we conduct an empirical study of GPT-4o's
image generation capabilities, benchmarking it against leading open-source and
commercial models. Our evaluation covers four main categories:
text-to-image, image-to-image, image-to-3D, and image-to-X generation, spanning
more than 20 tasks. Our analysis highlights the strengths and limitations of
GPT-4o under various settings, and situates it within the broader evolution of
generative modeling. Through this investigation, we identify promising
directions for future unified generative models, emphasizing the role of
architectural design and data scaling.