An Empirical Study of GPT-4o Image Generation Capabilities

Abstract

The landscape of image generation has evolved rapidly, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, but their architectural designs remain undisclosed. This raises the question of whether image and text generation have already been successfully integrated into a unified framework in these methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories (text-to-image, image-to-image, image-to-3D, and image-to-X generation) spanning more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.
