

Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

April 9, 2025
Authors: Ning Li, Jingran Zhang, Justin Cui
cs.AI

Abstract

OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis--seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence--remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o's strong capabilities in image generation and editing, our evaluation reveals GPT-4o's persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o's unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.
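For readers who want to run a probe along the first evaluation dimension (global instruction adherence), a minimal sketch using the OpenAI Python SDK is shown below. The model identifier ("gpt-image-1") and the example prompt are illustrative assumptions, not taken from the paper; substitute whatever image-capable model and test prompts your setup exposes.

```python
# Minimal sketch: probing global instruction adherence with a prompt that
# requires world knowledge (UK traffic rules) rather than a literal rendering.
# Assumes the OpenAI Python SDK (openai>=1.0) and the OPENAI_API_KEY env var.
import base64
from openai import OpenAI

client = OpenAI()

# The model should infer that a car driven legally in the UK is right-hand
# drive and travels on the left side of the road.
prompt = (
    "A car being driven legally on a public road in the United Kingdom, "
    "photographed from the driver's side with the interior visible."
)

result = client.images.generate(model="gpt-image-1", prompt=prompt, n=1)

# gpt-image-1 returns base64-encoded image data.
with open("uk_car_probe.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```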

