
Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

April 9, 2025
作者: Ning Li, Jingran Zhang, Justin Cui
cs.AI

Abstract

OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis (seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence) remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o's strong capabilities in image generation and editing, our evaluation reveals GPT-4o's persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o's unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.
