WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
March 10, 2025
Authors: Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, Li Yuan
cs.AI
Abstract
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains spanning cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models), our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
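The abstract does not specify how WiScore aggregates its judgments, so the following is only a minimal illustrative sketch of a weighted knowledge-image alignment score. The sub-score names (consistency, realism, aesthetics), the 0-2 grading scale, and the weights are hypothetical placeholders, not taken from the paper or the official repository.

```python
# Hypothetical WiScore-style aggregation (illustrative only, not the official
# implementation): each generated image is assumed to receive graded sub-scores
# for knowledge consistency, realism, and aesthetic quality, which are combined
# with fixed weights and averaged over all prompts.

from dataclasses import dataclass
from typing import Iterable


@dataclass
class PromptResult:
    consistency: float  # knowledge-image alignment grade, assumed 0-2
    realism: float      # visual realism grade, assumed 0-2
    aesthetics: float   # aesthetic quality grade, assumed 0-2


def wiscore(results: Iterable[PromptResult],
            w_consistency: float = 0.7,
            w_realism: float = 0.2,
            w_aesthetics: float = 0.1,
            max_grade: float = 2.0) -> float:
    """Average the weighted per-prompt grades, normalized to [0, 1].

    Weights and grading range are assumptions for illustration.
    """
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    total = 0.0
    for r in results:
        total += (w_consistency * r.consistency
                  + w_realism * r.realism
                  + w_aesthetics * r.aesthetics)
    # Weights sum to 1.0, each grade is at most max_grade, so this lands in [0, 1].
    return total / (len(results) * max_grade)


# Example: two prompts graded by a judge on the assumed 0-2 scale per dimension.
print(wiscore([PromptResult(2.0, 1.0, 2.0), PromptResult(1.0, 2.0, 1.0)]))
```

Under these assumptions, a model that satisfies the world-knowledge requirement of every prompt (consistency at the top of the scale) dominates the score, while realism and aesthetics contribute smaller corrections.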