Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

Abstract

Many real-world user queries (e.g. "How do I make egg fried rice?") could benefit from systems capable of generating responses with both textual steps and accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses at four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers, so that models can be evaluated effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly at generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both the block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent that employs a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.
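To make the idea of a scene graph over interleaved blocks more concrete, the following is a minimal sketch, not the ISG implementation. The names `Block`, `Relation`, `structure_matches`, and `block_questions` are hypothetical and chosen only for illustration. It assumes a response is a flat sequence of text and image blocks, that the structural level can be approximated by comparing the order of block types, and that block-level relations can be phrased as yes/no questions for a judge to answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One unit of an interleaved response (hypothetical type)."""
    kind: str     # "text" or "image"
    content: str  # text body, or a placeholder description for an image


@dataclass(frozen=True)
class Relation:
    """A directed edge between two blocks, stored by block index (hypothetical type)."""
    subject: int
    predicate: str
    obj: int


def structure_matches(blocks, expected_kinds):
    """Structural check: does the response follow the expected text/image order?"""
    return [b.kind for b in blocks] == list(expected_kinds)


def block_questions(blocks, relations):
    """Phrase each relation as a yes/no question that a judge could answer."""
    return [
        f"Does block {r.subject} ({blocks[r.subject].kind}) {r.predicate} "
        f"block {r.obj} ({blocks[r.obj].kind})?"
        for r in relations
    ]


# Illustrative response to the cookbook-style query from the abstract.
response = [
    Block("text", "Step 1: beat the eggs."),
    Block("image", "<image of beaten eggs>"),
    Block("text", "Step 2: fry the rice."),
    Block("image", "<image of rice in a wok>"),
]
relations = [Relation(1, "illustrate", 0), Relation(3, "illustrate", 2)]

print(structure_matches(response, ["text", "image", "text", "image"]))
for question in block_questions(response, relations):
    print(question)
```

Keeping relations as index pairs rather than object references is one possible design choice: it lets the same graph be checked against different candidate responses, as long as they share the expected block order.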

