Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

Abstract

Many real-world user queries (e.g., "How to make egg fried rice?") could benefit from systems capable of generating responses that pair textual steps with accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses at four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly at generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both the block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.
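As a rough illustration of how an interleaved response might be represented as a graph of text and image blocks, the following Python sketch models blocks as nodes and cross-block relationships as labeled edges, and runs a simple structural check. Every name here (`Block`, `InterleavedSceneGraph`, `structural_match`, the `illustrated_by` label, the file names) is hypothetical and chosen for illustration only; this is not ISG's actual implementation, and the check shown covers only a simplified notion of the structural level.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """One unit of an interleaved response (hypothetical representation)."""
    kind: str     # "text" or "image"
    content: str  # the text itself, or an identifier for the image


@dataclass
class InterleavedSceneGraph:
    """Blocks as nodes, labeled relations between blocks as edges (assumed layout)."""
    blocks: List[Block] = field(default_factory=list)
    # Each relation is (source block index, relation label, target block index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def structure(self) -> List[str]:
        """Return the sequence of block kinds, e.g. ["text", "image", ...]."""
        return [block.kind for block in self.blocks]


def structural_match(graph: InterleavedSceneGraph, expected: List[str]) -> bool:
    """Simplified structural check: does the block order match what was requested?"""
    return graph.structure() == expected


# A cookbook-style response: each textual step is followed by its image.
graph = InterleavedSceneGraph(
    blocks=[
        Block("text", "Step 1: beat the eggs."),
        Block("image", "step1.png"),
        Block("text", "Step 2: fry the rice."),
        Block("image", "step2.png"),
    ],
    relations=[(0, "illustrated_by", 1), (2, "illustrated_by", 3)],
)
```

In this sketch, `structural_match(graph, ["text", "image", "text", "image"])` checks only the ordering of modalities; block-level and image-specific evaluation would additionally need to inspect the content attached to each relation.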
