Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment
November 26, 2024
Authors: Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, Ranjay Krishna
cs.AI
Abstract
Many real-world user queries (e.g., "How to make egg fried rice?") could
benefit from systems capable of generating responses with both textual steps
and accompanying images, similar to a cookbook. Models designed to generate
interleaved text and images face challenges in ensuring consistency within and
across these modalities. To address these challenges, we present ISG, a
comprehensive evaluation framework for interleaved text-and-image generation.
ISG leverages a scene graph structure to capture relationships between text and
image blocks, evaluating responses on four levels of granularity: holistic,
structural, block-level, and image-specific. This multi-tiered evaluation
allows for a nuanced assessment of consistency, coherence, and accuracy, and
provides interpretable question-answer feedback. In conjunction with ISG, we
introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8
categories and 21 subcategories. This benchmark dataset includes complex
language-vision dependencies and golden answers to evaluate models effectively
on vision-centric tasks such as style transfer, a challenging area for current
models. Using ISG-Bench, we demonstrate that recent unified vision-language
models perform poorly on generating interleaved content. While compositional
approaches that combine separate language and image models show a 111%
improvement over unified models at the holistic level, their performance
remains suboptimal at both block and image levels. To facilitate future work,
we develop ISG-Agent, a baseline agent employing a "plan-execute-refine"
pipeline to invoke tools, achieving a 122% performance improvement.
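As a rough illustration of the idea of a scene graph over text and image blocks, the sketch below models a response as a list of blocks plus labeled edges between them. It then derives a structural check and a set of block-level questions. This is a minimal sketch under stated assumptions, not the ISG implementation: the names `Block`, `InterleavedGraph`, `structure_matches`, `block_questions`, and `build_example`, as well as the relation labels, are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    """One unit of an interleaved response (hypothetical representation)."""
    index: int
    modality: str  # "text" or "image"
    content: str   # the text itself, or a placeholder describing the image


@dataclass
class InterleavedGraph:
    """Blocks are nodes; edges are (source index, relation label, target index)."""
    blocks: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_block(self, modality: str, content: str) -> int:
        idx = len(self.blocks)
        self.blocks.append(Block(idx, modality, content))
        return idx

    def relate(self, src: int, relation: str, dst: int) -> None:
        self.edges.append((src, relation, dst))

    def structure(self) -> list:
        # The ordered sequence of modalities, e.g. ["text", "image", ...].
        return [b.modality for b in self.blocks]


def structure_matches(graph: InterleavedGraph, expected: list) -> bool:
    """Assumed structural-level check: does the response interleave as required?"""
    return graph.structure() == expected


def block_questions(graph: InterleavedGraph) -> list:
    """Turn each edge into a yes/no question that a judge model could answer."""
    questions = []
    for src, relation, dst in graph.edges:
        s, d = graph.blocks[src], graph.blocks[dst]
        questions.append(
            f"Does {s.modality} block {src} satisfy '{relation}' "
            f"with respect to {d.modality} block {dst}?"
        )
    return questions


def build_example() -> InterleavedGraph:
    # Toy response to the egg fried rice query from the abstract.
    g = InterleavedGraph()
    t1 = g.add_block("text", "Step 1: Beat the eggs.")
    i1 = g.add_block("image", "<image: beaten eggs in a bowl>")
    t2 = g.add_block("text", "Step 2: Stir-fry the rice with the eggs.")
    i2 = g.add_block("image", "<image: rice and eggs in a wok>")
    g.relate(i1, "illustrates", t1)
    g.relate(i2, "illustrates", t2)
    g.relate(t2, "follows", t1)
    return g
```

Representing relations as explicit edges keeps each check local to a pair of blocks, which is one plausible way to obtain the interpretable question-answer feedback the abstract mentions. The holistic and image-specific levels are not modeled here.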