GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content, which requires integrated multimodal understanding and generation abilities, remains a challenge. While progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods because of their limited data size and diversity. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. OpenING is open-sourced at https://opening.github.io.
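
As a rough illustration of what an agreement rate between a judge model and human annotators measures, the sketch below computes the fraction of comparisons on which the two give the same verdict. It assumes pairwise A-versus-B comparisons with an optional tie, which the abstract does not specify; the function name `agreement_rate` and the sample verdicts are hypothetical and are not taken from the OpenING code or data.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Return the fraction of comparisons where judge and human verdicts match."""
    if len(judge) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    # Count positions where both sides chose the same outcome.
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts for five A-vs-B comparisons ("A", "B", or "tie").
judge_verdicts = ["A", "B", "A", "tie", "B"]
human_verdicts = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge_verdicts, human_verdicts)
print(f"{rate:.2%}")  # prints "80.00%"
```

Under this reading, the reported 82.42% would mean the judge and the human annotators chose the same outcome on roughly four of every five comparisons; the exact protocol used in the paper may differ.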
