Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Abstract

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in general visual editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: temporal, causal, spatial, and logical reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses three dimensions, Instruction Reasoning, Appearance Consistency, and Visual Plausibility, using both human judges and an LMM-as-a-judge approach. Our experiments show that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, which remain underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though the benchmark is still in its early stages, we are committed to continuously expanding and refining it to support more comprehensive, reliable, and scalable evaluation of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.
