Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Abstract

We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises, a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
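The Best-of-N selection strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `best_of_n`, `edit_fn`, and `score_fn` are hypothetical names introduced here, standing in for any image editing model and any scalar quality score (for example, one produced by a VLM-based evaluator). The sketch assumes candidates differ only by random seed and that a higher score is better.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")


def best_of_n(
    image: Image,
    instruction: str,
    edit_fn: Callable[[Image, str, int], Image],    # hypothetical: (image, instruction, seed) -> edited image
    score_fn: Callable[[Image, Image, str], float], # hypothetical: (source, edited, instruction) -> score
    n: int = 4,
) -> Tuple[Image, float]:
    """Generate n candidate edits with different seeds and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One candidate per seed; the editing model itself is treated as a black box.
    candidates = [edit_fn(image, instruction, seed) for seed in range(n)]
    # Score every candidate against the source image and the instruction.
    scores = [score_fn(image, candidate, instruction) for candidate in candidates]
    # Pick the index of the maximum score (first one wins on ties).
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```

For the step-by-step sequential setting, the same function could in principle be applied at each atomic step, feeding the selected output into the next step; whether the benchmark selects per step or only on the final image is not stated in the abstract.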

