Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark
April 17, 2025
Authors: Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie
cs.AI
Abstract
We introduce Complex-Edit, a comprehensive benchmark designed to
systematically evaluate instruction-based image editing models across
instructions of varying complexity. To develop this benchmark, we harness
GPT-4o to automatically collect a diverse set of editing instructions at scale.
Our approach follows a well-structured "Chain-of-Edit" pipeline: we first
generate individual atomic editing tasks independently and then integrate them
to form cohesive, complex instructions. Additionally, we introduce a suite of
metrics to assess various aspects of editing performance, along with a
VLM-based auto-evaluation pipeline that supports large-scale assessments. Our
benchmark yields several notable insights: 1) Open-source models significantly
underperform relative to proprietary, closed-source models, with the
performance gap widening as instruction complexity increases; 2) Increased
instructional complexity primarily impairs the models' ability to retain key
elements from the input images and to preserve the overall aesthetic quality;
3) Decomposing a complex instruction into a sequence of atomic steps, executed
in a step-by-step manner, substantially degrades performance across multiple
metrics; 4) A straightforward Best-of-N selection strategy improves results for
both direct editing and the step-by-step sequential approach; and 5) We observe
a "curse of synthetic data": when synthetic data is involved in model
training, the edited images from such models tend to appear increasingly
synthetic as the complexity of the editing instructions rises -- a phenomenon
that intriguingly also manifests in the latest GPT-4o outputs.
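
The Best-of-N selection strategy mentioned in finding 4 can be illustrated with a minimal sketch. This is not the authors' implementation: `best_of_n`, `edit_fn`, and `score_fn` are hypothetical names, and the sketch assumes only that an editing model can be sampled several times with different seeds and that some scorer (for example, a VLM-based evaluator) returns a number where higher is better.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")


def best_of_n(
    image: Image,
    instruction: str,
    edit_fn: Callable[[Image, str, int], Image],     # hypothetical: (image, instruction, seed) -> edited image
    score_fn: Callable[[Image, Image, str], float],  # hypothetical: (input, output, instruction) -> score
    n: int = 4,
) -> Tuple[Image, float]:
    """Sample n candidate edits and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n candidates, varying only the seed.
    candidates = [edit_fn(image, instruction, seed) for seed in range(n)]
    # Score every candidate against the original image and the instruction.
    scores = [score_fn(image, candidate, instruction) for candidate in candidates]
    # Keep the candidate with the highest score (first one wins on ties).
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]
```

For the step-by-step setting, the same helper could be called once per atomic instruction, feeding each selected output into the next step; whether the paper applies selection per step or only on the final result is not stated in the abstract.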