Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark
April 17, 2025
Authors: Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie
cs.AI
Abstract
We introduce Complex-Edit, a comprehensive benchmark designed to
systematically evaluate instruction-based image editing models across
instructions of varying complexity. To develop this benchmark, we harness
GPT-4o to automatically collect a diverse set of editing instructions at scale.
Our approach follows a well-structured "Chain-of-Edit" pipeline: we first
generate individual atomic editing tasks independently and then integrate them
to form cohesive, complex instructions. Additionally, we introduce a suite of
metrics to assess various aspects of editing performance, along with a
VLM-based auto-evaluation pipeline that supports large-scale assessments. Our
benchmark yields several notable insights: 1) Open-source models significantly
underperform relative to proprietary, closed-source models, with the
performance gap widening as instruction complexity increases; 2) Increased
instructional complexity primarily impairs the models' ability to retain key
elements from the input images and to preserve the overall aesthetic quality;
3) Decomposing a complex instruction into a sequence of atomic steps, executed
in a step-by-step manner, substantially degrades performance across multiple
metrics; 4) A straightforward Best-of-N selection strategy improves results for
both direct editing and the step-by-step sequential approach; and 5) We observe
a "curse of synthetic data": when synthetic data is involved in model
training, the edited images from such models tend to appear increasingly
synthetic as the complexity of the editing instructions rises -- a phenomenon
that intriguingly also manifests in the latest GPT-4o outputs.
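The "Chain-of-Edit" pipeline described in the abstract — generating atomic edits independently, then integrating them into one complex instruction — could be sketched as follows. This is a minimal illustration, not the authors' implementation: the `llm` callable stands in for a GPT-4o API call, and the prompt wording and edit categories are hypothetical.

```python
def generate_atomic_instruction(image_description, category, llm):
    # Hypothetical: ask an LLM (e.g. GPT-4o) for one atomic edit
    # of a given category for the described image.
    prompt = (f"Given an image of: {image_description}\n"
              f"Write one atomic '{category}' editing instruction.")
    return llm(prompt)

def compose_complex_instruction(atomic_instructions, llm):
    # Hypothetical: merge independent atomic edits into a single
    # cohesive, complex instruction.
    joined = "\n".join(f"- {s}" for s in atomic_instructions)
    prompt = ("Integrate the following atomic edits into one coherent "
              f"complex editing instruction:\n{joined}")
    return llm(prompt)

def chain_of_edit(image_description, categories, llm):
    # Step 1: generate atomic edits independently.
    atoms = [generate_atomic_instruction(image_description, c, llm)
             for c in categories]
    # Step 2: integrate them into a complex instruction.
    return compose_complex_instruction(atoms, llm)
```

Varying the number of atomic edits that get composed is what makes the resulting instruction's complexity controllable.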
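The Best-of-N selection strategy mentioned in finding 4 amounts to sampling several candidate edits and keeping the one a scorer prefers. A minimal sketch, assuming hypothetical `edit_image` and `score_edit` stand-ins for an editing model and the VLM-based evaluator (the deterministic pseudo-scores below are placeholders):

```python
import random

def edit_image(image, instruction, seed):
    # Hypothetical stand-in for one stochastic editing-model call.
    return f"{image}-edited-{seed}"

def score_edit(image, instruction):
    # Hypothetical stand-in for the VLM-based evaluator: returns a
    # scalar quality score. Seeded so the sketch is reproducible.
    random.seed(hash((image, instruction)) % (2 ** 32))
    return random.random()

def best_of_n(image, instruction, n=4):
    # Sample n candidate edits, then keep the highest-scoring one.
    candidates = [edit_image(image, instruction, seed) for seed in range(n)]
    return max(candidates, key=lambda c: score_edit(c, instruction))
```

The same selection wrapper applies whether each candidate comes from direct editing or from executing the atomic steps sequentially, which is why the abstract reports gains in both settings.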