

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

September 17, 2024
Authors: Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe
cs.AI

Abstract

Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
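The inference-pipeline flaw the abstract alludes to concerns how diffusion timesteps are spaced at few-step inference: with the common "leading" spacing, a single-step run queries the denoiser at a timestep near zero while feeding it pure noise, a train/test mismatch; "trailing" spacing instead starts at the final training timestep. The sketch below illustrates the two spacings with hypothetical helper functions, assuming T = 1000 training timesteps as in Stable Diffusion; the leading/trailing terminology follows common DDIM scheduler implementations, and this is an illustration, not the authors' code.

```python
import numpy as np

T = 1000  # number of training timesteps (as in Stable Diffusion)

def leading_timesteps(num_steps: int) -> list[int]:
    # "Leading" spacing: start at t=0 and step upward, then reverse.
    # The largest timestep used is T - T // num_steps, never T - 1.
    step = T // num_steps
    return list((np.arange(num_steps) * step)[::-1])

def trailing_timesteps(num_steps: int) -> list[int]:
    # "Trailing" spacing: step down from T, so the first timestep
    # is always T - 1 and matches the pure-noise input at inference.
    return list((np.arange(T, 0, -T / num_steps)).round().astype(int) - 1)

# With a single inference step, the two spacings diverge maximally:
print(leading_timesteps(1))   # [0]   -> model told "almost clean", fed pure noise
print(trailing_timesteps(1))  # [999] -> timestep consistent with the noise level
```

With more steps the gap shrinks (e.g., 4 steps give [750, 500, 250, 0] versus [999, 749, 499, 249]), which is why the mismatch only becomes crippling in the single-step regime.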
