Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Abstract

Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, the high computational demands of multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that had so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200 times faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and obtain a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior work.
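The abstract does not say which task-specific losses are used. Purely as an illustration, the sketch below shows two losses that are common for these tasks: a scale-and-shift-invariant error for depth and a mean angular error for surface normals. The choice of losses and all function names here are assumptions made for this example, not the paper's stated method.

```python
import math


def align_scale_shift(pred, target):
    """Least-squares scale s and shift t minimizing sum((s*pred + t - target)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_t) for p, g in zip(pred, target))
    # A constant prediction has no usable scale; fall back to shift only.
    scale = cov / var_p if var_p > 0 else 0.0
    shift = mean_t - scale * mean_p
    return scale, shift


def affine_invariant_depth_loss(pred, target):
    """Mean absolute error after aligning pred to target in scale and shift.

    Hypothetical example of a depth loss; inputs are flat lists of depths.
    """
    s, t = align_scale_shift(pred, target)
    return sum(abs(s * p + t - g) for p, g in zip(pred, target)) / len(pred)


def angular_normal_loss(pred_normals, target_normals):
    """Mean angle in radians between predicted and target normal vectors.

    Hypothetical example of a normals loss; inputs are lists of 3-tuples.
    """
    total = 0.0
    for p, g in zip(pred_normals, target_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        # Clamp to guard against rounding just outside [-1, 1].
        cos = max(-1.0, min(1.0, dot / norm))
        total += math.acos(cos)
    return total / len(pred_normals)
```

Because the depth loss first aligns the prediction to the target, a prediction that differs from the ground truth only by a global scale and offset incurs zero loss, which is why such losses suit relative depth estimation across datasets.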

