Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
September 26, 2024
Authors: Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, Ying-Cong Chen
cs.AI
Abstract
Leveraging the visual priors of pre-trained text-to-image diffusion models
offers a promising solution to enhance zero-shot generalization in dense
prediction tasks. However, existing methods often uncritically use the original
diffusion formulation, which may not be optimal due to the fundamental
differences between dense prediction and image generation. In this paper, we
provide a systematic analysis of the diffusion formulation for dense
prediction, focusing on both quality and efficiency. We find that the
original parameterization type for image generation, which learns to predict
noise, is harmful for dense prediction; the multi-step noising/denoising
diffusion process is also unnecessary and challenging to optimize. Based on
these insights, we introduce Lotus, a diffusion-based visual foundation model
with a simple yet effective adaptation protocol for dense prediction.
Specifically, Lotus is trained to directly predict annotations instead of
noise, thereby avoiding harmful variance. We also reformulate the diffusion
process into a single-step procedure, simplifying optimization and
significantly boosting inference speed. Additionally, we introduce a novel
tuning strategy called the detail preserver, which achieves more accurate and
fine-grained predictions. Without scaling up the training data or model
capacity, Lotus achieves state-of-the-art performance in zero-shot depth and normal
estimation across various datasets. It also significantly enhances efficiency,
being hundreds of times faster than most existing diffusion-based methods.