Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
April 15, 2025
Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang
cs.AI
Abstract
With the success of image generation, generative diffusion models are
increasingly adopted for discriminative tasks, as pixel generation provides a
unified perception interface. However, directly repurposing the generative
denoising process for discriminative objectives reveals critical gaps rarely
addressed previously. Generative models tolerate intermediate sampling errors
if the final distribution remains plausible, but discriminative tasks require
rigorous accuracy throughout, as evidenced in challenging multi-modal tasks
like referring image segmentation. Motivated by this gap, we analyze and
enhance alignment between generative diffusion processes and perception tasks,
focusing on how perception quality evolves during denoising. We find: (1)
earlier denoising steps contribute disproportionately to perception quality,
prompting us to propose tailored learning objectives reflecting varying
timestep contributions; (2) later denoising steps show unexpected perception
degradation, highlighting sensitivity to training-denoising distribution
shifts, addressed by our diffusion-tailored data augmentation; and (3)
generative processes uniquely enable interactivity, serving as controllable
user interfaces adaptable to correctional prompts in multi-round interactions.
Our insights significantly improve diffusion-based perception models without
architectural changes, achieving state-of-the-art performance on depth
estimation, referring image segmentation, and generalist perception tasks. Code
available at https://github.com/ziqipang/ADDP.
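
Finding (1) says that earlier denoising steps matter more for perception quality and that the learning objective should reflect this. The abstract does not specify the actual objective, so the following is only a minimal sketch of one generic way to emphasize some timesteps during training: non-uniform timestep sampling. The function names, the power-law weighting, and the `emphasis` parameter are all hypothetical and are not taken from the paper or its code. The sketch also assumes the common convention that a larger index t means more noise, so the earliest denoising steps correspond to the largest t.

```python
import random


def timestep_weights(num_steps, emphasis=2.0):
    """Return normalized sampling weights over timesteps 0..num_steps-1.

    Assumption: larger t is noisier, i.e. visited earlier during denoising.
    The power-law form is illustrative only; emphasis=0 gives uniform weights.
    """
    raw = [((t + 1) / num_steps) ** emphasis for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_timesteps(num_steps, batch_size, emphasis=2.0, seed=None):
    """Draw a batch of training timesteps, favoring the noisier ones."""
    rng = random.Random(seed)
    weights = timestep_weights(num_steps, emphasis)
    return rng.choices(range(num_steps), weights=weights, k=batch_size)


if __name__ == "__main__":
    # Draw 8 timesteps out of 1000; most should fall in the noisier range.
    print(sample_timesteps(num_steps=1000, batch_size=8, seed=0))
```

In a training loop, the sampled timesteps would replace the usual uniform draw, so that the steps assumed to matter most are visited more often. Reweighting the per-timestep loss instead of the sampling distribution would be an alternative with a similar intent.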