
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

February 24, 2025
作者: Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
cs.AI

Abstract

Our primary goal here is to create a good, generalist perception model that can tackle multiple tasks under constraints on computational resources and training data. To achieve this, we resort to text-to-image diffusion models pre-trained on billions of images. Comprehensive evaluations demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (i.e., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. Thus, DICEPTION can be trained at a cost orders of magnitude lower than that of conventional models trained from scratch. When adapting our model to other tasks, it requires fine-tuning as little as 1% of its parameters on as few as 50 images. DICEPTION provides valuable insights and a more promising solution for visual generalist models.
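The color-encoding idea described above can be illustrated with a short sketch: each instance mask is painted with a randomly drawn RGB color so that segmentation output becomes an ordinary image a diffusion model can generate. This is a hypothetical illustration of the general strategy, not the authors' implementation; the function name and seeding scheme are assumptions.

```python
import numpy as np

def encode_instances_as_colors(instance_masks, seed=0):
    """Render a list of binary instance masks as one RGB image,
    assigning each instance a random (non-black) color.
    Illustrative sketch of the color-encoding strategy only."""
    rng = np.random.default_rng(seed)
    h, w = instance_masks[0].shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)  # background stays black
    for mask in instance_masks:
        # Draw channel values in [1, 255] so no instance collides with the background.
        color = rng.integers(1, 256, size=3, dtype=np.uint8)
        canvas[mask.astype(bool)] = color
    return canvas

# Two toy 4x4 masks: top-left and bottom-right squares.
a = np.zeros((4, 4)); a[:2, :2] = 1
b = np.zeros((4, 4)); b[2:, 2:] = 1
img = encode_instances_as_colors([a, b])
```

Because the target is just an RGB image, the same encoding covers entity and semantic segmentation alike; for semantic segmentation one would reuse a fixed color per class rather than drawing a fresh one per instance.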

