One Diffusion to Generate Them All
November 25, 2024
作者: Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu
cs.AI
Abstract
We introduce OneDiffusion, a versatile, large-scale diffusion model that
seamlessly supports bidirectional image synthesis and understanding across
diverse tasks. It enables conditional generation from inputs such as text,
depth, pose, layout, and semantic maps, while also handling tasks like image
deblurring, upscaling, and reverse processes such as depth estimation and
segmentation. Additionally, OneDiffusion allows for multi-view generation,
camera pose estimation, and instant personalization using sequential image
inputs. Our model takes a straightforward yet effective approach by treating
all tasks as frame sequences with varying noise scales during training,
allowing any frame to act as a conditioning image at inference time. Our
unified training framework removes the need for specialized architectures,
supports scalable multi-task training, and adapts smoothly to any resolution,
enhancing both generalization and scalability. Experimental results demonstrate
competitive performance across both generation and prediction tasks, such as
text-to-image, multi-view generation, ID preservation, depth estimation, and
camera pose estimation, despite a relatively small training dataset. Our code and
checkpoint are freely available at https://github.com/lehduong/OneDiffusion.
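
To make the abstract's central idea concrete, below is a minimal, hedged sketch of what training with per-frame noise scales could look like. This is not the authors' released code: the `denoiser` callable, the linear noising path, and the velocity-style regression target are illustrative assumptions. The key point matches the abstract: each frame in a task sequence gets its own noise scale, and frames assigned zero noise act as clean conditioning inputs, so any frame can play the condition role at inference.

```python
# Illustrative sketch only (assumptions: `denoiser` backbone, linear
# flow-matching-style noising path, velocity regression target).
import torch

def training_step(denoiser, frames, cond_mask):
    """frames:    (B, N, C, H, W) sequence of task frames
                  (e.g. an RGB image plus its depth map as two frames).
       cond_mask: (B, N) boolean, True where a frame is a clean condition."""
    B, N = frames.shape[:2]

    # Independent noise scale per frame; conditioning frames are pinned to
    # zero noise, so at inference any frame can serve as the clean condition.
    t = torch.rand(B, N, device=frames.device)
    t = t.masked_fill(cond_mask, 0.0)

    # Noise each frame along a simple linear path x_t = (1 - t) * x0 + t * eps.
    noise = torch.randn_like(frames)
    t_ = t.view(B, N, 1, 1, 1)
    noised = (1.0 - t_) * frames + t_ * noise

    # The model denoises all frames of the sequence jointly, seeing each
    # frame's own noise scale.
    pred = denoiser(noised, t)

    # Velocity-style target for the linear path; no loss on clean
    # condition frames (assumes at least one noised frame per batch).
    target = noise - frames
    loss = ((pred - target) ** 2)[~cond_mask].mean()
    return loss
```

Under this framing, "depth estimation" and "depth-conditioned generation" are the same sequence with the conditioning mask flipped, which is why, as the abstract notes, no task-specific architecture is needed.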