One Diffusion to Generate Them All

November 25, 2024
Authors: Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu
cs.AI

Abstract

We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across both generation and prediction tasks, such as text-to-image, multi-view generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoints are freely available at https://github.com/lehduong/OneDiffusion.
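The abstract's core mechanism is that every task is cast as a sequence of frames, each noised with its own independently sampled noise scale during training, so that at inference a frame can be turned into a condition simply by holding its noise scale at zero. The sketch below is a minimal, hypothetical illustration of that idea, assuming a flow-matching-style objective; `denoiser`, `train_step`, and `conditional_sample` are illustrative names, not the paper's actual API.

```python
# Minimal sketch of the per-frame noise-scale idea described in the abstract.
# All names here are illustrative assumptions, not OneDiffusion's real code.
import torch

def train_step(denoiser, frames):
    """frames: (batch, num_frames, C, H, W) clean images for one task."""
    b, n = frames.shape[:2]
    # Sample an independent noise scale (timestep in [0, 1]) for each frame.
    t = torch.rand(b, n, device=frames.device)
    noise = torch.randn_like(frames)
    t_ = t.view(b, n, 1, 1, 1)
    # Flow-matching-style linear interpolation between data and noise.
    noisy = (1 - t_) * frames + t_ * noise
    # The model sees the whole sequence plus per-frame timesteps and
    # predicts the velocity (noise - data) for every frame jointly.
    pred = denoiser(noisy, t)
    target = noise - frames
    return torch.nn.functional.mse_loss(pred, target)

@torch.no_grad()
def conditional_sample(denoiser, cond_frame, num_steps=50):
    """Generate one frame conditioned on another: the condition frame is
    kept at noise scale 0 (clean) while the target frame is denoised."""
    b = cond_frame.shape[0]
    target = torch.randn_like(cond_frame)
    for i in range(num_steps, 0, -1):
        t_hi, t_lo = i / num_steps, (i - 1) / num_steps
        frames = torch.stack([cond_frame, target], dim=1)
        # Condition frame: t = 0; target frame: current noise level.
        t = torch.tensor([[0.0, t_hi]], device=cond_frame.device).expand(b, -1)
        v = denoiser(frames, t)[:, 1]        # velocity for the target frame
        target = target - (t_hi - t_lo) * v  # Euler step toward the data
    return target
```

Because training randomizes the noise scale per frame rather than per sequence, the same `denoiser` handles both directions: noising the image while keeping the depth map clean yields depth-conditioned generation, and the reverse yields depth estimation.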
