PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Abstract

This paper presents PixWizard, a versatile image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we unify a variety of vision tasks into a single image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we cover a large and diverse set of vision tasks, including text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, which lets the model dynamically process images according to the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization to unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard.
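
The abstract does not say how the any-resolution mechanism selects a working resolution. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way to handle inputs by aspect ratio: enumerate candidate sizes under a fixed pixel budget and pick the one whose aspect ratio is closest to the input's. The function names, the 512 x 512 pixel budget, and the side multiple of 64 are all assumptions made for this example.

```python
import math


def candidate_resolutions(max_pixels=512 * 512, multiple=64):
    """Enumerate (width, height) pairs whose sides are multiples of `multiple`
    and whose area stays within `max_pixels`. Both defaults are assumed values."""
    sizes = []
    # Widest allowed side: the width at which the height shrinks to `multiple`.
    limit = max_pixels // multiple
    for w in range(multiple, limit + 1, multiple):
        # Largest height (rounded down to a multiple) that fits the budget.
        h = (max_pixels // w) // multiple * multiple
        if h >= multiple:
            sizes.append((w, h))
    return sizes


def pick_resolution(width, height, candidates):
    """Return the candidate whose aspect ratio is closest to the input's,
    comparing ratios in log space so wide and tall images are treated alike."""
    target = math.log(width / height)
    return min(candidates, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    buckets = candidate_resolutions()
    for size in [(1024, 1024), (1920, 1080), (600, 1400)]:
        print(size, "->", pick_resolution(*size, buckets))
```

Under these assumed settings, a square input maps to 512 x 512 and a 16:9 input maps to a wide bucket, so the image is resized without being forced into a fixed square shape.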

