PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Abstract

This paper presents PixWizard, a versatile image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we cover a large set of diverse vision tasks, including text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization to unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard
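
The abstract does not spell out how the any-resolution mechanism chooses an output size. As an illustration only, the sketch below shows one common way to process an image according to its aspect ratio: enumerate resolutions under a fixed patch budget and pick the one whose aspect ratio is closest to the input. The function names, the 512-pixel base size, the 16-pixel patch size, and the 4:1 ratio cap are all assumptions made for this example, not values taken from PixWizard.

```python
import math


def candidate_resolutions(base=512, patch=16, max_ratio=4.0):
    """List (height, width) pairs that fit a fixed patch budget.

    Hypothetical helper: every side is a multiple of `patch`, the total
    area never exceeds base * base, and the aspect ratio is capped.
    """
    budget = (base // patch) ** 2  # maximum number of patches per image
    sizes = []
    for h in range(1, budget + 1):
        w = budget // h  # widest grid that still fits the budget
        if max(h, w) / min(h, w) <= max_ratio:
            sizes.append((h * patch, w * patch))
    return sizes


def pick_resolution(height, width, candidates):
    """Return the candidate whose aspect ratio is closest to the input's.

    Ratios are compared in log space so that 2:1 and 1:2 are treated
    symmetrically.
    """
    target = math.log(width / height)
    return min(candidates, key=lambda hw: abs(math.log(hw[1] / hw[0]) - target))


if __name__ == "__main__":
    sizes = candidate_resolutions()
    # A 1080 x 1920 (height x width) landscape input maps to a wide bucket.
    print(pick_resolution(1080, 1920, sizes))
    # A square input maps to the square bucket.
    print(pick_resolution(800, 800, sizes))
```

Resizing the input to the selected size before patchifying keeps the token count roughly constant across aspect ratios, which is the usual motivation for this kind of bucketing.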