PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation

Abstract

Recent research explores the potential of Diffusion Models (DMs) for consistent object editing, which aims to modify an object's position, size, composition, and similar properties while keeping the object and background consistent, without changing their texture or attributes. Current inference-time methods often rely on DDIM inversion, which inherently compromises both efficiency and the achievable consistency of edited images. Recent methods also use energy guidance, which iteratively updates the predicted noise and can drive the latents away from the original image, resulting in distortions. In this paper, we propose PixelMan, an inversion-free and training-free method for consistent object editing via Pixel Manipulation and generation. We directly create a duplicate copy of the source object at the target location in pixel space, and introduce an efficient sampling approach that iteratively harmonizes the manipulated object into the target location and inpaints its original location. Image consistency is ensured by anchoring the edited image to the pixel-manipulated image and by applying several consistency-preserving optimization techniques during inference. Experimental evaluations on benchmark datasets, together with extensive visual comparisons, show that in as few as 16 inference steps PixelMan outperforms a range of state-of-the-art training-based and training-free methods (which usually require 50 steps) on multiple consistent object editing tasks.
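
As a rough illustration of the pixel-space duplication step described in the abstract, the sketch below copies a masked object to a shifted location with NumPy. It is not the authors' implementation: the function name `duplicate_object`, its parameters, and the use of a boolean mask with an integer offset are all assumptions made for this example, and the subsequent diffusion sampling (harmonization and inpainting) is not shown.

```python
import numpy as np


def duplicate_object(image, mask, dx, dy):
    """Paste a copy of the masked object, shifted by (dx, dy) pixels.

    Hypothetical helper for illustration only.
    image: (H, W, C) array.
    mask:  (H, W) boolean array marking the source object.
    Returns the pixel-manipulated image and the mask of the pasted copy.
    The source pixels are left in place; per the abstract, the original
    location would be inpainted later during sampling.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)          # coordinates of the source object
    ty, tx = ys + dy, xs + dx          # coordinates of the target copy
    # Drop any pixels that would land outside the image bounds.
    keep = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)

    out = image.copy()
    # Read from the untouched original so overlapping regions copy correctly.
    out[ty[keep], tx[keep]] = image[ys[keep], xs[keep]]

    target_mask = np.zeros((h, w), dtype=bool)
    target_mask[ty[keep], tx[keep]] = True
    return out, target_mask


# Toy usage: move a single bright pixel two columns right and one row down.
toy_image = np.zeros((4, 4, 1), dtype=np.uint8)
toy_image[0, 0, 0] = 5
toy_mask = np.zeros((4, 4), dtype=bool)
toy_mask[0, 0] = True
edited, pasted = duplicate_object(toy_image, toy_mask, dx=2, dy=1)
```

In this sketch the returned target mask marks where the copy was pasted, which is the kind of region an editing pipeline might later harmonize, while the source mask marks the region to inpaint.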

