FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
November 15, 2024
Authors: Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Yanwei Fu
cs.AI
Abstract
Although image-based virtual try-on has made considerable progress, emerging
approaches still encounter challenges in producing high-fidelity and robust
fitting images across diverse scenarios. These methods often struggle with
issues such as texture-aware maintenance and size-aware fitting, which hinder
their overall effectiveness. To address these limitations, we propose a novel
garment perception enhancement technique, termed FitDiT, designed for
high-fidelity virtual try-on using Diffusion Transformers (DiT) that allocate
more parameters and attention to high-resolution features. First, to further
improve texture-aware maintenance, we introduce a garment texture extractor
that incorporates garment priors evolution to fine-tune garment features,
helping it better capture rich details such as stripes, patterns, and
text. Additionally, we introduce frequency-domain learning by customizing a
frequency distance loss to enhance high-frequency garment details. To tackle
the size-aware fitting issue, we employ a dilated-relaxed mask strategy that
adapts to the correct length of garments, preventing the generation of garments
that fill the entire mask area during cross-category try-on. Equipped with the
above design, FitDiT surpasses all baselines in both qualitative and
quantitative evaluations. It excels in producing well-fitting garments with
photorealistic and intricate details, while also achieving a competitive
inference time of 4.57 seconds for a single 1024x768 image after DiT structure
slimming, outperforming existing methods.
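
The abstract mentions a frequency distance loss for enhancing high-frequency garment details but does not define it. The following is a minimal sketch of one generic way such a loss could look, not the formulation used by FitDiT. The function name `frequency_distance_loss`, the radial weighting, and the `alpha` parameter are all assumptions introduced here for illustration.

```python
import numpy as np


def frequency_distance_loss(pred, target, alpha=1.0):
    """Weighted squared distance between the 2D spectra of two images.

    Assumption: a generic frequency-domain distance for illustration only;
    the radial weight and `alpha` are hypothetical, not taken from the paper.
    Inputs are arrays whose last two axes are (height, width).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    h, w = pred.shape[-2:]

    # Orthonormal 2D FFT over the last two axes, so spectral energy matches
    # pixel energy (Parseval's theorem).
    diff_f = np.fft.fft2(pred - target, norm="ortho")

    # Radial frequency of every FFT bin, normalized to [0, 1].
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius = radius / max(radius.max(), np.finfo(np.float64).tiny)

    # Hypothetical weighting: alpha > 0 emphasizes high frequencies,
    # alpha = 0 weights every bin equally.
    weight = radius ** alpha
    return float(np.mean(weight * np.abs(diff_f) ** 2))
```

With `alpha=0` the weights are all one and, by Parseval's theorem, the value reduces to the ordinary pixel-space mean squared error. With `alpha > 0`, errors in fine structure such as stripes or text contribute more than smooth, low-frequency errors, which is the general motivation for learning in the frequency domain.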