FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
November 15, 2024
Authors: Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Yanwei Fu
cs.AI
Abstract
Although image-based virtual try-on has made considerable progress, emerging
approaches still encounter challenges in producing high-fidelity and robust
fitting images across diverse scenarios. These methods often struggle with
issues such as texture-aware maintenance and size-aware fitting, which hinder
their overall effectiveness. To address these limitations, we propose a novel
garment perception enhancement technique, termed FitDiT, designed for
high-fidelity virtual try-on using Diffusion Transformers (DiT), allocating more
parameters and attention to high-resolution features. First, to further improve
texture-aware maintenance, we introduce a garment texture extractor that
incorporates garment priors evolution to fine-tune garment features,
helping it better capture rich details such as stripes, patterns, and
text. Additionally, we introduce frequency-domain learning by customizing a
frequency distance loss to enhance high-frequency garment details. To tackle
the size-aware fitting issue, we employ a dilated-relaxed mask strategy that
adapts to the correct length of garments, preventing the generation of garments
that fill the entire mask area during cross-category try-on. Equipped with the
above design, FitDiT surpasses all baselines in both qualitative and
quantitative evaluations. It excels at producing well-fitting garments with
photorealistic, intricate details, while also achieving a competitive
inference time of 4.57 seconds for a single 1024x768 image after DiT structure
slimming, outperforming existing methods.
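
The abstract does not give the exact form of the frequency distance loss, so the following is only a minimal NumPy sketch of the general idea: compare two images in the Fourier domain and penalize mismatches at high spatial frequencies more heavily. The function name `frequency_distance` and the radial weighting are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def frequency_distance(pred, target):
    """Frequency-weighted squared distance between two images.

    Illustrative only: the radial weighting is an assumption, not the
    loss defined in the FitDiT paper.
    pred, target: float arrays of shape (H, W) or (H, W, C).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    # 2-D FFT over the spatial axes; by linearity this is the spectrum
    # of the pixel-wise error.
    diff_f = np.fft.fft2(pred - target, axes=(0, 1), norm="ortho")

    # Radial frequency of every FFT bin, scaled to [0, 1].
    h, w = pred.shape[:2]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    weight = radius / max(radius.max(), 1e-12)
    if pred.ndim == 3:
        weight = weight[..., None]  # broadcast over channels

    # Without the weight this would equal the pixel-space MSE
    # (Parseval's theorem); the weight shifts emphasis to fine detail.
    return float(np.mean(weight * np.abs(diff_f) ** 2))
```

In a training setup, a term of this kind would typically be added to the usual diffusion objective rather than replace it. A constant brightness offset contributes nothing here because the weight at the zero-frequency bin is zero, while errors in fine stripes or text are weighted most.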