CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation
January 20, 2025
Authors: Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, Xiaodan Liang
cs.AI
Abstract
Virtual try-on (VTON) technology has gained attention due to its potential to
transform online retail by enabling realistic clothing visualization of images
and videos. However, most existing methods struggle to achieve high-quality
results across image and video try-on tasks, especially in long video
scenarios. In this work, we introduce CatV2TON, a simple and effective
vision-based virtual try-on (V2TON) method that supports both image and video
try-on tasks with a single diffusion transformer model. By temporally
concatenating garment and person inputs and training on a mix of image and
video datasets, CatV2TON achieves robust try-on performance across static and
dynamic settings. For efficient long-video generation, we propose an
overlapping clip-based inference strategy that uses sequential frame guidance
and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with
reduced resource demands. We also present ViViD-S, a refined video try-on
dataset, achieved by filtering back-facing frames and applying 3D mask
smoothing for enhanced temporal consistency. Comprehensive experiments
demonstrate that CatV2TON outperforms existing methods in both image and video
try-on tasks, offering a versatile and reliable solution for realistic virtual
try-ons across diverse scenarios.
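The abstract's core idea of "temporally concatenating garment and person inputs" can be illustrated with a minimal sketch: the garment frames are stacked before the person frames along the time axis so a single diffusion transformer sees both in one sequence. The `(T, C, H, W)` axis layout and the helper name below are illustrative assumptions, not the paper's actual tensor convention.

```python
import numpy as np

def temporal_concat(garment, person):
    """Stack garment frames ahead of person frames along the time axis.

    Illustrative sketch: the (T, C, H, W) layout is an assumption,
    not the paper's documented tensor format.
    """
    assert garment.shape[1:] == person.shape[1:], "channel/spatial dims must match"
    return np.concatenate([garment, person], axis=0)

# One garment "frame" and an 8-frame person clip of 3-channel 4x4 latents.
garment = np.zeros((1, 3, 4, 4))
person = np.ones((8, 3, 4, 4))
combined = temporal_concat(garment, person)
print(combined.shape)  # (9, 3, 4, 4)
```

With this layout, image try-on is just the special case where the person "clip" has a single frame, which is consistent with the abstract's claim that one model handles both image and video tasks.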
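The abstract names Adaptive Clip Normalization (AdaCN) for keeping overlapping clips consistent during long-video inference but does not give its formula. One plausible reading, sketched below purely as an assumption, is to match each new clip's latent statistics (computed over the frames it shares with the previous clip) to the previous clip's statistics over those same frames; the function name and mean/std matching are hypothetical, not the paper's exact method.

```python
import numpy as np

def adacn(new_clip, prev_clip, overlap, eps=1e-6):
    """Hypothetical AdaCN sketch: align a new clip's statistics to the
    previous clip over their overlapping frames, then apply the same
    correction to the entire new clip for temporal consistency."""
    ref = prev_clip[-overlap:]   # trailing frames of the previous clip
    cur = new_clip[:overlap]     # leading frames of the new clip
    mu_ref, std_ref = ref.mean(), ref.std()
    mu_cur, std_cur = cur.mean(), cur.std()
    return (new_clip - mu_cur) / (std_cur + eps) * std_ref + mu_ref

# Two synthetic 16-frame latent clips with deliberately different statistics.
prev_clip = np.random.default_rng(0).normal(0.0, 1.0, (16, 3, 4, 4))
new_clip = np.random.default_rng(1).normal(2.0, 3.0, (16, 3, 4, 4))
aligned = adacn(new_clip, prev_clip, overlap=4)
```

After alignment, the mean and standard deviation of the new clip's first four frames match those of the previous clip's last four frames, which is the kind of statistic-level continuity an overlapping-clip strategy needs.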