
DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

April 5, 2025
Authors: Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani
cs.AI

Abstract

Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. The CTM-Adapter integrates the speaker's input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition-refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both the photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
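To make the CTM-Adapter's causal conditioning concrete, below is a minimal sketch of cross-attention from listener video tokens to per-frame speaker audio and face features, masked so that listener frame t only attends to speaker cues at frames <= t. All module names, shapes, and hyperparameters here are illustrative assumptions based on the abstract, not the paper's actual implementation.

```python
# Sketch of a causal temporal multimodal adapter in the spirit of the
# CTM-Adapter described in the abstract. Shapes and names are assumptions.
import torch
import torch.nn as nn

class CausalTemporalAdapter(nn.Module):
    """Cross-attends listener video tokens to fused speaker audio/face tokens,
    with a causal mask so frame t never sees future speaker cues."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)  # fuse audio + face features per frame

    def forward(self, listener_tokens, speaker_audio, speaker_face):
        # listener_tokens: (B, T, dim) latent video tokens, one per frame
        # speaker_audio, speaker_face: (B, T, dim) per-frame speaker features
        B, T, _ = listener_tokens.shape
        speaker = self.proj(torch.cat([speaker_audio, speaker_face], dim=-1))

        # Boolean causal mask: True entries are blocked, so position t may
        # attend only to speaker positions <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                     device=listener_tokens.device), diagonal=1)
        attended, _ = self.attn(query=self.norm(listener_tokens),
                                key=speaker, value=speaker, attn_mask=mask)
        return listener_tokens + attended  # residual injection into a DiT block
```

An adapter like this would be inserted into the DiT's blocks so that the denoising of each listener frame is steered only by speaker cues observed so far, which is what makes the generated responses temporally coherent rather than anticipatory.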
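Similarly, the role of DiTaiListener-Edit when merging short clips can be illustrated with a simple stitching scheme: re-generate a window of frames around each segment boundary, conditioned on clean context frames on either side. The `edit_model` callable and the window size are assumptions; the paper's actual conditioning scheme may differ.

```python
# Illustrative segment stitching with a transition-refinement model, loosely
# following the role of DiTaiListener-Edit. Not the paper's implementation.
import torch

def stitch_segments(seg_a: torch.Tensor, seg_b: torch.Tensor,
                    edit_model, k: int = 8) -> torch.Tensor:
    # seg_a, seg_b: (T, C, H, W) consecutive clips from DiTaiListener-Gen.
    # Re-generate the 2k frames around the boundary, conditioned on untouched
    # context frames, so expressions and image quality stay consistent.
    window = torch.cat([seg_a[-k:], seg_b[:k]], dim=0)          # frames to refine
    context = torch.cat([seg_a[-2*k:-k], seg_b[k:2*k]], dim=0)  # clean anchors
    refined = edit_model(window, context)  # video-to-video diffusion pass
    return torch.cat([seg_a[:-k], refined, seg_b[k:]], dim=0)
```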

