DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
April 5, 2025
Authors: Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani
cs.AI
Abstract
Generating naturalistic and nuanced listener motions for extended
interactions remains an open problem. Existing methods often rely on
low-dimensional motion codes for facial behavior generation followed by
photorealistic rendering, limiting both visual fidelity and expressive
richness. To address these challenges, we introduce DiTaiListener, powered by a
video diffusion model with multimodal conditions. Our approach first generates
short segments of listener responses conditioned on the speaker's speech and
facial motions with DiTaiListener-Gen. It then refines the transitional frames
via DiTaiListener-Edit for seamless transitions. Specifically,
DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener
head portrait generation by introducing a Causal Temporal Multimodal Adapter
(CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter
integrates speakers' input in a causal manner into the video generation process
to ensure temporally coherent listener responses. For long-form video
generation, we introduce DiTaiListener-Edit, a transition refinement
video-to-video diffusion model. The model fuses video segments into smooth and
continuous videos, ensuring temporal consistency in facial expressions and
image quality when merging short video segments produced by DiTaiListener-Gen.
Quantitatively, DiTaiListener achieves state-of-the-art performance on
benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion
representation (+6.1% in FD metric on VICO) spaces. User studies confirm the
superior performance of DiTaiListener, with the model being the clear
preference in terms of feedback, diversity, and smoothness, outperforming
competitors by a significant margin.
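
As a concrete illustration of the causal conditioning described above, here is a minimal PyTorch sketch (not the authors' implementation) of a causal cross-attention layer in the spirit of the CTM-Adapter: listener frame tokens attend only to speaker audio/visual features at or before their own timestep. The class name, the one-fused-feature-per-frame alignment, and the residual injection are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CausalCrossAttention(nn.Module):
        """Illustrative stand-in for the CTM-Adapter's causal conditioning."""
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, listener_tokens, speaker_feats):
            # listener_tokens: (B, T, D); speaker_feats: (B, T, D), assuming
            # one fused speaker audio+face feature per generated frame.
            T = listener_tokens.shape[1]
            # True entries are disallowed: frame t cannot attend to speaker
            # features from timesteps > t, keeping the response causal.
            mask = torch.triu(
                torch.ones(T, T, dtype=torch.bool, device=listener_tokens.device),
                diagonal=1,
            )
            out, _ = self.attn(listener_tokens, speaker_feats, speaker_feats,
                               attn_mask=mask)
            # Adapter-style residual injection into the video generation stream.
            return listener_tokens + out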
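
The long-form stitching step can likewise be sketched in a few lines. Here refine_transition stands in for DiTaiListener-Edit's video-to-video denoising pass over the boundary frames, and the overlap length is a hypothetical parameter.

    from typing import Callable, List, Sequence

    def stitch_long_video(clips: Sequence[List],
                          refine_transition: Callable[[List, List], List],
                          overlap: int = 8) -> List:
        # Fuse short clips from DiTaiListener-Gen into one continuous video,
        # re-synthesizing the frames around each seam so facial expressions
        # and image quality stay consistent across clip boundaries.
        video = list(clips[0])
        for nxt in clips[1:]:
            seam = refine_transition(video[-overlap:], list(nxt[:overlap]))
            video = video[:-overlap] + list(seam) + list(nxt[overlap:])
        return video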