DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
April 5, 2025
Authors: Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani
cs.AI
Abstract
Generating naturalistic and nuanced listener motions for extended
interactions remains an open problem. Existing methods often rely on
low-dimensional motion codes for facial behavior generation followed by
photorealistic rendering, limiting both visual fidelity and expressive
richness. To address these challenges, we introduce DiTaiListener, powered by a
video diffusion model with multimodal conditions. Our approach first generates
short segments of listener responses conditioned on the speaker's speech and
facial motions with DiTaiListener-Gen. It then refines the transitional frames
via DiTaiListener-Edit for a seamless transition. Specifically,
DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener
head portrait generation by introducing a Causal Temporal Multimodal Adapter
(CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter
integrates speakers' input in a causal manner into the video generation process
to ensure temporally coherent listener responses. For long-form video
generation, we introduce DiTaiListener-Edit, a transition refinement
video-to-video diffusion model. The model fuses video segments into smooth and
continuous videos, ensuring temporal consistency in facial expressions and
image quality when merging short video segments produced by DiTaiListener-Gen.
Quantitatively, DiTaiListener achieves state-of-the-art performance on
benchmark datasets in both the photorealism (+73.8% in FID on RealTalk) and
motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm
the superior performance of DiTaiListener, with the model being the clear
preference in terms of feedback, diversity, and smoothness, outperforming
competitors by a significant margin.
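
The abstract's core architectural idea, the CTM-Adapter conditioning listener video tokens on speaker cues in a causal, frame-wise manner, can be pictured with a short sketch. Everything below is an illustrative assumption rather than the paper's actual implementation: the class name, the use of cross-attention with a triangular mask, the per-frame token shapes, and the premise that audio and visual speaker cues arrive pre-fused as one token stream.

```python
# A minimal sketch of causal temporal cross-attention, in the spirit of the
# CTM-Adapter described in the abstract. Module names, dimensions, and the
# residual injection scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class CausalTemporalAdapter(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, listener_tokens: torch.Tensor,
                speaker_tokens: torch.Tensor) -> torch.Tensor:
        # listener_tokens: (B, T, D) video tokens, one per frame
        # speaker_tokens:  (B, T, D) fused audio+visual speaker cues per frame
        B, T, D = listener_tokens.shape
        # Causal mask: frame t may attend only to speaker cues at frames <= t,
        # so the generated reaction never depends on future speaker input.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                     device=listener_tokens.device), diagonal=1)
        out, _ = self.attn(query=listener_tokens,
                           key=speaker_tokens,
                           value=speaker_tokens,
                           attn_mask=mask)
        # Residual injection leaves the pretrained DiT pathway intact.
        return self.norm(listener_tokens + out)


# Usage: 16 frames of 512-dim tokens for a batch of 2.
adapter = CausalTemporalAdapter(dim=512)
y = adapter(torch.randn(2, 16, 512), torch.randn(2, 16, 512))  # (2, 16, 512)
```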
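The two-stage long-form pipeline (short clips from DiTaiListener-Gen, seams refined by DiTaiListener-Edit) might be wired together as in the sketch below. `gen_model`, `edit_model`, and the fixed-overlap stitching are hypothetical stand-ins; the abstract does not specify how segments are delimited or how many frames the edit model re-synthesizes.

```python
# A hedged sketch of long-form generation by segment stitching. Assumes
# edit_model takes the trailing and leading overlap windows of two adjacent
# clips and returns re-synthesized frames covering both windows.
import torch


def generate_long_video(gen_model, edit_model, speaker_chunks, overlap: int = 4):
    # Generate one short listener clip per speaker chunk; each clip: (T, H, W, C).
    clips = [gen_model(chunk) for chunk in speaker_chunks]
    video = clips[0]
    for clip in clips[1:]:
        # Refine the seam so expressions and image quality stay consistent
        # across the boundary between consecutive clips.
        seam = edit_model(video[-overlap:], clip[:overlap])
        video = torch.cat([video[:-overlap], seam, clip[overlap:]], dim=0)
    return video
```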