DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
April 5, 2025
Authors: Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani
cs.AI
Abstract
Generating naturalistic and nuanced listener motions for extended
interactions remains an open problem. Existing methods often rely on
low-dimensional motion codes for facial behavior generation followed by
photorealistic rendering, limiting both visual fidelity and expressive
richness. To address these challenges, we introduce DiTaiListener, powered by a
video diffusion model with multimodal conditions. Our approach first generates
short segments of listener responses conditioned on the speaker's speech and
facial motions with DiTaiListener-Gen. It then refines the transitional frames
via DiTaiListener-Edit for a seamless transition. Specifically,
DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener
head portrait generation by introducing a Causal Temporal Multimodal Adapter
(CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter
integrates speakers' input in a causal manner into the video generation process
to ensure temporally coherent listener responses. For long-form video
generation, we introduce DiTaiListener-Edit, a transition refinement
video-to-video diffusion model. The model fuses video segments into smooth and
continuous videos, ensuring temporal consistency in facial expressions and
image quality when merging short video segments produced by DiTaiListener-Gen.
Quantitatively, DiTaiListener achieves state-of-the-art performance on
benchmark datasets in both the photorealism (+73.8% in FID on RealTalk) and
motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm
the superior performance of DiTaiListener, with the model being the clear
preference in terms of feedback, diversity, and smoothness, outperforming
competitors by a significant margin.
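
The abstract's core architectural idea, the CTM-Adapter conditioning listener video tokens on speaker cues in a causal, frame-wise manner, can be pictured with a short sketch. Everything below is an illustrative assumption rather than the paper's actual implementation: the class name, the use of cross-attention with a triangular mask, the per-frame token shapes, and the premise that audio and visual speaker cues arrive pre-fused as one token stream.

```python
# A minimal sketch of causal temporal cross-attention, in the spirit of the
# CTM-Adapter described in the abstract. Module names, dimensions, and the
# residual injection scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class CausalTemporalAdapter(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, listener_tokens: torch.Tensor,
                speaker_tokens: torch.Tensor) -> torch.Tensor:
        # listener_tokens: (B, T, D) video tokens, one per frame
        # speaker_tokens:  (B, T, D) fused audio+visual speaker cues per frame
        B, T, D = listener_tokens.shape
        # Causal mask: frame t may attend only to speaker cues at frames <= t,
        # so the generated reaction never depends on future speaker input.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                     device=listener_tokens.device), diagonal=1)
        out, _ = self.attn(query=listener_tokens,
                           key=speaker_tokens,
                           value=speaker_tokens,
                           attn_mask=mask)
        # Residual injection leaves the pretrained DiT pathway intact.
        return self.norm(listener_tokens + out)


# Usage: 16 frames of 512-dim tokens for a batch of 2.
adapter = CausalTemporalAdapter(dim=512)
y = adapter(torch.randn(2, 16, 512), torch.randn(2, 16, 512))  # (2, 16, 512)
```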
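The two-stage long-form pipeline (short clips from DiTaiListener-Gen, seams refined by DiTaiListener-Edit) might be wired together as in the sketch below. `gen_model`, `edit_model`, and the fixed-overlap stitching are hypothetical stand-ins; the abstract does not specify how segments are delimited or how many frames the edit model re-synthesizes.

```python
# A hedged sketch of long-form generation by segment stitching. Assumes
# edit_model takes the trailing and leading overlap windows of two adjacent
# clips and returns re-synthesized frames covering both windows.
import torch


def generate_long_video(gen_model, edit_model, speaker_chunks, overlap: int = 4):
    # Generate one short listener clip per speaker chunk; each clip: (T, H, W, C).
    clips = [gen_model(chunk) for chunk in speaker_chunks]
    video = clips[0]
    for clip in clips[1:]:
        # Refine the seam so expressions and image quality stay consistent
        # across the boundary between consecutive clips.
        seam = edit_model(video[-overlap:], clip[:overlap])
        video = torch.cat([video[:-overlap], seam, clip[overlap:]], dim=0)
    return video
```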