
Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

April 3, 2025
Authors: Fa-Ting Hong, Zunnan Xu, Zixiang Zhou, Jun Zhou, Xiu Li, Qin Lin, Qinglin Lu, Dan Xu
cs.AI

Abstract

Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel Mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gating mechanism applied across all branches provides flexible control over video generation. To ensure that the controlled video is naturally coordinated both temporally and spatially, we employ the Mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the Mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the Mamba layer seamlessly integrates multiple driving modalities without conflict.
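To make the abstract's design concrete, below is a minimal PyTorch sketch of the described idea: parallel per-signal branches over the video feature tokens, a mask-drop that restricts each signal to its facial region, and a learned gate that blends the branches. This is not the authors' implementation; the names `BranchSSM` and `ParallelControlBlock` are hypothetical, and a GRU scan over the flattened spatio-temporal token sequence stands in for the selective state-space (Mamba) layer.

```python
# Hypothetical sketch of ACTalker's parallel-branch control block (not the
# authors' code). Each branch injects one driving signal (e.g., audio or
# expression) into the video feature tokens; mask-drop zeroes tokens outside
# that signal's facial region so branches do not fight over the same pixels,
# and a per-token gate blends the branch outputs.
import torch
import torch.nn as nn

class BranchSSM(nn.Module):
    """Stand-in for one selective state-space (Mamba-style) branch.

    A real implementation would use a selective scan; here a GRU scans the
    flattened spatio-temporal token sequence as a cheap placeholder.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.inject = nn.Linear(dim, dim)          # fuse the driving signal
        self.scan = nn.GRU(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, signal, region_mask):
        # tokens:      (B, N, D) video feature tokens, N = T*H*W flattened
        # signal:      (B, N, D) driving-signal features aligned to the tokens
        # region_mask: (B, N, 1) 1 inside this signal's facial region, else 0
        x = tokens + self.inject(signal)
        x = x * region_mask                        # mask-drop: ignore other regions
        out, _ = self.scan(x)                      # scan across time and space jointly
        return self.norm(out) * region_mask

class ParallelControlBlock(nn.Module):
    def __init__(self, dim: int, num_branches: int = 2):
        super().__init__()
        self.branches = nn.ModuleList(BranchSSM(dim) for _ in range(num_branches))
        self.gate = nn.Linear(dim, num_branches)   # per-token branch weights

    def forward(self, tokens, signals, masks):
        # signals, masks: lists with one entry per branch
        outs = torch.stack(
            [b(tokens, s, m) for b, s, m in zip(self.branches, signals, masks)],
            dim=-1,                                # (B, N, D, num_branches)
        )
        w = torch.softmax(self.gate(tokens), dim=-1)  # (B, N, num_branches)
        fused = (outs * w.unsqueeze(2)).sum(-1)       # gated blend of branches
        return tokens + fused                         # residual update

# Toy usage: 2 branches (e.g., audio and expression) over 8x(4x4) tokens.
B, N, D = 1, 8 * 16, 64
block = ParallelControlBlock(D, num_branches=2)
tokens = torch.randn(B, N, D)
signals = [torch.randn(B, N, D) for _ in range(2)]
masks = [torch.ones(B, N, 1), torch.ones(B, N, 1)]
print(block(tokens, signals, masks).shape)  # torch.Size([1, 128, 64])
```

One design point the sketch mirrors: because masking happens inside each branch rather than after fusion, a branch never produces updates for regions it does not own, which is how the paper argues control conflicts between modalities are avoided.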
