Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
March 26, 2025
Authors: Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Kim Sung-Bin, Suekyeong Nam, Tae-Hyun Oh
cs.AI
Abstract
Recent advancements in speech-driven 3D talking head generation have made
significant progress in lip synchronization. However, existing models still
struggle to capture the perceptual alignment between varying speech
characteristics and corresponding lip movements. In this work, we claim that
three criteria -- Temporal Synchronization, Lip Readability, and Expressiveness
-- are crucial for achieving perceptually accurate lip movements. Motivated by
our hypothesis that a desirable representation space exists to meet these three
criteria, we introduce a speech-mesh synchronized representation that captures
intricate correspondences between speech signals and 3D face meshes. We found
that our learned representation exhibits desirable characteristics, and we plug
it into existing models as a perceptual loss to better align lip movements to
the given speech. In addition, we utilize this representation as a perceptual
metric and introduce two other physically grounded lip synchronization metrics
to assess how well the generated 3D talking heads align with these three
criteria. Experiments show that training 3D talking head generation models with
our perceptual loss significantly improves all three aspects of perceptually
accurate lip synchronization. Code and datasets are available at
https://perceptual-3d-talking-head.github.io/.
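
The abstract does not state the exact form of the perceptual loss. As a rough illustration only, the sketch below assumes that paired speech and mesh windows are already embedded into a shared space by some pretrained encoders (not shown), and that the loss is one minus their cosine similarity, added to a model's existing reconstruction loss with a weight. The function names, the cosine form, and the default `weight` are all assumptions made for this example, not details taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def perceptual_loss(speech_embs, mesh_embs):
    """Mean of (1 - cosine similarity) over paired windows.

    speech_embs, mesh_embs: lists of embedding vectors, one per time window.
    Assumed form for illustration; the paper may define the loss differently.
    """
    if len(speech_embs) != len(mesh_embs):
        raise ValueError("speech and mesh sequences must have the same length")
    losses = [1.0 - cosine_similarity(s, m) for s, m in zip(speech_embs, mesh_embs)]
    return sum(losses) / len(losses)

def total_loss(recon_loss, speech_embs, mesh_embs, weight=0.1):
    """Existing reconstruction loss plus the weighted perceptual term.

    `weight` is a hypothetical hyperparameter chosen for this sketch.
    """
    return recon_loss + weight * perceptual_loss(speech_embs, mesh_embs)
```

Under these assumptions, the same similarity score could also be reported directly as a perceptual metric on generated meshes, which mirrors the abstract's dual use of the representation as both a loss and a metric.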