One Shot, One Talk: Whole-body Talking Avatar from a Single Image

December 2, 2024
Authors: Jun Xiang, Yudong Guo, Leipeng Hu, Boyang Guo, Yancheng Yuan, Juyong Zhang
cs.AI

Abstract

Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
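The abstract describes fitting a 3DGS-mesh hybrid avatar to imperfect, diffusion-generated pseudo-frames while regularizing against their inconsistency. The paper's exact loss terms and weights are not given here; the snippet below is only a minimal sketch of such an objective, assuming a photometric pseudo-label term plus hypothetical mesh-coupling and smoothness regularizers with illustrative weights.

```python
import torch


def training_step(rendered, pseudo_frame, gauss_offsets, mesh_laplacian,
                  w_photo=1.0, w_couple=0.1, w_smooth=0.01):
    """One optimization step of the avatar against a noisy pseudo-label frame.

    rendered:       (3, H, W) image rendered from the 3DGS-mesh hybrid avatar
    pseudo_frame:   (3, H, W) imperfect frame from the pose-guided video diffusion model
    gauss_offsets:  (N, 3) offsets of Gaussians from their anchoring mesh surface
    mesh_laplacian: (V, 3) Laplacian of the deformed mesh vertices
    """
    # Photometric supervision from the (noisy) pseudo-label frame.
    loss_photo = torch.nn.functional.l1_loss(rendered, pseudo_frame)

    # Hypothetical coupling term: keep Gaussians tied to the mesh surface
    # so inconsistent pseudo-frames cannot pull them arbitrarily far away.
    loss_couple = gauss_offsets.norm(dim=-1).mean()

    # Hypothetical smoothness term: penalize high-frequency mesh deformation.
    loss_smooth = mesh_laplacian.norm(dim=-1).mean()

    return w_photo * loss_photo + w_couple * loss_couple + w_smooth * loss_smooth


if __name__ == "__main__":
    # Dummy tensors stand in for the real renderer and diffusion outputs.
    H, W, N, V = 64, 64, 1000, 500
    rendered = torch.rand(3, H, W, requires_grad=True)
    pseudo_frame = torch.rand(3, H, W)
    gauss_offsets = torch.randn(N, 3, requires_grad=True)
    mesh_laplacian = torch.randn(V, 3, requires_grad=True)

    loss = training_step(rendered, pseudo_frame, gauss_offsets, mesh_laplacian)
    loss.backward()
    print(f"total loss: {loss.item():.4f}")
```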
