Understanding Co-speech Gestures in-the-wild
March 28, 2025
Authors: Sindhu B Hegde, K R Prajwal, Taein Kwon, Andrew Zisserman
cs.AI
Abstract
Co-speech gestures play a vital role in non-verbal communication. In this
paper, we introduce a new framework for co-speech gesture understanding in the
wild. Specifically, we propose three new tasks and benchmarks to evaluate a
model's capability to comprehend gesture-text-speech associations: (i)
gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker
detection using gestures. We present a new approach that learns a tri-modal
speech-text-video-gesture representation to solve these tasks. By leveraging a
combination of global phrase contrastive loss and local gesture-word coupling
loss, we demonstrate that a strong gesture representation can be learned in a
weakly supervised manner from videos in the wild. Our learned representations
outperform previous methods, including large vision-language models (VLMs),
across all three tasks. Further analysis reveals that speech and text
modalities capture distinct gesture-related signals, underscoring the
advantages of learning a shared tri-modal embedding space. The dataset, model,
and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
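To make the training objective concrete, the following is a minimal PyTorch-style sketch of how a global phrase-level contrastive loss can be combined with a local gesture-word coupling loss. This is not the authors' implementation: the function names, mean pooling, max-over-time word-gesture matching, the temperature of 0.07, and the `lambda_local` weighting are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of combining a global
# phrase contrastive loss with a local gesture-word coupling loss.
import torch
import torch.nn.functional as F


def global_phrase_contrastive_loss(gesture_emb, phrase_emb, temperature=0.07):
    """Symmetric InfoNCE over pooled (phrase-level) embeddings.

    gesture_emb, phrase_emb: (B, D) L2-normalised embeddings.
    """
    logits = gesture_emb @ phrase_emb.t() / temperature           # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def local_gesture_word_loss(gesture_tokens, word_tokens, word_mask, temperature=0.07):
    """Couple each word token to its best-matching gesture token (max over time),
    then contrast the aggregated scores across the batch.

    gesture_tokens: (B, Tg, D), word_tokens: (B, Tw, D), word_mask: (B, Tw) in {0, 1}.
    """
    # Token-level similarities for every (video, sentence) pair in the batch.
    sim = torch.einsum('bgd,cwd->bcgw', gesture_tokens, word_tokens)   # (B, B, Tg, Tw)
    word_scores = sim.max(dim=2).values                                # (B, B, Tw)
    mask = word_mask.float().unsqueeze(0)                              # (1, B, Tw)
    scores = (word_scores * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    logits = scores / temperature
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def total_loss(gesture_tokens, word_tokens, word_mask, lambda_local=1.0):
    """Weighted sum of the global and local terms (weighting is an assumption)."""
    # Mean pooling to phrase-level embeddings is an assumed aggregation choice.
    g = F.normalize(gesture_tokens.mean(dim=1), dim=-1)
    wmask = word_mask.float().unsqueeze(-1)
    p = F.normalize((word_tokens * wmask).sum(1) / wmask.sum(1).clamp(min=1), dim=-1)
    gt = F.normalize(gesture_tokens, dim=-1)
    wt = F.normalize(word_tokens, dim=-1)
    return global_phrase_contrastive_loss(g, p) + \
        lambda_local * local_gesture_word_loss(gt, wt, word_mask)
```

In this sketch the global term aligns whole gesture clips with whole phrases, while the local term rewards each word being matched by at least one gesture token somewhere in the clip, which is one plausible way to obtain word-level grounding from only weak (clip-level) supervision.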