SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

January 2, 2025
Authors: Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao
cs.AI

Abstract

Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal durations (e.g., "salto backward tucked with 1 turn"). Given the high cost of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm by introducing moderate temporal perturbation. Furthermore, to handle the high uncertainty in the teacher model's predictions for FAR, we propose Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by SeFAR can greatly improve the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.
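
The abstract names two key components: moderate temporal perturbation as a strong augmentation within the Teacher-Student SSL paradigm, and Adaptive Regulation to stabilize learning under uncertain teacher predictions. Below is a minimal, illustrative PyTorch sketch of how such components could be realized; the jitter-and-resort perturbation, the confidence/entropy weighting, and all function names and hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code): moderate temporal perturbation
# as a strong augmentation, plus an uncertainty-weighted pseudo-label loss
# for a Teacher-Student semi-supervised step.
import torch
import torch.nn.functional as F


def temporal_perturb(clip: torch.Tensor, max_shift: int = 2) -> torch.Tensor:
    """Moderately perturb a clip of shape (T, C, H, W) along the time axis.

    Each frame index is jittered by at most `max_shift` positions and the
    indices are re-sorted, so local temporal order is disturbed without
    destroying the overall motion pattern (an assumed form of "moderate"
    temporal perturbation).
    """
    t = clip.shape[0]
    jitter = torch.randint(-max_shift, max_shift + 1, (t,))
    order = torch.argsort(torch.arange(t) + jitter)
    return clip[order]


def adaptive_unsup_loss(teacher_logits: torch.Tensor,
                        student_logits: torch.Tensor,
                        threshold: float = 0.7) -> torch.Tensor:
    """Down-weight pseudo-labels the teacher is uncertain about.

    Confidence masking plus an entropy-derived weight is one simple way to
    stabilize learning when teacher predictions are noisy; the paper's
    Adaptive Regulation may differ in detail.
    """
    probs = teacher_logits.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    weight = (conf >= threshold).float() * torch.exp(-entropy)
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (weight * loss).mean()


if __name__ == "__main__":
    clip = torch.randn(16, 3, 112, 112)       # a 16-frame unlabeled clip
    strong_view = temporal_perturb(clip)       # strong (temporal) augmentation
    teacher_logits = torch.randn(8, 99)        # e.g., 99 fine-grained classes
    student_logits = torch.randn(8, 99)
    print(adaptive_unsup_loss(teacher_logits, student_logits))
```

In this sketch the teacher scores a weakly augmented view to produce pseudo-labels, the student is trained on the temporally perturbed strong view, and low-confidence or high-entropy teacher predictions contribute little to the unsupervised loss.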
