SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
January 2, 2025
Authors: Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao
cs.AI
Abstract
Human action understanding is crucial for the advancement of multimodal
systems. While recent developments, driven by powerful large language models
(LLMs), aim to be general enough to cover a wide range of categories, they
often overlook the need for more specific capabilities. In this work, we
address the more challenging task of Fine-grained Action Recognition (FAR),
which focuses on detailed semantic labels within shorter temporal durations
(e.g., "salto backward tucked with 1 turn"). Given the high costs of annotating
fine-grained labels and the substantial data needed for fine-tuning LLMs, we
propose to adopt semi-supervised learning (SSL). Our framework, SeFAR,
incorporates several innovative designs to tackle these challenges.
Specifically, to capture sufficient visual details, we construct Dual-level
temporal elements as more effective representations, based on which we design a
new strong augmentation strategy for the Teacher-Student learning paradigm
by introducing moderate temporal perturbation. Furthermore, to handle the
high uncertainty within the teacher model's predictions for FAR, we propose
Adaptive Regulation to stabilize the learning process. Experiments show that
SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and
FineDiving, across various data scopes. It also outperforms other
semi-supervised methods on two classical coarse-grained datasets, UCF101 and
HMDB51. Further analysis and ablation studies validate the effectiveness of our
designs. Additionally, we show that the features extracted by SeFAR can
substantially enhance the ability of multimodal foundation models to understand
fine-grained and domain-specific semantics.
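The abstract's "moderate temporal perturbation" for strong augmentation can be illustrated with a minimal sketch. This is an assumption for illustration only, not the paper's actual implementation: one plausible way to perturb temporal order *moderately* is to shuffle frame indices only within short local segments, so global temporal structure is preserved while local order is disturbed. The function name and `num_segments` parameter are hypothetical.

```python
import random


def moderate_temporal_perturbation(frame_indices, num_segments=4, seed=None):
    """Illustrative sketch: shuffle frame indices only within local segments.

    Splitting the clip into `num_segments` pieces and shuffling inside each
    piece disturbs local temporal order while keeping the coarse ordering of
    the clip intact -- one hypothetical reading of "moderate" perturbation.
    """
    rng = random.Random(seed)
    n = len(frame_indices)
    seg_len = max(1, n // num_segments)
    perturbed = []
    for start in range(0, n, seg_len):
        segment = list(frame_indices[start:start + seg_len])
        rng.shuffle(segment)  # reorder frames only inside this local segment
        perturbed.extend(segment)
    return perturbed
```

In a teacher-student SSL setup, such a strongly augmented view would be fed to the student, while the teacher sees a weakly augmented (temporally intact) view to produce pseudo-labels.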