HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
February 28, 2025
Authors: Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie
cs.AI
Abstract
Recent Multi-modal Large Language Models (MLLMs) have made great progress in
video understanding. However, their performance on videos involving human
actions is still limited by the lack of high-quality data. To address this, we
introduce a two-stage data annotation pipeline. First, we design strategies to
accumulate videos featuring clear human actions from the Internet. Second,
videos are annotated in a standardized caption format that uses human
attributes to distinguish individuals and chronologically details their actions
and interactions. Through this pipeline, we curate two datasets, namely
HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs
generated by Gemini-Pro and verified for training purposes. Meanwhile,
HAICBench includes 500 manually annotated video-caption pairs and
1,400 QA pairs for a comprehensive evaluation of human action understanding.
Experimental results demonstrate that training with HAICTrain not only
significantly enhances human action understanding across 4 benchmarks, but
also improves text-to-video generation results. Both HAICTrain and
HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.
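The datasets are hosted on the Hugging Face Hub; below is a minimal sketch of how one might load them with the `datasets` library. The configuration and split names (`HAICTrain`, `HAICBench`, `train`, `test`) are assumptions based on the dataset names in the abstract, not confirmed by the repository card.

```python
# Minimal sketch: loading the HAIC datasets from the Hugging Face Hub.
# NOTE: the configuration/split names below are assumptions inferred from the
# dataset names in the abstract; check the repository card for the actual layout.
from datasets import load_dataset

# Training captions (assumed configuration name "HAICTrain")
haic_train = load_dataset("KuaishouHAIC/HAIC", name="HAICTrain", split="train")

# Evaluation captions and QA pairs (assumed configuration name "HAICBench")
haic_bench = load_dataset("KuaishouHAIC/HAIC", name="HAICBench", split="test")

print(haic_train)
print(haic_bench)
```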