LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
April 22, 2025
Authors: Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
cs.AI
Abstract
Recent video large language models (Video LLMs) often depend on costly human
annotations or proprietary model APIs (e.g., GPT-4o) to produce training data,
which limits their training at scale. In this paper, we explore large-scale
training for Video LLM with cheap automatic speech recognition (ASR)
transcripts. Specifically, we propose a novel streaming training approach that
densely interleaves the ASR words and video frames according to their
timestamps. Compared to previous studies of ASR-based vision-language representation, our method naturally fits the streaming characteristics of ASR, thus
enabling the model to learn temporally-aligned, fine-grained vision-language
modeling. To support the training algorithm, we introduce a data production
pipeline to process YouTube videos and their closed captions (CC, same as ASR),
resulting in the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT,
the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general
video QA performance and exhibits a new capability in real-time video
commentary. To evaluate this capability, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure free-form commentary quality.
Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B
models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even when working in real-time mode. Meanwhile, it achieves state-of-the-art results at
the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench,
demonstrating the broad generalizability of our approach. All resources of this
paper have been released at https://showlab.github.io/livecc.
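As a rough illustration of the streaming training idea described in the abstract, the Python sketch below interleaves ASR words with video-frame placeholders by timestamp. It is a minimal, hypothetical example, not the authors' released code: the function name, the (timestamp, word) ASR format, the <frame_i> placeholders, and the 2 FPS sampling rate are all illustrative assumptions.

# Minimal sketch (assumptions noted above) of interleaving ASR words and frames by timestamp.
from typing import List, Tuple

def interleave_asr_and_frames(
    asr_words: List[Tuple[float, str]],   # (timestamp_sec, word) pairs, hypothetical format
    num_frames: int,
    fps: float = 2.0,                      # assumed frame-sampling rate
) -> List[str]:
    """Return a token-like sequence, e.g. ['<frame_0>', 'welcome', 'to', '<frame_1>', ...]."""
    sequence: List[str] = []
    word_idx = 0
    for frame_idx in range(num_frames):
        frame_end = (frame_idx + 1) / fps          # end of this frame's time interval
        sequence.append(f"<frame_{frame_idx}>")    # placeholder standing in for visual tokens
        # Append all ASR words spoken before the next frame boundary.
        while word_idx < len(asr_words) and asr_words[word_idx][0] < frame_end:
            sequence.append(asr_words[word_idx][1])
            word_idx += 1
    # Keep any words spoken after the last sampled frame.
    sequence.extend(word for _, word in asr_words[word_idx:])
    return sequence

if __name__ == "__main__":
    words = [(0.1, "welcome"), (0.4, "to"), (0.9, "the"), (1.2, "match")]
    print(interleave_asr_and_frames(words, num_frames=3, fps=2.0))
    # ['<frame_0>', 'welcome', 'to', '<frame_1>', 'the', '<frame_2>', 'match']

The resulting sequence keeps words temporally adjacent to the frames during which they were spoken, which is the property the paper's streaming training objective relies on for time-aligned, fine-grained vision-language modeling.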