LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
April 22, 2025
Authors: Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
cs.AI
Abstract
Recent video large language models (Video LLMs) often depend on costly human
annotations or proprietary model APIs (e.g., GPT-4o) to produce training data,
which limits their training at scale. In this paper, we explore large-scale
training for Video LLM with cheap automatic speech recognition (ASR)
transcripts. Specifically, we propose a novel streaming training approach that
densely interleaves the ASR words and video frames according to their
timestamps. Compared to previous studies in vision-language representation with
ASR, our method naturally fits the streaming characteristics of ASR, thus
enabling the model to learn temporally-aligned, fine-grained vision-language
modeling. To support the training algorithm, we introduce a data production
pipeline to process YouTube videos and their closed captions (CC, same as ASR),
resulting in the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset
for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT,
the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general
video QA performance and exhibits a new capability in real-time video
commentary. To evaluate this, we carefully design a new LiveSports-3K
benchmark, using LLM-as-a-judge to measure the free-form commentary.
Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B
models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even
working in a real-time mode. Meanwhile, it achieves state-of-the-art results at
the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench,
demonstrating the broad generalizability of our approach. All resources of this
paper have been released at https://showlab.github.io/livecc.
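
To make the interleaving idea concrete, below is a minimal, illustrative Python sketch of how ASR words could be densely interleaved with sampled video frames according to their timestamps. The `ASRWord` structure, the function name, and the rule of attaching to each frame the words that finish speaking before its timestamp are assumptions for illustration only, not the authors' released implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class ASRWord:
    text: str
    start: float  # word start time in seconds
    end: float    # word end time in seconds

def interleave_words_and_frames(
    words: List[ASRWord],
    frame_timestamps: List[float],  # timestamp (seconds) of each sampled frame
) -> List[Union[Tuple[str, int], Tuple[str, str]]]:
    """Build a streaming sequence that alternates sampled frames with the
    ASR words spoken up to each frame's timestamp, preserving temporal order."""
    sequence: List[Union[Tuple[str, int], Tuple[str, str]]] = []
    word_idx = 0
    for frame_idx, t in enumerate(frame_timestamps):
        sequence.append(("frame", frame_idx))
        # Attach every word that has finished by this frame's timestamp.
        while word_idx < len(words) and words[word_idx].end <= t:
            sequence.append(("word", words[word_idx].text))
            word_idx += 1
    # Any words spoken after the last sampled frame go at the end.
    for w in words[word_idx:]:
        sequence.append(("word", w.text))
    return sequence

# Hypothetical example: two frames at 0.5 s and 1.0 s, three transcribed words.
if __name__ == "__main__":
    words = [
        ASRWord("the", 0.10, 0.30),
        ASRWord("player", 0.35, 0.70),
        ASRWord("shoots", 0.75, 1.05),
    ]
    print(interleave_words_and_frames(words, [0.5, 1.0]))
    # [('frame', 0), ('word', 'the'), ('frame', 1), ('word', 'player'), ('word', 'shoots')]
```

In actual training data, the frame entries would presumably become visual tokens and the words text tokens, so that each word is predicted conditioned only on the frames and words that precede it in time, which is what enables the temporally-aligned, streaming behavior described above.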