Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
January 14, 2025
Authors: Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin
cs.AI
Abstract
We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed to generate detailed and accurate video descriptions while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advances through three key upgrades: (1) scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) performing fine-grained temporal alignment during supervised fine-tuning; and (3) using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, on detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and by 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question answering, video grounding, hallucination tests, and embodied question answering, demonstrating its versatility as a robust generalist vision-language model.
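
For context on upgrade (3), the following is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) that such preference training typically optimizes; the abstract does not specify Tarsier2's exact formulation, so this is the generic form rather than the paper's own:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $x$ is the video-plus-prompt input, $y_w$ and $y_l$ are the preferred and dispreferred responses (in Tarsier2's case, obtained via model-based sampling), $\pi_{\mathrm{ref}}$ is the frozen reference policy (typically the supervised fine-tuned model), $\sigma$ is the sigmoid, and $\beta$ controls how far $\pi_\theta$ may deviate from the reference. These symbol roles follow the standard DPO setup, not details confirmed by the abstract.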