Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
January 14, 2025
Authors: Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin
cs.AI
Abstract
We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM)
designed for generating detailed and accurate video descriptions, while also
exhibiting superior general video understanding capabilities. Tarsier2 achieves
significant advancements through three key upgrades: (1) Scaling pre-training
data from 11M to 40M video-text pairs, enriching both volume and diversity; (2)
Performing fine-grained temporal alignment during supervised fine-tuning; (3)
Using model-based sampling to automatically construct preference data and
applying DPO training for optimization. Extensive experiments show that
Tarsier2-7B consistently outperforms leading proprietary models, including
GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K
benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over
Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6%
performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B
also sets new state-of-the-art results across 15 public benchmarks, spanning
tasks such as video question-answering, video grounding, hallucination testing,
and embodied question-answering, demonstrating its versatility as a robust
generalist vision-language model.
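The abstract names DPO (Direct Preference Optimization) training as the third upgrade but gives no formula or hyperparameters. As a rough illustration only, the standard per-pair DPO objective can be sketched as below. The function names, the scalar log-probability inputs, and the value `beta=0.1` are assumptions made for this example and are not taken from the paper; the authors' actual training setup may differ.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) description pair.

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)
```

When the policy and reference agree on both responses, the loss equals log 2; it drops below that as the policy shifts probability toward the preferred description relative to the reference, and rises above it when the policy shifts toward the rejected one.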