Video Instruction Tuning With Synthetic Data
October 3, 2024
Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
cs.AI
Abstract
The development of video large multimodal models (LMMs) has been hindered by
the difficulty of curating large amounts of high-quality raw data from the web.
To address this, we propose an alternative approach by creating a high-quality
synthetic dataset specifically for video instruction-following, namely
LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning,
open-ended question-answering (QA), and multiple-choice QA. By training on this
dataset, in combination with existing visual instruction tuning data, we
introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that
LLaVA-Video achieves strong performance across various video benchmarks,
highlighting the effectiveness of our dataset. We plan to release the dataset,
its generation pipeline, and the model checkpoints.
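The abstract names three instruction-following task types in LLaVA-Video-178K: detailed captioning, open-ended QA, and multiple-choice QA. A minimal sketch of what such samples might look like is shown below; the field names, file names, and structure are illustrative assumptions, not the released dataset's actual schema.

```python
# Hypothetical video instruction-tuning samples for the three task types
# described in the abstract. All field names and values are assumptions,
# not the real LLaVA-Video-178K schema.
samples = [
    {
        "video": "clip_0001.mp4",
        "task": "detailed_caption",
        "instruction": "Describe the video in detail.",
        "response": "A person enters a kitchen, opens the fridge, and pours a glass of juice.",
    },
    {
        "video": "clip_0002.mp4",
        "task": "open_ended_qa",
        "instruction": "What object is the person holding?",
        "response": "A red umbrella.",
    },
    {
        "video": "clip_0003.mp4",
        "task": "multiple_choice_qa",
        "instruction": "What happens at the end? (A) The dog runs away. (B) The dog sits down. (C) The dog barks.",
        "response": "B",
    },
]

# A training pipeline would iterate over such records, pairing each video's
# sampled frames with the instruction/response text for supervised tuning.
task_types = sorted({s["task"] for s in samples})
print(task_types)
```

Records like these would typically be stored one-per-line (e.g. JSONL) and mixed with existing visual instruction tuning data during training, as the abstract describes.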