2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
January 1, 2025
Authors: Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
cs.AI
Abstract
Compared to image-text pair data, interleaved corpora enable Vision-Language
Models (VLMs) to understand the world more naturally, as humans do. However, such
existing datasets are crawled from webpages and face challenges such as low
knowledge density, loose image-text relations, and poor logical coherence
between images. On the other hand, the internet hosts a vast number of instructional videos
(e.g., online geometry courses) that are widely used by humans to learn
foundational subjects, yet these valuable resources remain underexplored in VLM
training. In this paper, we introduce a high-quality multimodal
textbook corpus with richer foundational knowledge for VLM pretraining. It
collects over 2.5 years of instructional videos, totaling 22,000 class hours.
We first use an LLM-proposed taxonomy to systematically gather instructional
videos. Then we progressively extract and refine visual (keyframes), audio
(ASR), and textual knowledge (OCR) from the videos, and organize them into an
image-text interleaved corpus in temporal order. Compared to its
counterparts, our video-centric textbook offers more coherent context, richer
knowledge, and better image-text alignment. Experiments demonstrate its superb
pretraining performance, particularly in knowledge- and reasoning-intensive
tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook
exhibit outstanding interleaved context awareness, leveraging visual and
textual cues in their few-shot context for task solving. Our code is
available at \url{https://github.com/DAMO-NLP-SG/multimodal_textbook}.
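
The abstract describes extracting keyframes, ASR transcripts, and OCR text from each video and arranging them into an image-text interleaved sample in temporal order. The sketch below illustrates one plausible way such a merge could be implemented; the class names, timestamp fields, and placeholder handling are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal, hypothetical sketch: merge video-derived modalities into a
# temporally ordered image-text interleaved sample. Names and fields are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Keyframe:
    timestamp: float   # seconds into the video where the frame was extracted
    image_path: str    # path to the saved keyframe image


@dataclass
class TextSegment:
    timestamp: float   # start time of the ASR/OCR segment
    text: str          # transcribed speech or recognized on-screen text


def build_interleaved_sample(
    keyframes: List[Keyframe],
    segments: List[TextSegment],
) -> List[Union[str, Keyframe]]:
    """Merge keyframes and text segments by timestamp into one interleaved sequence."""
    events = [(kf.timestamp, "image", kf) for kf in keyframes] + \
             [(seg.timestamp, "text", seg) for seg in segments]
    events.sort(key=lambda e: e[0])  # temporal order, as described in the abstract

    sample: List[Union[str, Keyframe]] = []
    for _, kind, item in events:
        if kind == "image":
            sample.append(item)       # later rendered as an image placeholder token
        else:
            sample.append(item.text)  # ASR/OCR text follows its preceding frame
    return sample


if __name__ == "__main__":
    frames = [Keyframe(3.0, "lesson01/frame_003.jpg"),
              Keyframe(12.5, "lesson01/frame_012.jpg")]
    texts = [TextSegment(2.0, "Today we study the Pythagorean theorem."),
             TextSegment(13.0, "a squared plus b squared equals c squared.")]
    for item in build_interleaved_sample(frames, texts):
        print(item)
```

In this toy setup, each keyframe is followed by the speech and on-screen text nearest to it in time, which is the property that gives the corpus its claimed image-text alignment and coherent context.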