2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
January 1, 2025
Authors: Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
cs.AI
Abstract
Compared to image-text pair data, interleaved corpora enable Vision-Language
Models (VLMs) to understand the world more naturally, as humans do. However, such
existing datasets are crawled from webpages and face challenges such as low
knowledge density, loose image-text relations, and poor logical coherence
between images. On the other hand, the internet hosts a vast number of instructional videos
(e.g., online geometry courses) that are widely used by humans to learn
foundational subjects, yet these valuable resources remain underexplored in VLM
training. In this paper, we introduce a high-quality multimodal
textbook corpus with richer foundational knowledge for VLM pretraining. It
collects over 2.5 years of instructional videos, totaling 22,000 class hours.
We first use an LLM-proposed taxonomy to systematically gather instructional
videos. Then we progressively extract and refine visual (keyframes), audio
(ASR), and textual knowledge (OCR) from the videos, and organize them into an
image-text interleaved corpus in temporal order. Compared to its
counterparts, our video-centric textbook offers more coherent context, richer
knowledge, and better image-text alignment. Experiments demonstrate its superb
pretraining performance, particularly in knowledge- and reasoning-intensive
tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook
exhibit outstanding interleaved context awareness, leveraging visual and
textual cues in their few-shot context for task solving. Our code is
available at \url{https://github.com/DAMO-NLP-SG/multimodal_textbook}.
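
The abstract describes extracting keyframes, ASR transcripts, and OCR text from each video and arranging them into an image-text interleaved sample in temporal order. The sketch below illustrates one plausible way such a merge could be implemented; the class names, timestamp fields, and placeholder handling are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal, hypothetical sketch: merge video-derived modalities into a
# temporally ordered image-text interleaved sample. Names and fields are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Keyframe:
    timestamp: float   # seconds into the video where the frame was extracted
    image_path: str    # path to the saved keyframe image


@dataclass
class TextSegment:
    timestamp: float   # start time of the ASR/OCR segment
    text: str          # transcribed speech or recognized on-screen text


def build_interleaved_sample(
    keyframes: List[Keyframe],
    segments: List[TextSegment],
) -> List[Union[str, Keyframe]]:
    """Merge keyframes and text segments by timestamp into one interleaved sequence."""
    events = [(kf.timestamp, "image", kf) for kf in keyframes] + \
             [(seg.timestamp, "text", seg) for seg in segments]
    events.sort(key=lambda e: e[0])  # temporal order, as described in the abstract

    sample: List[Union[str, Keyframe]] = []
    for _, kind, item in events:
        if kind == "image":
            sample.append(item)       # later rendered as an image placeholder token
        else:
            sample.append(item.text)  # ASR/OCR text follows its preceding frame
    return sample


if __name__ == "__main__":
    frames = [Keyframe(3.0, "lesson01/frame_003.jpg"),
              Keyframe(12.5, "lesson01/frame_012.jpg")]
    texts = [TextSegment(2.0, "Today we study the Pythagorean theorem."),
             TextSegment(13.0, "a squared plus b squared equals c squared.")]
    for item in build_interleaved_sample(frames, texts):
        print(item)
```

In this toy setup, each keyframe is followed by the speech and on-screen text nearest to it in time, which is the property that gives the corpus its claimed image-text alignment and coherent context.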