
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

January 1, 2025
Authors: Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
cs.AI

Abstract

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, much as humans do. However, such existing datasets are crawled from webpages and face challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos and organize it as an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at \url{https://github.com/DAMO-NLP-SG/multimodal_textbook}.
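To make the corpus-construction step in the abstract concrete, below is a minimal sketch of how extracted keyframes, ASR transcripts, and OCR text from a single lecture video could be merged into one temporally ordered image-text interleaved sample. All data structures and the `build_interleaved_sample` helper are illustrative assumptions, not the paper's released code.

```python
# Sketch: assemble an image-text interleaved sample from one instructional video,
# assuming keyframes, ASR segments, and OCR results are already extracted and
# timestamped upstream. Structures here are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Keyframe:
    timestamp: float   # seconds into the video
    image_path: str    # path to the extracted keyframe image

@dataclass
class TextSegment:
    timestamp: float   # start time of the ASR or OCR segment
    text: str          # transcribed speech or on-slide text

def build_interleaved_sample(
    keyframes: List[Keyframe],
    asr_segments: List[TextSegment],
    ocr_segments: List[TextSegment],
) -> List[Union[str, dict]]:
    """Merge visual and textual knowledge into one temporally ordered sequence."""
    events = (
        [("image", kf.timestamp, kf.image_path) for kf in keyframes]
        + [("text", seg.timestamp, seg.text) for seg in asr_segments + ocr_segments]
    )
    # Order every extracted element by its position on the lecture timeline.
    events.sort(key=lambda e: e[1])

    sample: List[Union[str, dict]] = []
    for kind, _, payload in events:
        if kind == "image":
            sample.append({"image": payload})  # placeholder for an <image> slot
        else:
            sample.append(payload)             # interleaved text chunk
    return sample
```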
