2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Abstract

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing interleaved datasets are crawled from webpages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. Meanwhile, the internet hosts a vast number of instructional videos (e.g., online geometry courses) that people widely use to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus that provides richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. We then progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it into an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context to solve tasks. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
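The abstract's last pipeline step, ordering keyframes and extracted text by time to form an image-text interleaved sample, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Keyframe` and `TextSegment` types, the `interleave` function, and the output dictionary layout are all hypothetical names chosen for illustration, and the sketch assumes each keyframe and each ASR/OCR segment already carries a timestamp in seconds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Keyframe:
    """Hypothetical record for an extracted keyframe."""
    timestamp: float  # seconds from the start of the video
    image_path: str


@dataclass
class TextSegment:
    """Hypothetical record for an ASR or OCR text span."""
    timestamp: float  # seconds from the start of the video
    text: str
    source: str  # "asr" or "ocr"


def interleave(keyframes: List[Keyframe],
               segments: List[TextSegment]) -> List[Dict[str, str]]:
    """Merge keyframes and text segments into one time-ordered sequence.

    Ties are broken so that an image precedes text with the same timestamp.
    Consecutive text segments are concatenated into a single text item.
    """
    items = list(keyframes) + list(segments)
    items.sort(key=lambda it: (it.timestamp, 0 if isinstance(it, Keyframe) else 1))

    sample: List[Dict[str, str]] = []
    for item in items:
        if isinstance(item, Keyframe):
            sample.append({"type": "image", "path": item.image_path})
        elif sample and sample[-1]["type"] == "text":
            # Extend the previous text item instead of starting a new one.
            sample[-1]["text"] += " " + item.text
        else:
            sample.append({"type": "text", "text": item.text})
    return sample


if __name__ == "__main__":
    frames = [Keyframe(0.0, "f0.jpg"), Keyframe(10.0, "f1.jpg")]
    texts = [
        TextSegment(1.0, "Consider a triangle.", "asr"),
        TextSegment(2.0, "Its angles sum to 180 degrees.", "asr"),
        TextSegment(11.0, "a + b + c = 180", "ocr"),
    ]
    for entry in interleave(frames, texts):
        print(entry)
```

Sorting on a `(timestamp, kind)` key keeps the ordering deterministic when a keyframe and a text span share a timestamp. Any refinement or filtering of the ASR and OCR text would happen before this step and is out of scope here.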
