
Scaling Pre-training to One Hundred Billion Data for Vision Language Models

February 11, 2025
Authors: Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai
cs.AI

Abstract

We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks involving cultural diversity achieve more substantial gains from the 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pre-training dataset via quality filters such as CLIP, typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.
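The quality filtering the abstract refers to is commonly implemented by scoring each image-text pair with a CLIP model and discarding pairs below a similarity threshold. The sketch below illustrates that idea only; the checkpoint name (openai/clip-vit-base-patch32) and the 0.25 threshold are illustrative assumptions, not the paper's actual pipeline or settings.

```python
# Minimal sketch of CLIP-score quality filtering: keep image-text pairs whose
# cosine similarity between CLIP image and text embeddings exceeds a threshold.
# The checkpoint and threshold below are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_filter(images: list[Image.Image], captions: list[str], threshold: float = 0.25):
    """Return indices of pairs passing the CLIP-similarity threshold, plus all scores."""
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings and take the per-pair cosine similarity.
    img = torch.nn.functional.normalize(out.image_embeds, dim=-1)
    txt = torch.nn.functional.normalize(out.text_embeds, dim=-1)
    scores = (img * txt).sum(dim=-1)
    keep = (scores >= threshold).tolist()
    return [i for i, k in enumerate(keep) if k], scores.tolist()
```

The paper's observation is that such a threshold, while raising average caption quality, can disproportionately drop long-tail, culturally diverse examples even when the raw dataset is very large.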
