RedPajama: an Open Dataset for Training Large Language Models

November 19, 2024
Authors: Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, Ce Zhang
cs.AI

Abstract

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains, and their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen, and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
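The abstract highlights that RedPajama-V2 ships raw web text together with per-document quality signals intended for downstream filtering. The sketch below shows one way such threshold-based curation might look in Python; the signal names, thresholds, and document layout are illustrative assumptions for this example, not the dataset's actual schema.

```python
# Minimal sketch: filter web documents by per-document quality signals.
# Signal names, thresholds, and the document structure below are assumptions
# chosen for illustration; consult the RedPajama-V2 docs for the real schema.

from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical filter rules: signal name -> (min allowed, max allowed).
FILTER_RULES: Dict[str, Tuple[float, float]] = {
    "word_count": (50, 100_000),       # drop very short or very long pages
    "frac_unique_words": (0.1, 1.0),   # drop highly repetitive pages
    "perplexity": (0.0, 1_000.0),      # drop pages a reference LM finds implausible
}


def passes_filters(quality_signals: Dict[str, float]) -> bool:
    """Return True if every configured signal falls inside its allowed range."""
    for name, (lo, hi) in FILTER_RULES.items():
        value = quality_signals.get(name)
        if value is None or not (lo <= value <= hi):
            return False
    return True


def curate(documents: Iterable[dict]) -> Iterator[dict]:
    """Yield only documents whose quality signals satisfy all filter rules."""
    for doc in documents:
        if passes_filters(doc.get("quality_signals", {})):
            yield doc


if __name__ == "__main__":
    sample = [
        {"text": "a short spammy page",
         "quality_signals": {"word_count": 4, "frac_unique_words": 1.0, "perplexity": 50.0}},
        {"text": "a longer, coherent article ...",
         "quality_signals": {"word_count": 820, "frac_unique_words": 0.6, "perplexity": 310.0}},
    ]
    kept = list(curate(sample))
    print(f"kept {len(kept)} of {len(sample)} documents")
```

In practice the thresholds themselves are the experimental knob: the paper's ablations compare model quality across subsets curated with different filter settings rather than prescribing a single rule set.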
