
RedPajama: an Open Dataset for Training Large Language Models

November 19, 2024
Authors: Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, Ce Zhang
cs.AI

Abstract

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
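The abstract notes that RedPajama-V2 pairs raw web text with per-document quality signals that can be used to curate higher-quality training subsets. The sketch below illustrates the general pattern of threshold-based filtering on such signals; it is a minimal, hypothetical example, and the signal names (word_count, duplicate_line_fraction, perplexity) and thresholds shown are illustrative placeholders rather than the dataset's actual schema, which is documented on the RedPajama-V2 dataset card.

```python
# Minimal sketch of quality-signal-based filtering (illustrative only).
# Signal names and thresholds are hypothetical stand-ins for RedPajama-V2's
# per-document quality signals; consult the dataset card for real field names.
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical (min, max) allowed value per signal.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "word_count": (50, 100_000),             # drop very short or very long pages
    "duplicate_line_fraction": (0.0, 0.3),   # drop highly repetitive pages
    "perplexity": (0.0, 1500.0),             # drop pages a reference LM finds unnatural
}

def passes_filters(quality_signals: Dict[str, float]) -> bool:
    """Return True if every available signal falls inside its allowed range."""
    for name, (lo, hi) in THRESHOLDS.items():
        value = quality_signals.get(name)
        if value is not None and not (lo <= value <= hi):
            return False
    return True

def filter_documents(docs: Iterable[dict]) -> Iterator[dict]:
    """Yield only documents whose quality signals pass all thresholds."""
    for doc in docs:
        if passes_filters(doc.get("quality_signals", {})):
            yield doc

if __name__ == "__main__":
    sample = [
        {"text": "A well-formed article ...",
         "quality_signals": {"word_count": 800, "duplicate_line_fraction": 0.05, "perplexity": 300.0}},
        {"text": "buy buy buy buy buy",
         "quality_signals": {"word_count": 5, "duplicate_line_fraction": 0.9, "perplexity": 4000.0}},
    ]
    kept = list(filter_documents(sample))
    print(f"kept {len(kept)} of {len(sample)} documents")
```

In practice, the thresholds themselves are the object of study: the paper's ablations compare subsets curated with different signal combinations, so a pipeline like this would typically be parameterized and swept rather than fixed.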
