

Toxicity of the Commons: Curating Open-Source Pre-Training Data

October 29, 2024
作者: Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais
cs.AI

Abstract

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight model creators. At the same time, researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts that have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, and ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently and at a larger scale. Finally, we describe a balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.
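The filtering stage described in the abstract can be pictured as scoring each document along the five dimensions and keeping only documents below a severity threshold on every axis. The sketch below is a hypothetical illustration of that idea; Celadon's actual interface, score scale, and thresholds are not specified here, so the `classify` function, axis names, and `threshold` value are illustrative assumptions.

```python
# Hypothetical sketch of multi-axis toxicity filtering, as described in the
# abstract. The real Celadon classifier's API and score scale may differ;
# everything below is illustrative.

AXES = ["race/origin", "gender/sex", "religion", "ability", "violence"]

def filter_corpus(docs, classify, threshold=2):
    """Split `docs` into kept and flagged sets.

    `classify(text)` is assumed to return a dict mapping each axis in AXES
    to an integer severity score (e.g. 0-3). A document is kept only if its
    score on every axis stays below `threshold`.
    """
    kept, flagged = [], []
    for doc in docs:
        scores = classify(doc)
        if max(scores[axis] for axis in AXES) < threshold:
            kept.append(doc)
        else:
            flagged.append((doc, scores))
    return kept, flagged

# Toy stand-in classifier, for demonstration only.
def toy_classifier(text):
    severity = 3 if "harmful" in text else 0
    return {axis: severity for axis in AXES}

kept, flagged = filter_corpus(
    ["a neutral historical passage", "a harmful passage"], toy_classifier
)
```

A balanced filtration policy, as the abstract suggests, would tune such thresholds per axis to trade safety against how much training data survives filtering.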

