
Toxicity of the Commons: Curating Open-Source Pre-Training Data

October 29, 2024
作者: Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais
cs.AI

摘要

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight model creators. At the same time, researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, composed of texts classified across five different dimensions (racial/origin-based, gender/sex-based, religious, and ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, which can detect toxic content in open data more efficiently and at a larger scale. Finally, we describe a balanced approach to content filtering that optimizes safety with respect to the amount of filtered data that remains available for training.
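The balanced filtering step described above can be illustrated with a minimal sketch. The five dimension names follow the abstract; the integer severity scale, the threshold values, and the keep/flag/drop decision rule are illustrative assumptions, not the paper's actual parameters:

```python
# Hypothetical sketch of per-dimension toxicity filtering in the spirit of
# the balanced approach described in the abstract. Dimension names follow
# the paper; the scoring scale and thresholds are illustrative assumptions.

DIMENSIONS = ["race_origin", "gender_sex", "religion", "ability", "violence"]

def filter_decision(scores, per_dim_max=2, total_max=4):
    """Decide whether a document is kept, flagged, or dropped.

    scores: dict mapping each dimension to an integer severity (0 = none).
    A single highly toxic dimension or a high combined score triggers
    removal; mildly toxic documents are kept but flagged, so the training
    corpus is not over-filtered.
    """
    total = sum(scores[d] for d in DIMENSIONS)
    if max(scores[d] for d in DIMENSIONS) > per_dim_max or total > total_max:
        return "drop"
    if total > 0:
        return "flag"  # keep, but mark for review or rewriting
    return "keep"

# Toy per-dimension scores for three example documents.
docs = {
    "clean":  {d: 0 for d in DIMENSIONS},
    "mild":   {"race_origin": 1, "gender_sex": 0, "religion": 0,
               "ability": 0, "violence": 1},
    "severe": {"race_origin": 3, "gender_sex": 0, "religion": 0,
               "ability": 0, "violence": 2},
}
decisions = {name: filter_decision(s) for name, s in docs.items()}
print(decisions)  # → {'clean': 'keep', 'mild': 'flag', 'severe': 'drop'}
```

The design choice this sketch highlights is the trade-off the abstract names: aggressive thresholds remove more toxic text but shrink the pool of public domain data left for pre-training, so the thresholds are tuned jointly rather than set to zero tolerance.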

