Toxicity of the Commons: Curating Open-Source Pre-Training Data

Abstract

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice that the creators of the leading open-weight models have yet to adopt. At the same time, researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. Working with public domain data poses unique challenges, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. First, we create a custom training dataset, ToxicCommons, composed of texts that have been classified across five dimensions (racial/origin-based, gender/sex-based, religious, and ability-based discrimination, and violence). Second, we use this dataset to train a custom classifier, Celadon, that can detect toxic content in open data more efficiently and at a larger scale. Finally, we describe a balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.
