AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

Abstract

Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available at https://github.com/AfriHate/AfriHate.

