AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
January 14, 2025
Authors: Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Saminu Mohammad Aliyu, Nelson Odhiambo Onyango, Lilian D. A. Wanzare, Samuel Rutunda, Lukman Jibril Aliyu, Esubalew Alemneh, Oumaima Hourrane, Hagos Tesfahun Gebremichael, Elyas Abdi Ismail, Meriem Beloucif, Ebrahim Chekol Jibril, Andiswa Bukula, Rooweither Mabuya, Salomey Osei, Abigail Oppong, Tadesse Destaw Belay, Tadesse Kebede Guge, Tesfa Tegegne Asfaw, Chiamaka Ijeoma Chukwuneke, Paul Röttger, Seid Muhie Yimam, Nedjma Ousidhoum
cs.AI
Abstract
Hate speech and abusive language are global phenomena that need
socio-cultural background knowledge to be understood, identified, and
moderated. However, in many regions of the Global South, there have been
several documented occurrences of (1) absence of moderation and (2) censorship
due to the reliance on keyword spotting out of context. Further, high-profile
individuals have frequently been at the center of the moderation process, while
large and targeted hate speech campaigns against minorities have been
overlooked. These limitations are mainly due to the lack of high-quality data
in the local languages and the failure to include local communities in the
collection, annotation, and moderation processes. To address this issue, we
present AfriHate: a multilingual collection of hate speech and abusive language
datasets in 15 African languages. Each instance in AfriHate is annotated by
native speakers familiar with the local culture. We report the challenges
related to the construction of the datasets and present various classification
baseline results with and without using LLMs. The datasets, individual
annotations, and hate speech and offensive language lexicons are available on
https://github.com/AfriHate/AfriHate.