AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

January 14, 2025
Authors: Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Saminu Mohammad Aliyu, Nelson Odhiambo Onyango, Lilian D. A. Wanzare, Samuel Rutunda, Lukman Jibril Aliyu, Esubalew Alemneh, Oumaima Hourrane, Hagos Tesfahun Gebremichael, Elyas Abdi Ismail, Meriem Beloucif, Ebrahim Chekol Jibril, Andiswa Bukula, Rooweither Mabuya, Salomey Osei, Abigail Oppong, Tadesse Destaw Belay, Tadesse Kebede Guge, Tesfa Tegegne Asfaw, Chiamaka Ijeoma Chukwuneke, Paul Röttger, Seid Muhie Yimam, Nedjma Ousidhoum
cs.AI

Abstract

Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available at https://github.com/AfriHate/AfriHate.
