

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

December 4, 2024
作者: Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, Sara Hooker
cs.AI

Abstract

Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning or clarity of questions in the target language. A common practice in multilingual evaluation is to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to address these challenges. In this work, we trace the impact of both of these issues on multilingual evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open and proprietary models illustrates that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. Moreover, for questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions. Rankings of model evaluations change depending on whether they are evaluated on the full portion or the subset of questions annotated as culturally sensitive, showing the distortion to model rankings when blindly relying on translated MMLU. We release Global-MMLU, an improved MMLU with evaluation coverage across 42 languages -- with improved overall quality by engaging with compensated professional and community annotators to verify translation quality while also rigorously evaluating cultural biases present in the original dataset. This comprehensive Global-MMLU set also includes designated subsets labeled as culturally sensitive and culturally agnostic to allow for more holistic, complete evaluation.
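Below is a minimal sketch of how the released culturally sensitive and culturally agnostic subsets could be used to report separate accuracy figures, in the spirit of the holistic evaluation the abstract describes. It assumes the dataset is hosted on the Hugging Face Hub under an identifier like "CohereForAI/Global-MMLU" with one config per language code, and that column names such as "cultural_sensitivity_label" and "answer" exist; these identifiers are illustrative assumptions, not a confirmed schema.

```python
# Hypothetical sketch: split Global-MMLU accuracy by cultural-sensitivity label.
# Dataset ID, config names, and column names are assumptions for illustration.
from datasets import load_dataset


def subset_accuracies(predict, language="en"):
    """Compute accuracy on culturally sensitive vs. culturally agnostic questions.

    predict: a callable that takes one question row (dict) and returns the
    predicted answer in the same format as the dataset's "answer" column.
    """
    # Assumed layout: one config per language code, a "test" split, and a
    # per-question label distinguishing culturally sensitive questions.
    ds = load_dataset("CohereForAI/Global-MMLU", language, split="test")

    tallies = {"culturally_sensitive": [0, 0], "culturally_agnostic": [0, 0]}
    for row in ds:
        label = row.get("cultural_sensitivity_label", "culturally_agnostic")
        key = ("culturally_sensitive"
               if label == "culturally_sensitive"
               else "culturally_agnostic")
        correct, total = tallies[key]
        tallies[key] = [correct + int(predict(row) == row["answer"]), total + 1]

    return {k: c / t for k, (c, t) in tallies.items() if t > 0}
```

Reporting the two numbers side by side, rather than a single aggregate score, is what surfaces the ranking shifts the paper observes between the full question set and the culturally sensitive subset.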
