Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
November 11, 2024
Authors: Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng
cs.AI
Abstract
New LLM evaluation benchmarks are important to keep pace with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. Chinese SimpleQA has five main properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language, covering 6 major topics and 99 diverse subtopics. Second, we conduct a comprehensive quality-control process to obtain high-quality questions and answers, whose reference answers are static and do not change over time. Third, following SimpleQA, the questions and answers are very short, and the grading process is easy to run via the OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA can guide developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.