Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Abstract

New evaluation benchmarks are needed to keep pace with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. Chinese SimpleQA has five main properties: Chinese, diverse, high-quality, static, and easy to evaluate. First, it focuses on the Chinese language and covers 6 major topics with 99 diverse subtopics. Second, a comprehensive quality control process ensures high-quality questions and answers, and the reference answers are static, meaning they do not change over time. Third, following SimpleQA, the questions and answers are very short, so grading is straightforward and is carried out through the OpenAI API. Using Chinese SimpleQA, we perform a comprehensive evaluation of the factuality of existing LLMs. We hope that Chinese SimpleQA helps developers better understand the Chinese factuality of their models and facilitates the development of foundation models.
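To make the evaluation setup concrete, the following is a minimal sketch of scoring a short-answer benchmark by topic. It is not the authors' implementation: the `QAItem` fields, the function names, and the toy items are all hypothetical, and the exact-match grader is a stand-in assumption for the OpenAI API-based grading that the abstract describes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAItem:
    """One benchmark entry (hypothetical schema, not the released format)."""
    topic: str      # one of the benchmark's major topics
    question: str   # short question
    reference: str  # short, static reference answer


def exact_match_grader(prediction: str, reference: str) -> bool:
    """Stand-in grader (an assumption): case-insensitive string equality.

    The paper grades answers with an LLM through the OpenAI API; any
    callable with this signature could be swapped in here.
    """
    return prediction.strip().lower() == reference.strip().lower()


def accuracy_by_topic(
    items: List[QAItem],
    predictions: List[str],
    grader: Callable[[str, str], bool] = exact_match_grader,
) -> Dict[str, float]:
    """Return the fraction of correctly answered items for each topic."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, prediction in zip(items, predictions):
        total[item.topic] += 1
        if grader(prediction, item.reference):
            correct[item.topic] += 1
    return {topic: correct[topic] / total[topic] for topic in total}


if __name__ == "__main__":
    # Toy items made up for illustration; they are not from the benchmark.
    toy_items = [
        QAItem("geography", "What is the capital of France?", "Paris"),
        QAItem("geography", "What is the capital of China?", "Beijing"),
        QAItem("science", "What is the chemical formula of water?", "H2O"),
    ]
    toy_predictions = [" paris ", "Shanghai", "h2o"]
    print(accuracy_by_topic(toy_items, toy_predictions))
```

Because the questions and reference answers are short and static, the grader only has to compare two brief strings, which is what keeps the evaluation step simple regardless of which grader is plugged in.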
