Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Abstract

New evaluation benchmarks are needed to keep pace with the rapid development of large language models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. Chinese SimpleQA has five main properties: it is Chinese, diverse, high-quality, static, and easy to evaluate. First, it focuses on the Chinese language, covering 6 major topics with 99 diverse subtopics. Second, a comprehensive quality control process ensures high-quality questions and answers, and the reference answers are static, meaning they do not change over time. Third, following SimpleQA, the questions and answers are very short, and grading is easy to carry out with the OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factuality of existing LLMs. Finally, we hope that Chinese SimpleQA can help developers better understand the Chinese factuality of their models and facilitate the growth of foundation models.
