AAAR-1.0: Assessing AI's Potential to Assist Research

Abstract

Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) REVIEWCRITIQUE, identifying whether each segment in human reviews is deficient. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as their limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.

