SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. The benchmark employs a novel Human-LLM collaborative filtering mechanism that eliminates trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in current state-of-the-art LLMs across diverse knowledge domains: even the best-performing model, the reasoning-focused DeepSeek-R1, reaches only 61.82% accuracy on SuperGPQA. This highlights the considerable gap between current model capabilities and artificial general intelligence (AGI). Additionally, we present comprehensive insights from managing a large-scale annotation process involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering methodological guidance for future research initiatives of comparable scope.
