U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
December 4, 2024
Authors: Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga
cs.AI
Abstract
The current evaluation of mathematical skills in LLMs is limited: existing benchmarks either are relatively small, focus primarily on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored.

To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished, open-ended, university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release mu-MATH, a dataset for evaluating LLMs' capabilities in judging solutions.

The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges posed by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45% on visual problems. Solution assessment also proves challenging: the best LLM judge attains an F1-score of 80% on mu-MATH.
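As a rough, hypothetical sketch of the judge-based evaluation protocol described above (not the authors' actual pipeline or prompt template), the snippet below shows how an LLM can be asked to grade a free-form solution against a reference answer, and how the judge itself can then be scored with a binary F1 metric against mu-MATH-style gold labels. The `query_llm` callable and the prompt wording are assumptions introduced for illustration.

```python
from sklearn.metrics import f1_score


def judge_solution(query_llm, problem: str, reference_answer: str, candidate_solution: str) -> bool:
    """Ask an LLM judge whether a free-form solution matches the reference answer.

    `query_llm` is a hypothetical callable (prompt string -> completion string);
    the prompt below is illustrative, not the paper's actual judge prompt.
    """
    prompt = (
        "You are grading a university-level math solution.\n"
        f"Problem: {problem}\n"
        f"Reference answer: {reference_answer}\n"
        f"Candidate solution: {candidate_solution}\n"
        "Is the candidate's final answer mathematically equivalent to the reference? "
        "Reply with exactly 'CORRECT' or 'INCORRECT'."
    )
    verdict = query_llm(prompt).strip().upper()
    return verdict.startswith("CORRECT")


def meta_evaluate_judge(judge_verdicts: list[bool], gold_labels: list[bool]) -> float:
    """Score a judge's binary verdicts against human gold labels with binary F1."""
    return f1_score(gold_labels, judge_verdicts)
```

In this sketch, the meta-evaluation step is where a dataset like mu-MATH comes in: its human-verified correctness labels play the role of `gold_labels`, so the F1 score reflects how reliably a given LLM judge reproduces human grading decisions.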