U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
December 4, 2024
Authors: Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga
cs.AI
Abstract
The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, focus primarily on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored.

To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release mu-MATH, a dataset for evaluating LLMs' capabilities in judging solutions.

The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45% on visual problems. Solution assessment proves challenging for LLMs, with the best LLM judge reaching an F1-score of 80% on mu-MATH.
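For context, the sketch below illustrates how an LLM-as-judge correctness check for open-ended answers might be structured. The prompt wording, the verdict labels, and the call_llm callable are illustrative assumptions, not the judge prompt or pipeline used by the authors.

```python
# Minimal, illustrative sketch of an LLM-as-judge correctness check for
# open-ended math answers. The prompt text, labels, and `call_llm` hook are
# assumptions for illustration only, not the U-MATH / mu-MATH judge setup.
from typing import Callable

JUDGE_PROMPT = """You are grading a university-level math problem.
Problem: {problem}
Reference answer: {reference}
Model solution: {solution}
Does the model solution arrive at an answer equivalent to the reference?
Reply with exactly one word: CORRECT or INCORRECT."""


def judge_solution(
    problem: str,
    reference: str,
    solution: str,
    call_llm: Callable[[str], str],
) -> bool:
    """Return True if the judge LLM deems the solution correct."""
    prompt = JUDGE_PROMPT.format(
        problem=problem, reference=reference, solution=solution
    )
    verdict = call_llm(prompt).strip().upper()
    # "INCORRECT" does not start with "CORRECT", so this cleanly separates the two labels.
    return verdict.startswith("CORRECT")


if __name__ == "__main__":
    # Stub judge for demonstration; replace with a real chat-completion call.
    fake_llm = lambda prompt: "CORRECT"
    print(judge_solution(
        "Compute the integral of 2x from 0 to 1.",
        "1",
        "The integral evaluates to 1.",
        fake_llm,
    ))
```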