
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

October 7, 2024
Authors: Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo
cs.AI

Abstract

Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although several recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks such as Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay requires models to possess both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.
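
To make the task format concrete, the following is a minimal, hypothetical Python sketch of a haystack-style long-context math instance: the numerical facts needed to answer the question are buried among many irrelevant filler passages, so a model must both locate them and carry out multi-step arithmetic rather than perform simple retrieval. The documents, question, numbers, and helper names (build_haystack, is_correct) are illustrative assumptions for exposition, not the authors' released construction pipeline or scoring protocol.

import random

# Hypothetical relevant documents: each carries one numerical fact needed
# for the final answer (both facts must be combined, not just retrieved).
RELEVANT_DOCS = [
    "In Q1, the startup reported revenue of 2.4 million dollars.",
    "In Q2, the startup's revenue grew by 25 percent over Q1.",
]

# Irrelevant filler passages used to pad the context to the target length.
FILLER = [
    "The museum's new exhibit opened to large crowds over the weekend.",
    "Local officials discussed plans for the upcoming transit expansion.",
    "The research team published its annual survey of migratory birds.",
]

QUESTION = (
    "Based on the documents above, what was the startup's total revenue "
    "(in millions of dollars) across Q1 and Q2?"
)

# Gold answer requires two reasoning steps: Q2 = 2.4 * 1.25, total = Q1 + Q2.
GOLD_ANSWER = 2.4 + 2.4 * 1.25  # = 5.4


def build_haystack(num_filler: int = 200, seed: int = 0) -> str:
    """Mix the relevant documents into a long run of filler passages so the
    needed facts sit at arbitrary depths within the context."""
    rng = random.Random(seed)
    docs = RELEVANT_DOCS + [rng.choice(FILLER) for _ in range(num_filler)]
    rng.shuffle(docs)
    return "\n\n".join(docs)


def is_correct(model_answer: float, gold: float = GOLD_ANSWER, tol: float = 1e-2) -> bool:
    """Score a numeric answer with a small tolerance; the benchmark's exact
    scoring protocol may differ."""
    return abs(model_answer - gold) <= tol


if __name__ == "__main__":
    context = build_haystack()
    prompt = f"{context}\n\nQuestion: {QUESTION}\nAnswer with a single number."
    print(f"Prompt length: {len(prompt.split())} words")
    print("Correct?", is_correct(5.4))

In the benchmark itself, such instances are generated automatically and evaluated over long contexts (the abstract reports accuracy at 128K tokens), with answer accuracy as the reported metric.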
