Are Your LLMs Capable of Stable Reasoning?
December 17, 2024
Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen
cs.AI
Abstract
The rapid advancement of Large Language Models (LLMs) has demonstrated
remarkable progress in complex reasoning tasks. However, a significant
discrepancy persists between benchmark performances and real-world
applications. We identify this gap as primarily stemming from current
evaluation protocols and metrics, which inadequately capture the full spectrum
of LLM capabilities, particularly in complex reasoning tasks where both
accuracy and consistency are crucial. This work makes two key contributions.
First, we introduce G-Pass@k, a novel evaluation metric that provides a
continuous assessment of model performance across multiple sampling attempts,
quantifying both the model's peak performance potential and its stability.
Second, we present LiveMathBench, a dynamic benchmark comprising challenging,
contemporary mathematical problems designed to minimize data leakage risks
during evaluation. Through extensive experiments using G-Pass@k on
state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights
into both their maximum capabilities and operational consistency. Our findings
reveal substantial room for improvement in LLMs' "realistic" reasoning
capabilities, highlighting the need for more robust evaluation methods. The
benchmark and detailed results are available at:
https://github.com/open-compass/GPassK.
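
A minimal sketch of how such a stability-aware metric can be estimated, assuming G-Pass@k generalizes the standard unbiased Pass@k estimator: with n sampled responses per problem, of which c are correct, it estimates the hypergeometric probability that at least ⌈τ·k⌉ of k responses drawn without replacement are correct. The function name `g_pass_at_k` and the sample counts below are illustrative, not the repository's API:

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Hypergeometric estimate of G-Pass@k_tau for a single problem.

    n:   total responses sampled for the problem
    c:   how many of those n responses are correct
    k:   number of responses drawn (without replacement) at evaluation time
    tau: required fraction of the k drawn responses that must be correct
    """
    need = ceil(tau * k)                       # minimum number of correct draws
    total = comb(n, k)                         # all ways to draw k of the n responses
    hits = sum(comb(c, j) * comb(n - c, k - j) # draws with exactly j correct responses
               for j in range(need, min(c, k) + 1))
    return hits / total

# Averaging over problems gives a benchmark-level score.
# Illustrative numbers: 48 samples per problem, budget k = 16, threshold tau = 0.75.
samples = [(48, 30), (48, 41), (48, 12)]       # (n, c) pairs, one per problem
score = sum(g_pass_at_k(n, c, k=16, tau=0.75) for n, c in samples) / len(samples)
print(f"G-Pass@16_0.75 ≈ {score:.3f}")
```

Setting tau small enough that ⌈τ·k⌉ = 1 recovers the usual Pass@k (any correct draw counts), while tau = 1 demands that all k drawn responses are correct, which is how the metric separates peak capability from consistency.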