Are Your LLMs Capable of Stable Reasoning?

December 17, 2024
Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen
cs.AI

Abstract

The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.
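
To make the intuition behind G-Pass@k concrete, below is a minimal sketch, not the paper's exact definition (which is given in the paper and the GPassK repository). It assumes the metric generalizes the unbiased Pass@k estimator by requiring that at least ⌈τ·k⌉ of k generations, drawn without replacement from n sampled answers of which c are correct, are themselves correct; under that assumption it reduces to a hypergeometric tail. The function name `g_pass_at_k` and the parameters `n`, `c`, `k`, `tau` are illustrative.

```python
import math

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k generations drawn
    (without replacement) from n samples, of which c are correct, are
    themselves correct -- a hypergeometric-tail generalization of the
    unbiased Pass@k estimator with a stability threshold tau.
    This is an illustrative sketch, not the paper's official implementation."""
    m = math.ceil(tau * k)                 # minimum number of correct draws required
    total = math.comb(n, k)                # all ways to draw k of the n samples
    return sum(
        math.comb(c, j) * math.comb(n - c, k - j)  # ways to draw exactly j correct answers
        for j in range(m, min(c, k) + 1)
    ) / total

# Example: 16 sampled answers per problem, 10 of them correct;
# require at least 75% of 8 draws to be correct.
print(g_pass_at_k(n=16, c=10, k=8, tau=0.75))
```

Averaging this quantity over all problems in a benchmark such as LiveMathBench would then reward models that answer correctly consistently across samples, not just occasionally.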
