Are Your LLMs Capable of Stable Reasoning?

Abstract

Large Language Models (LLMs) have made remarkable progress on complex reasoning tasks, yet a significant discrepancy persists between benchmark performance and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and their operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at https://github.com/open-compass/GPassK.
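The abstract does not state how G-Pass@k is computed. As a rough illustration only, the sketch below assumes a hypergeometric formulation: for a problem with `n` sampled answers of which `c` are correct, it returns the probability that at least `ceil(tau * k)` of `k` answers drawn without replacement are correct. The function name `g_pass_at_k`, its parameters, and the threshold `tau` are assumptions introduced here, not definitions taken from this text.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Illustrative stability-aware pass rate for a single problem.

    Assumed setup (not given in the abstract):
      n   -- number of answers sampled for the problem
      c   -- number of those answers that are correct
      k   -- number of answers drawn without replacement
      tau -- fraction of the k draws that must be correct, in [0, 1]
    """
    if not (0 <= c <= n and 1 <= k <= n and 0.0 <= tau <= 1.0):
        raise ValueError("require 0 <= c <= n, 1 <= k <= n, 0 <= tau <= 1")

    # At least one correct draw is always required, so tau = 0 behaves like pass@k.
    need = max(1, ceil(tau * k))

    # Hypergeometric tail: P(at least `need` correct among k draws).
    favourable = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(need, min(c, k) + 1)
    )
    return favourable / comb(n, k)


if __name__ == "__main__":
    # A model that is right on 8 of 16 samples looks strong when one hit is
    # enough, and much weaker when all k draws must be correct.
    for tau in (0.0, 0.5, 1.0):
        print(tau, g_pass_at_k(n=16, c=8, k=4, tau=tau))
```

Under these assumptions, a small `tau` rewards peak capability (any correct answer counts), while `tau = 1.0` rewards only problems the model solves on every draw. Averaging the per-problem value over a benchmark would give a single score for a chosen `k` and `tau`.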
