Are Your LLMs Capable of Stable Reasoning?

Abstract

The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.
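
The abstract describes G-Pass@k only in words, so the following is a minimal sketch of how such a metric could be computed for a single problem, not the authors' reference implementation. It assumes a hypergeometric formulation: given n sampled generations of which c are correct, it returns the probability that at least ceil(tau * k) of k generations drawn without replacement are correct. The function name `g_pass_at_k`, the threshold parameter `tau`, and the per-problem framing are assumptions made for illustration; the repository linked above should be consulted for the exact definition.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Sketch of a per-problem G-Pass@k score (assumed formulation).

    n   : total generations sampled for the problem
    c   : number of those generations that are correct
    k   : number of generations drawn (without replacement) per trial
    tau : fraction of the k draws that must be correct, 0 < tau <= 1
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if not (0.0 < tau <= 1.0):
        raise ValueError("require 0 < tau <= 1")

    # Minimum number of correct draws needed; at least one.
    need = max(1, ceil(tau * k))

    # Hypergeometric tail: P(number of correct draws >= need).
    favorable = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(need, min(c, k) + 1)
    )
    return favorable / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 16 generations, 10 correct, draw 4.
    for tau in (0.25, 0.5, 1.0):
        print(tau, g_pass_at_k(n=16, c=10, k=4, tau=tau))
```

Under this assumed formulation, a small `tau` asks for a single correct draw and so reflects peak capability, while `tau = 1.0` requires every draw to be correct and so reflects stability. A benchmark-level score would average the per-problem values over all problems.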
