The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz
November 20, 2024
Authors: David Noever, Forrest McKee
cs.AI
Abstract
This research introduces a novel evaluation framework designed to assess
large language models' (LLMs) ability to acknowledge uncertainty on 675
fundamentally unsolvable problems. Using a curated dataset of graduate-level
grand challenge questions with intentionally unknowable answers, we evaluated
twelve state-of-the-art LLMs, including both open and closed-source models, on
their propensity to admit ignorance rather than generate plausible but
incorrect responses. The best models achieved 62-68% accuracy in admitting
that the problem solution was unknown, across fields ranging from biology to
philosophy and mathematics. We observed an inverse relationship between problem
difficulty and model accuracy, with GPT-4 demonstrating higher rates of
uncertainty acknowledgment on more challenging problems (35.8%) compared to
simpler ones (20.0%). This pattern indicates that models may be more prone to
generate speculative answers when problems appear more tractable. The study
also revealed significant variations across problem categories, with models
showing difficulty in acknowledging uncertainty in invention and NP-hard
problems while performing relatively better on philosophical and psychological
challenges. These results contribute to the growing body of research on
artificial general intelligence (AGI) assessment by highlighting the importance
of uncertainty recognition as a critical component of future machine
intelligence evaluation. This impossibility test thus extends previous
theoretical frameworks for universal intelligence testing by providing
empirical evidence of current limitations in LLMs' ability to recognize their
own knowledge boundaries, suggesting new directions for improving model
training architectures and evaluation approaches.
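As a rough illustration of the kind of scoring such an evaluation implies, the sketch below classifies each model response to an unsolvable question as either an admission of uncertainty or a speculative answer, then reports per-category accuracy. This is a minimal sketch under stated assumptions, not the authors' pipeline: the record fields (`category`, `response`) and the keyword-based judge are illustrative choices.

```python
# Hypothetical sketch of scoring uncertainty acknowledgment on an
# "impossible questions" dataset. Field names and the keyword-based
# judge are illustrative assumptions, not the paper's actual method.
from collections import defaultdict

# Phrases treated as an admission that the answer is unknown or unknowable.
UNCERTAINTY_MARKERS = (
    "unknown", "unsolved", "no known answer", "cannot be determined",
    "remains an open question", "i don't know", "not currently possible",
)


def admits_uncertainty(response: str) -> bool:
    """Return True if the response acknowledges the problem is unsolved."""
    text = response.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)


def score_by_category(records):
    """records: iterable of dicts with 'category' and 'response' keys.

    Accuracy here is the fraction of responses that admit the answer is
    unknown, since every question in the dataset is unsolvable by construction.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        totals[cat] += 1
        if admits_uncertainty(rec["response"]):
            correct[cat] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    demo = [
        {"category": "mathematics",
         "response": "This is the Riemann Hypothesis; it remains an open question."},
        {"category": "philosophy",
         "response": "The answer is 42."},  # speculative, scored as incorrect
    ]
    print(score_by_category(demo))  # {'mathematics': 1.0, 'philosophy': 0.0}
```

In practice a keyword matcher is a crude judge; a human rater or an LLM grader would likely be needed to decide whether a response genuinely acknowledges that the problem is unsolved rather than hedging around a speculative answer.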