The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz
November 20, 2024
Authors: David Noever, Forrest McKee
cs.AI
Abstract
This research introduces a novel evaluation framework designed to assess
large language models' (LLMs) ability to acknowledge uncertainty on 675
fundamentally unsolvable problems. Using a curated dataset of graduate-level
grand challenge questions with intentionally unknowable answers, we evaluated
twelve state-of-the-art LLMs, including both open and closed-source models, on
their propensity to admit ignorance rather than generate plausible but
incorrect responses. The best models scored in the 62-68% accuracy range for
admitting that the problem's solution was unknown, across fields ranging from
biology to philosophy and mathematics. We observed an inverse relationship between problem
difficulty and model accuracy, with GPT-4 demonstrating higher rates of
uncertainty acknowledgment on more challenging problems (35.8%) compared to
simpler ones (20.0%). This pattern indicates that models may be more prone to
generate speculative answers when problems appear more tractable. The study
also revealed significant variations across problem categories, with models
showing difficulty in acknowledging uncertainty in invention and NP-hard
problems while performing relatively better on philosophical and psychological
challenges. These results contribute to the growing body of research on
artificial general intelligence (AGI) assessment by highlighting the importance
of uncertainty recognition as a critical component of future machine
intelligence evaluation. This impossibility test thus extends previous
theoretical frameworks for universal intelligence testing by providing
empirical evidence of current limitations in LLMs' ability to recognize their
own knowledge boundaries, suggesting new directions for improving model
training architectures and evaluation approaches.
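To make the evaluation protocol described above concrete, the sketch below shows one plausible way to score whether a model's response to an intentionally unsolvable question admits uncertainty rather than offering a speculative answer, and to aggregate accuracy per category. This is not the authors' code; the `Question` dataclass, the `UNCERTAINTY_MARKERS` keyword list, and the helper names are illustrative assumptions, and a real evaluation might use a judge model instead of keyword matching.

```python
# Minimal sketch of an uncertainty-acknowledgment evaluation in the spirit of
# the impossibility test. All names here are illustrative assumptions, not the
# paper's implementation.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Question:
    text: str        # an intentionally unsolvable grand-challenge question
    category: str    # e.g. "biology", "philosophy", "NP-hard", "invention"
    difficulty: str  # e.g. "simpler" or "more challenging"

# Hypothetical phrases taken to signal that a model admits the answer is unknown.
UNCERTAINTY_MARKERS = (
    "unknown", "unsolved", "no known answer", "cannot be determined",
    "remains an open problem", "i don't know",
)

def is_admission_of_uncertainty(response: str) -> bool:
    """Crude keyword check; a judge model could replace this heuristic."""
    lowered = response.lower()
    return any(marker in lowered for marker in UNCERTAINTY_MARKERS)

def score_model(questions: list[Question], answers: list[str]) -> dict[str, float]:
    """Return per-category accuracy: the fraction of responses that admit uncertainty."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q, a in zip(questions, answers):
        totals[q.category] += 1
        hits[q.category] += is_admission_of_uncertainty(a)
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

Under this kind of scoring, per-category accuracies could then be compared across difficulty levels and across the twelve evaluated models, mirroring the comparisons reported in the abstract.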