
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

February 3, 2025
作者: Carolyn Jane Anderson, Joydeep Biswas, Aleksander Boruch-Gruszecki, Federico Cassano, Molly Q Feldman, Arjun Guha, Francesca Lucchetti, Zixuan Wu
cs.AI

Abstract

Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par with it on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output, and in rare cases it does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
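
The easy-to-verify property is concrete: Sunday Puzzle answers are typically a single word or short phrase, so checking a model's output can reduce to a normalized string comparison. A minimal sketch of such a checker (the helper names and the normalization rule are illustrative assumptions, not the paper's code):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and trim whitespace for comparison."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def is_correct(candidate: str, gold: str) -> bool:
    """Puzzle answers are short words or phrases, so verification
    reduces to a normalized exact-match check; no expert grader needed."""
    return normalize(candidate) == normalize(gold)

print(is_correct("Emerald!", "emerald"))   # True
print(is_correct("sapphire", "emerald"))   # False
```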
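
The "wrap up" idea can likewise be sketched as an inference-time guard: reserve part of the token budget, and if the reasoning stream exhausts its share without committing to an answer, re-prompt the model to conclude. Everything below, including the `generate` callable, the `FINAL ANSWER:` marker, and the budget constants, is a hypothetical illustration rather than the authors' implementation:

```python
CONTEXT_LIMIT = 32_768   # assumed model context window, in tokens
WRAP_UP_MARGIN = 1_024   # tokens held back for forcing a final answer

def solve_with_wrap_up(generate, puzzle: str) -> str:
    """generate(prompt, max_tokens) is assumed to return generated text."""
    budget = CONTEXT_LIMIT - WRAP_UP_MARGIN
    reasoning = generate(f"Think step by step:\n{puzzle}", max_tokens=budget)
    if "FINAL ANSWER:" in reasoning:
        return reasoning.split("FINAL ANSWER:")[-1].strip()
    # The budget ran out before the model finished thinking: force it to
    # conclude instead of running past the context window.
    wrap_up = generate(
        f"{puzzle}\n\nPartial reasoning:\n{reasoning}\n"
        "You are out of time. State only your best final answer.",
        max_tokens=WRAP_UP_MARGIN,
    )
    return wrap_up.strip()
```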
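
One plausible way to quantify whether longer reasoning still helps, as the abstract describes for R1 and Gemini Thinking, is to bucket completed runs by thinking-token count and compute per-bucket accuracy; where the curve flattens, extra reasoning has stopped paying off. The record format below is an assumption for illustration:

```python
from collections import defaultdict

def accuracy_by_reasoning_length(records, bucket_size=2048):
    """records: iterable of (num_thinking_tokens, was_correct) pairs.
    Returns {bucket_start: accuracy}, useful for spotting the point
    where accuracy saturates as reasoning gets longer."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for tokens, correct in records:
        bucket = (tokens // bucket_size) * bucket_size
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}
```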
