

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation

February 26, 2025
作者: Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu
cs.AI

Abstract

There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only <9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.
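The abstract notes that in algorithmic problem solving, counterexamples can be evaluated automatically via code execution. The core check can be sketched as follows: a proposed input is a valid counterexample if the incorrect submission's output diverges from a correct reference solution's output on that input. This is a minimal illustrative sketch, not the REFUTE harness itself (which executes real competition submissions); the task, function names, and buggy solution below are invented for illustration.

```python
def is_valid_counterexample(candidate, reference, test_input):
    """A counterexample is an input on which the incorrect
    submission's output differs from the reference solution's."""
    return candidate(test_input) != reference(test_input)

# Toy task: return the maximum element of a list.
def reference(nums):
    # Correct solution.
    return max(nums)

def candidate(nums):
    # Subtle bug: implicitly assumes all elements are non-negative,
    # so it returns 0 for an all-negative list.
    best = 0
    for x in nums:
        best = max(best, x)
    return best

print(is_valid_counterexample(candidate, reference, [5, 3]))    # False: bug not exposed
print(is_valid_counterexample(candidate, reference, [-2, -7]))  # True: all-negative input exposes the bug
```

As the sketch suggests, the benchmark's difficulty lies not in this verification step but in *finding* an input that exposes a subtle bug, which is the falsification ability the paper measures.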
