PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

Abstract

Large language models demonstrate remarkable capabilities across various domains, especially mathematical and logical reasoning. However, current evaluations overlook physics-based reasoning, a complex task that requires applying physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, which combines efficient answer-level evaluation with comprehensive step-level evaluation. Top-performing models such as Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identify four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.
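As a rough illustration of how answer-level results might be broken down by problem category, the sketch below computes the problem counts implied by the stated 25%/75% split and a per-category accuracy. It is not the Physics Solution Auto Scoring Framework itself: the function name `answer_level_accuracy`, the record layout, and the toy records are all assumptions made for this example, and the toy records are not results from the benchmark.

```python
from collections import defaultdict

TOTAL_PROBLEMS = 1200  # benchmark size stated in the abstract

# Counts implied by the stated split (25% knowledge-based, 75% reasoning-based).
knowledge_count = TOTAL_PROBLEMS * 25 // 100
reasoning_count = TOTAL_PROBLEMS - knowledge_count


def answer_level_accuracy(records):
    """Return accuracy in percent per category.

    `records` is an iterable of (category, is_correct) pairs; this layout
    is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


# Made-up records for demonstration only, not benchmark data.
toy_records = [
    ("knowledge", True),
    ("knowledge", True),
    ("easy", True),
    ("hard", False),
    ("hard", True),
    ("hard", False),
]
toy_scores = answer_level_accuracy(toy_records)
```

Reporting accuracy per category rather than as a single number is what makes a drop such as the one from knowledge questions to hard problems visible.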
