PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
April 22, 2025
Authors: Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiawei Lin, Zheyu Shen, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Bohan Zhang, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-xing Luo, Muhan Zhang, Hua Xing Zhu
cs.AI
Abstract
We introduce PHYBench, a novel, high-quality benchmark designed for
evaluating reasoning capabilities of large language models (LLMs) in physical
contexts. PHYBench consists of 500 meticulously curated physics problems based
on real-world physical scenarios, designed to assess the ability of models to
understand and reason about realistic physical processes. Covering mechanics,
electromagnetism, thermodynamics, optics, modern physics, and advanced physics,
the benchmark spans difficulty levels from high school exercises to
undergraduate problems and Physics Olympiad challenges. Additionally, we
propose the Expression Edit Distance (EED) Score, a novel evaluation metric
based on the edit distance between mathematical expressions, which effectively
captures differences in model reasoning processes and results beyond
traditional binary scoring methods. We evaluate various LLMs on PHYBench and
compare their performance with human experts. Our results reveal that even
state-of-the-art reasoning models significantly lag behind human experts,
highlighting their limitations and the need for improvement in complex physical
reasoning scenarios. Our benchmark results and dataset are publicly available
at https://phybench-official.github.io/phybench-demo/.
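
The EED Score described in the abstract grades an answer by how close its symbolic structure is to the reference expression, rather than pass/fail. Below is a minimal, hedged sketch of that idea: it parses both expressions with sympy, flattens each expression tree into a prefix-order token sequence, and turns a Levenshtein distance into a partial-credit score. The function names (`expr_tokens`, `levenshtein`, `eed_score`) and the exact distance-to-score mapping are illustrative assumptions, not the paper's official implementation, which defines its own tree-level scoring rules.

```python
# Illustrative sketch of an expression-edit-distance style score.
# NOTE: this is NOT the official PHYBench EED implementation; the helper
# names and the distance-to-score mapping are assumptions for demonstration.
import sympy as sp


def expr_tokens(expr_str: str) -> list[str]:
    """Parse an expression with sympy and flatten its tree into prefix-order tokens."""
    expr = sp.sympify(expr_str)
    tokens: list[str] = []

    def walk(node):
        # Internal nodes are labeled by their operator type; leaves by their value.
        tokens.append(type(node).__name__ if node.args else str(node))
        for arg in node.args:
            walk(arg)

    walk(expr)
    return tokens


def levenshtein(a: list[str], b: list[str]) -> int:
    """Standard dynamic-programming edit distance between two token sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]


def eed_score(answer: str, reference: str) -> float:
    """Map edit distance to a [0, 1] score: identical trees score 1, disjoint trees score 0."""
    a, b = expr_tokens(answer), expr_tokens(reference)
    dist = levenshtein(a, b)
    return max(0.0, 1.0 - dist / max(len(a), len(b)))


# A partially correct answer earns partial credit instead of a binary zero.
print(eed_score("m*g*h + m*v**2/2", "m*g*h"))  # shares the m*g*h subtree -> score between 0 and 1
print(eed_score("m*g*h", "g*h*m"))             # sympy canonicalizes the ordering -> 1.0
```

The key design point this sketch illustrates is that structurally similar expressions receive graded credit, so a model that reaches an almost-correct final expression is distinguished from one that is entirely wrong, which binary accuracy cannot do.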