PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
April 22, 2025
Authors: Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiawei Lin, Zheyu Shen, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Bohan Zhang, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-xing Luo, Muhan Zhang, Hua Xing Zhu
cs.AI
Abstract
We introduce PHYBench, a novel, high-quality benchmark designed for
evaluating reasoning capabilities of large language models (LLMs) in physical
contexts. PHYBench consists of 500 meticulously curated physics problems based
on real-world physical scenarios, designed to assess the ability of models to
understand and reason about realistic physical processes. Covering mechanics,
electromagnetism, thermodynamics, optics, modern physics, and advanced physics,
the benchmark spans difficulty levels from high school exercises to
undergraduate problems and Physics Olympiad challenges. Additionally, we
propose the Expression Edit Distance (EED) Score, a novel evaluation metric
based on the edit distance between mathematical expressions, which effectively
captures differences in model reasoning processes and results beyond
traditional binary scoring methods. We evaluate various LLMs on PHYBench and
compare their performance with human experts. Our results reveal that even
state-of-the-art reasoning models significantly lag behind human experts,
highlighting their limitations and the need for improvement in complex physical
reasoning scenarios. Our benchmark results and dataset are publicly available
at https://phybench-official.github.io/phybench-demo/.
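
The abstract describes the Expression Edit Distance (EED) Score only at a high level. Below is a minimal sketch of how such a metric might be computed, assuming SymPy for parsing symbolic answers and a simple token-level edit distance over flattened expression trees as a stand-in for a full tree edit distance; the function names (e.g. `eed_score`) and the normalization are illustrative assumptions, not the authors' implementation.

```python
# Simplified sketch of an expression edit distance; NOT the paper's EED implementation.
# Assumes SymPy for parsing. The actual metric may use a true tree edit distance and
# a different normalization/scoring scheme.
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr


def expr_to_tokens(expr):
    """Flatten a SymPy expression tree into a preorder list of node labels."""
    if not expr.args:                       # leaf: symbol or number
        return [str(expr)]
    tokens = [expr.func.__name__]           # internal node: operator/function name
    for arg in expr.args:
        tokens.extend(expr_to_tokens(arg))
    return tokens


def levenshtein(a, b):
    """Standard dynamic-programming edit distance over two token sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]


def eed_score(gold: str, pred: str) -> float:
    """Map the edit distance between two expressions to a [0, 1] score."""
    g = sp.simplify(parse_expr(gold))
    p = sp.simplify(parse_expr(pred))
    if sp.simplify(g - p) == 0:             # symbolically identical: full credit
        return 1.0
    tg, tp = expr_to_tokens(g), expr_to_tokens(p)
    dist = levenshtein(tg, tp)
    return max(0.0, 1.0 - dist / max(len(tg), len(tp)))


if __name__ == "__main__":
    # A nearly-correct answer receives partial credit instead of a binary zero.
    gold = "m*g*L*(1 - cos(theta))"
    print(eed_score(gold, "m*g*L*(1 - cos(theta))"))  # 1.0
    print(eed_score(gold, "m*g*L*(1 - sin(theta))"))  # high partial score
    print(eed_score(gold, "m*g*L"))                   # lower score
```

The point of such a metric, as the abstract argues, is that answers which are structurally close to the reference expression receive graded credit rather than the all-or-nothing outcome of binary scoring.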