
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

February 13, 2025
Authors: Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou
cs.AI

Abstract

In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say? This relates to the more familiar term "stochastic parrot." To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the use of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1, and Gemini 2.0 Flash Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on same-formatted data added little to their performance.
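For readers unfamiliar with grid-format tasks, the sketch below illustrates, under stated assumptions, how an abstract grid depicting a physical phenomenon might be serialized and posed to an LLM as a multiple-choice question. The grid values, concept labels, and prompt wording are hypothetical and are not taken from the PhysiCo dataset itself.

```python
# Hypothetical illustration of a grid-format physical-concept item.
# The actual PhysiCo grids, answer options, and prompts may differ; this is only a sketch.

GRID = [  # 0 = empty cell, positive integers = "colored" cells in the abstract pattern
    [0, 0, 3, 0, 0],
    [0, 3, 3, 3, 0],
    [3, 3, 3, 3, 3],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

CHOICES = ["gravity", "diffusion", "refraction", "buoyancy"]  # assumed multiple-choice format


def grid_to_text(grid):
    """Serialize a 2-D integer grid into a plain-text block for an LLM prompt."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)


def build_prompt(grid, choices):
    """Compose a multiple-choice question asking which physical concept the grid depicts."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        "The grid below abstractly depicts a physical concept.\n\n"
        f"{grid_to_text(grid)}\n\n"
        f"Which concept does it depict?\n{options}\n"
        "Answer with a single letter."
    )


if __name__ == "__main__":
    print(build_prompt(GRID, CHOICES))
```

In this framing, the same concept can also be probed in natural language (e.g., "Describe what gravity is"), which is how the paper contrasts LLMs' strong verbal descriptions with their weaker performance on the abstract grid items.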
