Reflection-Bench: probing AI intelligence with reflection

October 21, 2024
Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang
cs.AI

Abstract

Reflection, the ability to adapt beliefs or behaviors in response to unexpected outcomes, is fundamental to how intelligent systems interact with the world. From a cognitive science perspective, it is a core principle of intelligence applicable to both human and AI systems. To address the debate over the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performance of 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.
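
To make the benchmark's structure concrete, below is a minimal Python sketch of what an evaluation loop over the seven tasks might look like. The task names are taken from the abstract; the `evaluate_model` and `run_task` interfaces are illustrative assumptions, not the actual API of the repository at https://github.com/YabYum/ReflectionBench.

```python
# Hypothetical sketch of a Reflection-Bench-style evaluation loop.
# Only the seven task names come from the paper's abstract; the runner
# and scoring interfaces below are assumptions for illustration.

from typing import Callable, Dict, List

TASKS: List[str] = [
    "perception",
    "memory",
    "belief_updating",
    "decision_making",
    "prediction",
    "counterfactual_thinking",
    "meta_reflection",
]

def evaluate_model(
    model_name: str,
    run_task: Callable[[str, str], float],  # (model, task) -> score in [0, 1]
) -> Dict[str, float]:
    """Run one model through all seven tasks and collect per-task scores."""
    return {task: run_task(model_name, task) for task in TASKS}

if __name__ == "__main__":
    # Stub scorer for illustration only; a real harness would query the
    # LLM and grade its responses against task-specific ground truth.
    def dummy_run_task(model: str, task: str) -> float:
        return 0.0

    scores = evaluate_model("gpt-4", dummy_run_task)
    for task, score in scores.items():
        print(f"{task}: {score:.2f}")
```

In this sketch, a per-model score profile across the seven tasks (rather than a single aggregate number) mirrors the paper's framing of reflection as a composite of distinct cognitive functions.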
