

Reflection-Bench: probing AI intelligence with reflection

October 21, 2024
Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang
cs.AI

Abstract

Reflection, the ability to adapt beliefs or behaviors in response to unexpected outcomes, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, it serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performance of 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.
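
To make the evaluation setup concrete, below is a minimal, hypothetical sketch of one probe of belief updating: a two-armed bandit whose reward contingency reverses mid-session, a classic paradigm of the kind such benchmarks adapt. The actual task definitions, prompts, and scoring are in the authors' repository (https://github.com/YabYum/ReflectionBench); `mock_llm` and all parameters here are illustrative assumptions, not the paper's implementation.

```python
"""Illustrative sketch only: a probabilistic-reversal probe of belief updating,
one of the seven cognitive functions named in the abstract. Task details and
scoring in Reflection-Bench itself may differ."""
import random
from typing import Callable

LLM = Callable[[str], str]  # prompt -> single-letter answer


def mock_llm(prompt: str) -> str:
    """Stand-in for a real model client; replace with an actual API call."""
    return random.choice(["A", "B"])


def reversal_learning_session(
    llm: LLM, n_trials: int = 40, reversal_at: int = 20, p_reward: float = 0.8
) -> float:
    """Two-armed bandit whose better arm flips mid-session.

    A model that updates its beliefs after the unannounced reversal should keep
    its reward rate well above chance in the second half of the session.
    """
    good_arm = "A"
    rewards = []
    for t in range(n_trials):
        if t == reversal_at:  # unannounced contingency reversal
            good_arm = "B"
        choice = llm(f"Trial {t}: pick arm A or B (answer with one letter).")
        hit = choice == good_arm
        # The better arm pays off with probability p_reward, the other otherwise.
        rewarded = hit if random.random() < p_reward else not hit
        rewards.append(rewarded)
        # A real harness would feed the reward back into the conversation history.
    return sum(rewards[reversal_at:]) / (n_trials - reversal_at)


if __name__ == "__main__":
    rate = reversal_learning_session(mock_llm)
    print(f"post-reversal reward rate: {rate:.2f}")
```

With the random stand-in model, the post-reversal reward rate hovers near chance (0.5); a model that tracks feedback and updates its arm preference should approach the 0.8 ceiling.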
