Reflection-Bench: probing AI intelligence with reflection

Abstract

Reflection, the ability to adapt beliefs or behaviors in response to unexpected outcomes, is fundamental to how intelligent systems interact with the world. From a cognitive science perspective, it serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks that span the core cognitive functions crucial for reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.

