BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

Abstract

Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no existing benchmark systematically tests the ability of LLMs to propose scientific models, collect experimental data, and revise those models in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g., collecting data to test a scientific theory) and model discovery (e.g., proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity that measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain its model and then assess whether this explanation enables another scientific agent to make reliable predictions about the environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery, and that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.
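To make the EIG criterion mentioned in the abstract concrete, the following is a minimal sketch that computes it exactly for a toy model with a discrete parameter and a discrete outcome, as the prior entropy minus the expected posterior entropy. It is not BoxingGym's implementation: the function names, the two-hypothesis prior, and the three likelihood tables (`uninformative`, `informative`, `perfect`) are all hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """Exact EIG for one experiment with discrete parameter and outcome.

    prior[i]         = p(theta_i)
    likelihood[i][k] = p(y_k | theta_i, d) for the chosen experiment d
    Returns H[p(theta)] - E_y[ H[p(theta | y, d)] ].
    """
    n_params = len(prior)
    n_outcomes = len(likelihood[0])
    eig = entropy(prior)
    for k in range(n_outcomes):
        # Marginal probability of observing outcome k under the prior.
        p_y = sum(prior[i] * likelihood[i][k] for i in range(n_params))
        if p_y == 0:
            continue  # impossible outcomes contribute nothing
        # Bayes' rule: posterior over the parameter after seeing outcome k.
        posterior = [prior[i] * likelihood[i][k] / p_y for i in range(n_params)]
        eig -= p_y * entropy(posterior)
    return eig


# Hypothetical toy setup: two competing hypotheses, binary outcome.
prior = [0.5, 0.5]
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # outcome ignores the hypothesis
informative = [[0.9, 0.1], [0.1, 0.9]]    # outcome usually reveals it
perfect = [[1.0, 0.0], [0.0, 1.0]]        # outcome always reveals it

for name, table in [("uninformative", uninformative),
                    ("informative", informative),
                    ("perfect", perfect)]:
    print(name, expected_information_gain(prior, table))
```

In this toy setting, an experiment whose outcome does not depend on the hypothesis has zero EIG, while one that always identifies the hypothesis attains the full prior entropy of log 2 nats. Realistic environments with continuous parameters generally require Monte Carlo estimates rather than exact sums.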
