BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
January 2, 2025
Authors: Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman
cs.AI
Abstract
Understanding the world and explaining it with scientific theories is a
central aspiration of artificial intelligence research. Proposing theories,
designing experiments to test them, and then revising them based on data are
fundamental to scientific discovery. Despite the significant promise of
LLM-based scientific agents, no benchmarks systematically test LLMs' ability to
propose scientific models, collect experimental data, and revise them in light
of new data. We introduce BoxingGym, a benchmark with 10 environments for
systematically evaluating both experimental design (e.g. collecting data to
test a scientific theory) and model discovery (e.g. proposing and revising
scientific theories). To enable tractable and quantitative evaluation, we
implement each environment as a generative probabilistic model with which a
scientific agent can run interactive experiments. These probabilistic models
are drawn from various real-world scientific domains ranging from psychology to
ecology. To quantitatively evaluate a scientific agent's ability to collect
informative experimental data, we compute the expected information gain (EIG),
an information-theoretic quantity which measures how much an experiment reduces
uncertainty about the parameters of a generative model. A good scientific
theory is a concise and predictive explanation. Therefore, to quantitatively
evaluate model discovery, we ask a scientific agent to explain its model and
then assess whether this explanation enables another scientific agent to make
reliable predictions about this environment. In addition to this
explanation-based evaluation, we compute standard model evaluation metrics such
as prediction errors. We find that current LLMs, such as GPT-4o, struggle with
both experimental design and model discovery. We find that augmenting the
LLM-based agent with an explicit statistical model does not reliably improve
these results.
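For reference, the expected information gain mentioned above has a standard Bayesian formulation (the notation below is ours and is given only as an illustration; the benchmark's exact estimator may differ). For a candidate experiment design d, with prior p(\theta) over the parameters of the generative model and likelihood p(y \mid \theta, d) over outcomes y, the EIG is the mutual information between \theta and y:

EIG(d) = \mathbb{E}_{p(y \mid d)}\big[ H[p(\theta)] - H[p(\theta \mid y, d)] \big] = \mathbb{E}_{p(\theta)\, p(y \mid \theta, d)}\big[ \log p(y \mid \theta, d) - \log p(y \mid d) \big],

where p(y \mid d) = \int p(y \mid \theta, d)\, p(\theta)\, d\theta is the prior predictive distribution; in practice this marginal is typically approximated with a nested Monte Carlo average over prior samples.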