

Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

October 30, 2024
作者: Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz, Paweł Morawiecki
cs.AI

Abstract

We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. Through extensive evaluation of state-of-the-art models and comparison to well-established benchmarks, we demonstrate that our proposed benchmark, FeatEng, can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods.
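The evaluation protocol described above can be sketched as follows. This is a hypothetical illustration, not the official FeatEng harness: it scores a generated `transform(df)` function by the cross-validated accuracy gain of a boosted-tree model on the transformed data versus the original. `GradientBoostingClassifier` stands in for XGBoost so the example is self-contained; the dataset, function names, and scoring details are assumptions for illustration only.

```python
# Hypothetical sketch of a FeatEng-style evaluation loop (not the official
# harness). The score is the improvement in mean cross-validated accuracy
# after applying LLM-generated feature engineering code.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import cross_val_score


def score_dataset(df: pd.DataFrame, target: pd.Series) -> float:
    """Mean 3-fold CV accuracy of a boosted-tree model on the given features."""
    model = GradientBoostingClassifier(random_state=0)
    return cross_val_score(model, df, target, cv=3).mean()


def evaluate_transform(transform_code: str, df: pd.DataFrame,
                       target: pd.Series) -> float:
    """Execute generated code defining `transform(df) -> df` and return the
    score improvement over the untransformed data. In a real harness the
    generated code would be sandboxed before execution."""
    namespace: dict = {}
    exec(transform_code, namespace)
    transformed = namespace["transform"](df.copy())
    return score_dataset(transformed, target) - score_dataset(df, target)


data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# A toy "LLM-generated" transform: add one ratio feature.
generated = """
def transform(df):
    df["area_per_perimeter"] = df["mean area"] / df["mean perimeter"]
    return df
"""

improvement = evaluate_transform(generated, X, y)
print(f"score improvement: {improvement:+.4f}")
```

In the benchmark itself, the model additionally receives a natural-language description of the dataset and columns in the prompt, so that domain knowledge (not just column statistics) can inform the generated features.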

