
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

October 30, 2024
作者: Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz, Paweł Morawiecki
cs.AI

Abstract

We present a benchmark for large language models (LLMs) designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is given a dataset description in a prompt and asked to generate code that transforms the dataset. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. Through an extensive evaluation of state-of-the-art models and a comparison with well-established benchmarks, we demonstrate that our proposed FeatEng can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods.
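The evaluation loop the abstract describes can be sketched as a small harness: execute the LLM-generated transformation code, then compare a downstream model's score on the transformed data against the original. This is a minimal illustration, not the paper's actual implementation; the function names (`evaluate_feature_code`, `transform`), the pluggable scorer standing in for XGBoost, and the simple score difference are all assumptions made here for clarity.

```python
# Minimal sketch of a FeatEng-style evaluation loop (illustrative assumptions:
# the `transform` entry point, the pluggable scorer, and the plain difference
# as the improvement metric are not specified in the abstract).
from typing import Any, Callable

def evaluate_feature_code(transform_code: str,
                          dataset: Any,
                          fit_and_score: Callable[[Any], float]) -> float:
    """Run LLM-generated feature-engineering code on `dataset` and return
    the score improvement of a model fit on the transformed data versus
    the original. In the paper the downstream model is XGBoost; here any
    fit-and-evaluate callable works.
    Caution: exec() of untrusted model output must be sandboxed in practice.
    """
    ns: dict = {}
    exec(transform_code, ns)            # generated code must define transform(data)
    transformed = ns["transform"](dataset)
    return fit_and_score(transformed) - fit_and_score(dataset)

# Toy usage: a "dataset" of numbers and a scorer that rewards a larger mean.
code = "def transform(data):\n    return [x * 2 for x in data]"
score = evaluate_feature_code(code, [1, 2, 3], lambda d: sum(d) / len(d))
# mean([2, 4, 6]) - mean([1, 2, 3]) = 4.0 - 2.0 = 2.0
```

In the benchmark proper, `fit_and_score` would train an XGBoost model on the (transformed) dataset and report its validation metric, so the returned improvement directly reflects how useful the generated features are.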

