Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Abstract

We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is given a dataset description in a prompt and asked to generate code that transforms the dataset. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to one fit on the original data. Through an extensive evaluation of state-of-the-art models and a comparison with well-established benchmarks, we demonstrate that our proposed benchmark, FeatEng, can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods.
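The scoring idea described above (fit a model on the original data, fit it again on the transformed data, and compare) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the learner is a tiny threshold rule standing in for XGBoost so that the example runs with the standard library alone, `add_ratio` is a hand-written stand-in for code an LLM would generate, the toy data are invented, and the gain is taken as a plain difference in test accuracy, since the exact formula is not stated here.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Predictor = Callable[[Row], int]


def fit_stump(rows: List[Row], labels: List[int]) -> Predictor:
    """Stand-in learner (NOT XGBoost): the best single-feature threshold rule."""
    best: Tuple[int, str, float, int] = (-1, "", 0.0, 1)
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            for above in (1, 0):  # label predicted when value > threshold
                hits = sum(
                    (above if row[feature] > threshold else 1 - above) == label
                    for row, label in zip(rows, labels)
                )
                if hits > best[0]:
                    best = (hits, feature, threshold, above)
    _, best_feature, best_threshold, best_above = best
    return lambda row: (
        best_above if row[best_feature] > best_threshold else 1 - best_above
    )


def accuracy(predict: Predictor, rows: List[Row], labels: List[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def improvement(
    transform: Callable[[Row], Row],
    train: List[Row], y_train: List[int],
    test: List[Row], y_test: List[int],
) -> Tuple[float, float, float]:
    """Return (baseline accuracy, engineered accuracy, gain) on the test split."""
    baseline = accuracy(fit_stump(train, y_train), test, y_test)
    new_train = [transform(r) for r in train]
    new_test = [transform(r) for r in test]
    engineered = accuracy(fit_stump(new_train, y_train), new_test, y_test)
    return baseline, engineered, engineered - baseline


def add_ratio(row: Row) -> Row:
    """Hypothetical 'generated' feature-engineering code: add a ratio feature."""
    return {**row, "ratio": row["a"] / row["b"]}


def make_rows(pairs: List[Tuple[float, float]]) -> Tuple[List[Row], List[int]]:
    """Toy data: the label is 1 exactly when a > b, which no single raw
    feature captures but the ratio a / b does."""
    rows = [{"a": a, "b": b} for a, b in pairs]
    return rows, [int(a > b) for a, b in pairs]


train, y_train = make_rows(
    [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 8), (8, 7)]
)
test, y_test = make_rows([(2, 4), (9, 3), (1, 5), (6, 2)])

baseline_acc, engineered_acc, gain = improvement(
    add_ratio, train, y_train, test, y_test
)
print(f"baseline={baseline_acc:.2f} engineered={engineered_acc:.2f} gain={gain:+.2f}")
```

In this toy setting the ratio feature encodes the relationship between the two columns directly, so the transformed data is easier to fit than the raw columns. Generated feature-engineering code is rewarded on the same principle, for making the downstream model better.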
