The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

January 16, 2025
Authors: Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
cs.AI

Abstract

The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
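
As an illustration of what deduplicating one dataset against others can mean in practice, the following is a minimal Python sketch of exact-match filtering. It is not the authors' pipeline, which the abstract does not describe; the function names `content_hash` and `remove_contaminated` and the whitespace normalization step are assumptions made for this example.

```python
import hashlib


def content_hash(source: str) -> str:
    """Hash a file after collapsing whitespace runs (an assumed normalization),
    so that trivial formatting differences do not hide a duplicate."""
    normalized = " ".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_contaminated(candidates, reference_corpus):
    """Keep only the candidate files whose hash does not appear in the reference corpus."""
    seen = {content_hash(doc) for doc in reference_corpus}
    return [doc for doc in candidates if content_hash(doc) not in seen]


# Hypothetical data: one candidate also exists, reformatted, in a training corpus.
candidates = ["print('hi')\n", "def f():\n    return 1\n"]
reference_corpus = ["def f():\n  return 1"]
clean = remove_contaminated(candidates, reference_corpus)
```

Exact hashing only catches files that are identical after normalization. Catching lightly edited copies would need a near-duplicate technique, which this sketch does not attempt.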
