The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
January 16, 2025
Authors: Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
cs.AI
Abstract
The recent rise in the popularity of large language models has spurred the
development of extensive code datasets needed to train them. This has left
limited code available for collection and use in the downstream investigation
of specific behaviors, or evaluation of large language models without suffering
from data contamination. To address this problem, we release The Heap, a large
multilingual dataset covering 57 programming languages that has been
deduplicated with respect to other open datasets of code, enabling researchers
to conduct fair evaluations of large language models without significant data
cleaning overhead.