The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

January 16, 2025
Authors: Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
cs.AI

Abstract

The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
