The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
January 16, 2025
Authors: Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
cs.AI
Abstract
The recent rise in the popularity of large language models has spurred the
development of extensive code datasets needed to train them. This has left
limited code available for collection and use in the downstream investigation
of specific behaviors, or evaluation of large language models without suffering
from data contamination. To address this problem, we release The Heap, a large
multilingual dataset covering 57 programming languages that has been
deduplicated with respect to other open datasets of code, enabling researchers
to conduct fair evaluations of large language models without significant data
cleaning overhead.