The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Abstract

The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left little code available for collection and use in downstream investigations of specific behaviors, or in evaluations of large language models that do not suffer from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
