Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code

Abstract

This paper presents Coffee-Gym, a comprehensive RL environment for training models that provide feedback on code editing. Coffee-Gym includes two major components: (1) Coffee, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) CoffeeEval, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, Coffee-Gym addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying Coffee-Gym, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
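The idea behind a unit-test-based reward such as CoffeeEval can be illustrated with a minimal sketch. This is not the paper's implementation: the names `pass_rate`, `feedback_reward`, and `toy_editor` are hypothetical, the editor is a hand-written stand-in for a code-editing LLM, and solutions are modeled as plain Python callables purely for brevity. The sketch assumes the reward is the fraction of test cases that the revised code passes.

```python
from typing import Callable, Sequence, Tuple

# A test case is (positional arguments, expected return value). Hypothetical format.
TestCase = Tuple[tuple, object]


def pass_rate(solution: Callable, tests: Sequence[TestCase]) -> float:
    """Return the fraction of test cases the solution passes.

    A raised exception counts as a failed test.
    """
    if not tests:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass  # treat runtime errors as failures
    return passed / len(tests)


def feedback_reward(wrong_code: Callable, feedback: str,
                    editor: Callable, tests: Sequence[TestCase]) -> float:
    """Score feedback by how well the code revised under it performs on the tests."""
    revised = editor(wrong_code, feedback)
    return pass_rate(revised, tests)


# Toy setup (all hypothetical): a buggy solution, its fix, and a stand-in editor.
def buggy_add(a, b):
    return a - b


def fixed_add(a, b):
    return a + b


def toy_editor(code: Callable, feedback: str) -> Callable:
    # Applies the fix only when the feedback names the actual problem.
    return fixed_add if "addition" in feedback else code


tests = [((1, 2), 3), ((0, 0), 0), ((5, -5), 0)]
helpful = feedback_reward(buggy_add, "Use addition instead of subtraction.", toy_editor, tests)
vague = feedback_reward(buggy_add, "Looks wrong.", toy_editor, tests)
print(helpful, vague)
```

In this toy setup, feedback that names the bug lets the editor pass every test, while vague feedback leaves the buggy code unchanged and earns a lower score. A real setup would run model-generated source code in a sandbox rather than call trusted Python functions directly.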
