Evaluating and Aligning CodeLLMs on Human Preference

Abstract

Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises with corresponding test cases, are used as a common measure of the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets and ignore alignment with human preference: queries should be sampled from practical application scenarios, and model-generated responses should satisfy human preference. To bridge the gap between model-generated responses and human preference, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks. It comprises 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose SynCode-Instruct, a diverse synthetic instruction corpus (nearly 20B tokens) built by scaling instructions from the website, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results show performance differences between execution-based benchmarks and CodeArena. Our systematic experiments with CodeArena on 40+ LLMs reveal a notable performance gap between open SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. https://codearenaeval.github.io/
