EpiCoder: Encompassing Diversity and Complexity in Code Generation

Abstract

Effective instruction tuning is indispensable for optimizing code LLMs, aligning model behavior with user expectations and enhancing model performance in real-world applications. However, most existing methods focus on code snippets, which are limited to specific functionalities and rigid structures, restricting the complexity and diversity of the synthesized data. To address these limitations, we introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (ASTs). Unlike an AST, which captures the syntactic structure of code, our framework models semantic relationships between code elements, enabling the generation of more nuanced and diverse data. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features. This process enables the identification of more complex patterns and relationships within the code. By sampling subtrees with controlled depth and breadth, our framework allows precise adjustments to the complexity of the generated code, supporting a wide range of tasks from simple function-level operations to intricate multi-file scenarios. We fine-tuned widely used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels across multiple benchmarks. Notably, empirical evidence indicates that our approach shows significant potential in synthesizing highly complex repository-level code data. Further analysis elucidates the merits of this approach by rigorously assessing data complexity and diversity through software engineering principles and the LLM-as-a-judge method.
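The abstract does not say how subtrees are drawn from the feature tree, so the following is only a minimal sketch of what "sampling subtrees with controlled depth and breadth" could look like. It assumes a feature tree is a plain tree of named features and that breadth is limited by choosing children uniformly at random. All names here (`FeatureNode`, `sample_subtree`, `depth`, `max_fanout`) and the toy feature labels are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureNode:
    """A node in a hypothetical feature tree: a feature name plus child features."""
    name: str
    children: List["FeatureNode"] = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Copy `node`, keeping at most `max_breadth` randomly chosen children
    per node and at most `max_depth` levels below the root.
    Uniform random choice is an assumption made for this sketch."""
    if max_depth <= 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in chosen],
    )


def depth(node):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


def max_fanout(node):
    """Largest number of children held by any single node in the tree."""
    return max([len(node.children)] + [max_fanout(c) for c in node.children])


# Toy tree; the feature names are invented for illustration only.
tree = FeatureNode("program", [
    FeatureNode("data processing", [
        FeatureNode("parsing", [FeatureNode("csv"), FeatureNode("json")]),
        FeatureNode("validation"),
        FeatureNode("aggregation"),
    ]),
    FeatureNode("file operations", [FeatureNode("read"), FeatureNode("write")]),
    FeatureNode("networking", [FeatureNode("http client")]),
])

rng = random.Random(0)
# Shallow, narrow sample: a small feature set, e.g. for a function-level task.
simple = sample_subtree(tree, max_depth=1, max_breadth=2, rng=rng)
# Deeper, wider sample: a richer feature set, e.g. for a multi-file task.
complex_ = sample_subtree(tree, max_depth=3, max_breadth=3, rng=rng)
```

In this sketch, raising `max_depth` and `max_breadth` yields a larger set of related features to condition generation on, which is one plausible way the two limits could act as complexity controls.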
