EpiCoder: Encompassing Diversity and Complexity in Code Generation
January 8, 2025
Authors: Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, Scarlett Li
cs.AI
Abstract
Effective instruction tuning is indispensable for optimizing code LLMs,
aligning model behavior with user expectations and enhancing model performance
in real-world applications. However, most existing methods focus on code
snippets, which are limited to specific functionalities and rigid structures,
restricting the complexity and diversity of the synthesized data. To address
these limitations, we introduce a novel feature tree-based synthesis framework
inspired by Abstract Syntax Trees (ASTs). Unlike an AST, which captures the syntactic
structure of code, our framework models semantic relationships between code
elements, enabling the generation of more nuanced and diverse data. The feature
tree is constructed from raw data and refined iteratively to increase the
quantity and diversity of the extracted features. This process enables the
identification of more complex patterns and relationships within the code. By
sampling subtrees with controlled depth and breadth, our framework allows
precise adjustments to the complexity of the generated code, supporting a wide
range of tasks from simple function-level operations to intricate multi-file
scenarios. We fine-tuned widely used base models to create the EpiCoder series,
achieving state-of-the-art performance at both the function and file levels
across multiple benchmarks. Notably, empirical evidence indicates that our
approach shows significant potential in synthesizing highly complex
repository-level code data. Further analysis elucidates the merits of this
approach by rigorously assessing data complexity and diversity through software
engineering principles and the LLM-as-a-judge method.
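
The abstract does not give implementation details, so the following is only a minimal sketch of what sampling a subtree with controlled depth and breadth could look like. The `FeatureNode` class, the `sample_subtree` function, the `max_depth` and `max_breadth` parameters, and the toy feature names are all hypothetical, chosen for illustration; this is not the authors' actual framework.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    """Hypothetical node of a feature tree: a feature name plus sub-features."""
    name: str
    children: list[FeatureNode] = field(default_factory=list)


def sample_subtree(node: FeatureNode, max_depth: int, max_breadth: int,
                   rng: random.Random) -> FeatureNode:
    """Copy `node`, keeping at most `max_depth` levels below it and at most
    `max_breadth` randomly chosen children per node."""
    if max_depth <= 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in chosen],
    )


def tree_depth(node: FeatureNode) -> int:
    """Number of edges on the longest path from `node` down to a leaf."""
    if not node.children:
        return 0
    return 1 + max(tree_depth(c) for c in node.children)


def widest_fanout(node: FeatureNode) -> int:
    """Largest number of children held by any single node in the tree."""
    return max([len(node.children)] + [widest_fanout(c) for c in node.children])


# Toy feature tree; the feature names are made up for this example.
TOY_TREE = FeatureNode("data processing", [
    FeatureNode("file io", [FeatureNode("read csv"), FeatureNode("write json")]),
    FeatureNode("validation", [FeatureNode("schema check")]),
    FeatureNode("aggregation", [
        FeatureNode("group by", [FeatureNode("rolling mean")]),
    ]),
])

if __name__ == "__main__":
    rng = random.Random(0)
    # A shallow, narrow sample would correspond to a simpler task ...
    simple = sample_subtree(TOY_TREE, max_depth=1, max_breadth=2, rng=rng)
    # ... while a deeper, wider sample would correspond to a more complex one.
    rich = sample_subtree(TOY_TREE, max_depth=3, max_breadth=3, rng=rng)
    print(tree_depth(simple), widest_fanout(simple))
    print(tree_depth(rich), widest_fanout(rich))
```

In this sketch, lowering `max_depth` and `max_breadth` yields a small set of loosely related features, while raising them yields a larger, more deeply nested set. That is one plausible reading of how such sampling could steer the complexity of the synthesized task.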