EpiCoder: Encompassing Diversity and Complexity in Code Generation
January 8, 2025
Authors: Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, Scarlett Li
cs.AI
Abstract
Effective instruction tuning is indispensable for optimizing code LLMs,
aligning model behavior with user expectations and enhancing model performance
in real-world applications. However, most existing methods focus on code
snippets, which are limited to specific functionalities and rigid structures,
restricting the complexity and diversity of the synthesized data. To address
these limitations, we introduce a novel feature tree-based synthesis framework
inspired by Abstract Syntax Trees (ASTs). Unlike an AST, which captures the syntactic
structure of code, our framework models semantic relationships between code
elements, enabling the generation of more nuanced and diverse data. The feature
tree is constructed from raw data and refined iteratively to increase the
quantity and diversity of the extracted features. This process enables the
identification of more complex patterns and relationships within the code. By
sampling subtrees with controlled depth and breadth, our framework allows
precise adjustments to the complexity of the generated code, supporting a wide
range of tasks from simple function-level operations to intricate multi-file
scenarios. We fine-tuned widely used base models to create the EpiCoder series,
achieving state-of-the-art performance at both the function and file levels
across multiple benchmarks. Notably, empirical evidence indicates that our
approach shows significant potential in synthesizing highly complex
repository-level code data. Further analysis elucidates the merits of this
approach by rigorously assessing data complexity and diversity through software
engineering principles and an LLM-as-a-judge method.
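
The abstract states that subtrees are sampled "with controlled depth and breadth" but does not spell out the procedure. The following is only a minimal sketch of what such sampling could look like, not the authors' implementation. The `FeatureNode` class, the `sample_subtree` function, the `max_depth` and `max_breadth` parameters, and the toy feature names are all hypothetical and chosen purely for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    """One node of a hypothetical feature tree (illustrative, not from the paper)."""
    name: str
    children: list = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Copy `node`, keeping at most `max_breadth` randomly chosen children
    per node and at most `max_depth` levels below the root."""
    if max_depth == 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in chosen],
    )


def depth(node):
    """Number of levels below `node` (a leaf has depth 0)."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


def max_fanout(node):
    """Largest number of children held by any node in the tree."""
    return max([len(node.children)] + [max_fanout(c) for c in node.children])


# Toy tree with assumed feature names.
tree = FeatureNode("data processing", [
    FeatureNode("file I/O", [FeatureNode("read CSV"), FeatureNode("write JSON")]),
    FeatureNode("validation", [FeatureNode("schema check"), FeatureNode("range check")]),
    FeatureNode("aggregation", [FeatureNode("group by"), FeatureNode("rolling mean")]),
])

# Small limits yield a narrow, shallow feature set; larger limits yield a richer one.
simple = sample_subtree(tree, max_depth=1, max_breadth=1, rng=random.Random(0))
richer = sample_subtree(tree, max_depth=2, max_breadth=2, rng=random.Random(0))
```

Under these assumptions, `simple` carries a single feature suited to a function-level task, while `richer` combines several related features, which is the kind of knob the abstract describes for moving toward file-level or multi-file tasks.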