

LoRACode: LoRA Adapters for Code Embeddings

March 7, 2025
Authors: Saumya Chaturvedi, Aman Chadha, Laurent Bindschaedler
cs.AI

Abstract

Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search, across multiple programming languages. Distinguishing between task-wise and language-wise adaptation helps explore the sensitivity of code retrieval to syntactic and linguistic variations.
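To make the "less than two percent trainable parameters" claim concrete, the sketch below shows the core LoRA idea in pure Python: a frozen dense weight W of shape (d_out, d_in) is adapted by a trainable low-rank product B @ A, so only A and B are updated during fine-tuning. The dimensions (768, typical of CodeBERT-sized encoders) and the rank r = 4 are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the LoRA adapter idea (hypothetical dimensions and
# rank; not the authors' implementation). The base weight W stays frozen
# and a low-rank update is learned:
#   W' = W + (alpha / r) * B @ A,  with A: (r, d_in), B: (d_out, r)

def lora_trainable_fraction(d_in: int, d_out: int, r: int) -> float:
    """Fraction of this layer's parameters that are trainable when a
    dense weight of shape (d_out, d_in) is adapted with rank-r LoRA."""
    frozen = d_out * d_in             # original weight, never updated
    trainable = r * d_in + d_out * r  # A (r x d_in) plus B (d_out x r)
    return trainable / (frozen + trainable)

# Example: a 768x768 attention projection adapted with rank 4 (assumed).
frac = lora_trainable_fraction(768, 768, 4)
print(f"trainable fraction of this layer: {frac:.2%}")
```

Because the trainable count grows with r * (d_in + d_out) while the frozen count grows with d_in * d_out, a small rank keeps the trainable share of each adapted layer around one percent here; across a whole model, where LoRA is typically applied only to selected projection matrices, the overall fraction is smaller still, consistent with the sub-two-percent figure reported in the abstract.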

