

Enhancing Code Generation for Low-Resource Languages: No Silver Bullet

January 31, 2025
Authors: Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota
cs.AI

Abstract

The advent of Large Language Models (LLMs) has significantly advanced the field of automated code generation. LLMs rely on large and diverse datasets to learn the syntax, semantics, and usage patterns of programming languages. For low-resource languages (i.e., niche programming languages characterized by the scarcity of training data), the limited availability of such data hampers the models' ability to generalize effectively, resulting in poorer code generation performance as compared to high-resource languages. For this reason, there is a quest for techniques able to close this performance gap. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages, namely: (i) classic fine-tuning, which is however capped in size by the scarcity of training data; (ii) three variants of in-context learning, with prompts crafted to provide the LLM with additional information about the low-resource language (e.g., few-shot examples showcasing features of the targeted language); and (iii) a pre-training objective teaching the model how to translate between high- and low-resource languages. The context of our study is two low-resource languages (R and Racket) and six LLMs having different architectures and sizes. Our findings reveal that fine-tuning is usually the best choice for smaller LLMs, possibly because even a small dataset is sufficient to train their limited number of parameters. As model size increases, in-context learning becomes more and more effective, representing a safe and cheap bet (i.e., it always helps, but with different magnitudes). In contrast, very large LLMs may see their performance on low-resource languages deteriorate when fine-tuning is performed, possibly due to the lack of enough data needed to effectively update their weights.
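To make the in-context learning variant mentioned in point (ii) concrete, the sketch below shows how a few-shot prompt for Racket (one of the two studied low-resource languages) could be assembled before being sent to an LLM. This is an illustrative assumption, not the paper's actual prompt template: the example tasks, solutions, and the build_few_shot_prompt helper are made up for demonstration, and the model call itself is omitted.

# Minimal sketch of few-shot in-context learning for a low-resource language
# (Racket). The tasks and solutions are illustrative, not taken from the paper.

FEW_SHOT_EXAMPLES = [
    (
        "Write a Racket function that returns the square of a number.",
        "(define (square x)\n  (* x x))",
    ),
    (
        "Write a Racket function that sums a list of numbers.",
        "(define (sum-list lst)\n  (foldl + 0 lst))",
    ),
]

def build_few_shot_prompt(task_description: str) -> str:
    """Assemble a prompt that showcases Racket before posing the new task."""
    parts = ["You are a coding assistant generating Racket code.\n"]
    for description, solution in FEW_SHOT_EXAMPLES:
        parts.append(f"Task: {description}\nSolution:\n{solution}\n")
    # The new task comes last; the model is expected to continue with Racket code.
    parts.append(f"Task: {task_description}\nSolution:\n")
    return "\n".join(parts)

if __name__ == "__main__":
    # The resulting string would be passed to the LLM under study;
    # the actual model invocation is left out of this sketch.
    print(build_few_shot_prompt(
        "Write a Racket function that reverses a list without using reverse."
    ))

The other two in-context learning variants described in the abstract would differ only in what extra information the prompt carries (e.g., a short description of the target language's features instead of worked examples).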
