Predicting Emergent Capabilities by Finetuning
November 25, 2024
Authors: Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine
cs.AI
Abstract
A fundamental open challenge in modern LLM scaling is the lack of
understanding around emergent capabilities. In particular, language model
pretraining loss is known to be highly predictable as a function of compute.
However, downstream capabilities are far less predictable -- sometimes even
exhibiting emergent jumps -- which makes it challenging to anticipate the
capabilities of future models. In this work, we first pose the task of
emergence prediction: given access to current LLMs that have random few-shot
accuracy on a task, can we predict whether future models (GPT-N+1) will have
non-trivial accuracy on that task? We then discover a simple insight for this
problem: finetuning LLMs on a given task can shift the point in scaling at
which emergence occurs towards less capable models. To operationalize this
insight, we can finetune LLMs with varying amounts of data and fit a parametric
function that predicts when emergence will occur (i.e., "emergence laws"). We
validate this approach using four standard NLP benchmarks where large-scale
open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and
CoLA). Using only small-scale LLMs, we find that, in some cases, we can
accurately predict whether models trained with up to 4x more compute have
emerged. Finally, we present a case study of two realistic uses for emergence
prediction.
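To make the curve-fitting step concrete, below is a minimal sketch of how an emergence law might be fit: few-shot accuracy is modeled as a sigmoid in log-compute whose midpoint shifts toward smaller scales as the amount of finetuning data grows, and the zero-data midpoint is read off as the predicted few-shot emergence point. The `emergence_law` parameterization and all data values are illustrative assumptions, not the paper's actual functional form or measurements.

```python
# Minimal sketch of fitting an "emergence law" (illustrative only; the
# parameterization and all numbers below are hypothetical assumptions,
# not the paper's actual functional form or data).
import numpy as np
from scipy.optimize import curve_fit

def emergence_law(X, base, amp, midpoint, slope, shift):
    """Sigmoid in log10(compute) whose midpoint moves toward smaller
    scales as the amount of finetuning data grows."""
    log_c, log_d = X  # log10 pretraining compute, log10 # of finetuning examples
    effective_mid = midpoint - shift * log_d
    return base + amp / (1.0 + np.exp(-slope * (log_c - effective_mid)))

# Hypothetical accuracies of small-scale models finetuned on varying data:
log_compute = np.tile([20.0, 20.5, 21.0], 3)
log_data = np.repeat([2.0, 3.0, 4.0], 3)
accuracy = np.array([0.33, 0.44, 0.60,
                     0.44, 0.60, 0.76,
                     0.60, 0.76, 0.87])

params, _ = curve_fit(emergence_law, (log_compute, log_data), accuracy,
                      p0=[0.25, 0.7, 22.0, 2.0, 0.5], maxfev=10_000)
base, amp, midpoint, slope, shift = params

# As finetuning data shrinks toward none (log_d = 0), the effective
# midpoint reverts to `midpoint`: the extrapolated scale at which
# few-shot emergence is predicted to occur.
print(f"Predicted few-shot emergence near 10^{midpoint:.1f} training FLOPs")
```

With the fitted parameters in hand, the same function can be evaluated at larger compute budgets to ask whether a hypothetical future model would clear a non-trivial accuracy threshold.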