Predicting Emergent Capabilities by Finetuning
November 25, 2024
Authors: Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine
cs.AI
Abstract
A fundamental open challenge in modern LLM scaling is the lack of
understanding around emergent capabilities. In particular, language model
pretraining loss is known to be highly predictable as a function of compute.
However, downstream capabilities are far less predictable -- sometimes even
exhibiting emergent jumps -- which makes it challenging to anticipate the
capabilities of future models. In this work, we first pose the task of
emergence prediction: given access to current LLMs that have random few-shot
accuracy on a task, can we predict whether future models (GPT-N+1) will have
non-trivial accuracy on that task? We then discover a simple insight for this
problem: finetuning LLMs on a given task can shift the point in scaling at
which emergence occurs towards less capable models. To operationalize this
insight, we can finetune LLMs with varying amounts of data and fit a parametric
function that predicts when emergence will occur (i.e., "emergence laws"). We
validate this approach using four standard NLP benchmarks where large-scale
open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and
CoLA). Using only small-scale LLMs, we find that, in some cases, we can
accurately predict whether models trained with up to 4x more compute have
emerged. Finally, we present a case study of two realistic uses for emergence
prediction.
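The abstract's core recipe, finetuning at several data levels and fitting a parametric curve that predicts when emergence occurs, can be sketched in a few lines. Below is a minimal sketch, assuming a sigmoid in log-compute whose midpoint shifts toward smaller models as finetuning data grows; this functional form, the variable names, and all numbers are illustrative assumptions, not the authors' actual parameterization or data.

# Hypothetical sketch of fitting an "emergence law". The sigmoid form and
# all data below are assumptions for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def emergence_law(X, a, b, c, k, m):
    """Predicted accuracy given (log10 pretraining FLOPs, log(1 + n_finetune)).

    Assumed form: baseline b plus a sigmoid of height a over log-compute,
    whose midpoint c is shifted earlier by m per unit of log-finetuning-data.
    """
    log_compute, log_n = X
    midpoint = c - m * log_n  # more finetuning data => emergence at less compute
    return b + a / (1.0 + np.exp(-k * (log_compute - midpoint)))

# Synthetic observations: small models finetuned with 0, 1k, or 10k examples
# (fabricated numbers, generated from the assumed law plus noise).
rng = np.random.default_rng(0)
log_compute = np.tile(np.linspace(19.5, 21.5, 6), 3)
log_n = np.repeat(np.log1p([0.0, 1e3, 1e4]), 6)
accuracy = emergence_law((log_compute, log_n), 0.7, 0.25, 22.0, 3.0, 0.2)
accuracy += rng.normal(0.0, 0.01, size=accuracy.shape)

# Fit the parametric curve jointly across all finetuning-data levels.
params, _ = curve_fit(
    emergence_law, (log_compute, log_n), accuracy,
    p0=[0.5, 0.3, 21.0, 2.0, 0.1], maxfev=20_000,
)
a, b, c, k, m = params

# Extrapolate to log_n = 0 (no finetuning) to predict the compute scale at
# which the base model's few-shot accuracy should become non-trivial.
print(f"Predicted few-shot emergence midpoint: 10^{c:.2f} FLOPs")

Extrapolating the fitted midpoint back to the zero-finetuning setting gives a prediction of the compute at which the base model's few-shot accuracy becomes non-trivial, mirroring the emergence-prediction task posed in the abstract.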