Quantifying Generalization Complexity for Large Language Models
October 2, 2024
Authors: Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass
cs.AI
Abstract
While large language models (LLMs) have shown exceptional capabilities in
understanding complex queries and performing sophisticated tasks, their
generalization abilities are often deeply entangled with memorization,
necessitating more precise evaluation. To address this challenge, we introduce
Scylla, a dynamic evaluation framework that quantitatively measures the
generalization abilities of LLMs. Scylla disentangles generalization from
memorization by assessing model performance on both in-distribution (ID) and
out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity.
Through extensive experiments, we uncover a non-monotonic relationship between
task complexity and the performance gap between ID and OOD data, which we term
the generalization valley. Specifically, this phenomenon reveals a critical
threshold - referred to as critical complexity - where reliance on
non-generalizable behavior peaks, indicating the upper bound of LLMs'
generalization capabilities. As model size increases, the critical complexity
shifts toward higher levels of task complexity, suggesting that larger models
can handle more complex reasoning tasks before over-relying on memorization.
Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs,
including both open-source models such as the LLaMA and Qwen families and
closed-source models like Claude and GPT, providing a more robust evaluation
and establishing a clearer understanding of LLMs' generalization capabilities.
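To make the abstract's central quantities concrete, the following is a minimal sketch (not the authors' Scylla implementation) of how one might compute the ID-OOD performance gap at each task-complexity level and locate the critical complexity as the level where that gap peaks. All accuracy numbers are hypothetical placeholders chosen only to illustrate the non-monotonic "generalization valley" shape.

```python
# Hypothetical per-level accuracies on in-distribution (ID) and
# out-of-distribution (OOD) data, indexed by complexity level 1..5.
# These values are illustrative placeholders, not results from the paper.
id_accuracy  = {1: 0.95, 2: 0.91, 3: 0.84, 4: 0.70, 5: 0.55}
ood_accuracy = {1: 0.93, 2: 0.84, 3: 0.66, 4: 0.57, 5: 0.48}

# ID-OOD performance gap per complexity level; the non-monotonic shape
# of this curve is what the paper terms the "generalization valley".
gap = {level: id_accuracy[level] - ood_accuracy[level] for level in id_accuracy}

# Critical complexity: the level at which reliance on non-generalizable
# behavior (memorization) peaks, i.e. where the ID-OOD gap is largest.
critical_complexity = max(gap, key=gap.get)

for level in sorted(gap):
    print(f"level {level}: ID-OOD gap = {gap[level]:.2f}")
print(f"critical complexity: level {critical_complexity}")
```

Under these placeholder numbers the gap rises to a peak at level 3 and then narrows, so the sketch would report level 3 as the critical complexity; the paper's finding is that this peak shifts to higher levels as model size grows.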