Quantifying Generalization Complexity for Large Language Models

Abstract

While large language models (LLMs) have shown exceptional capabilities in understanding complex queries and performing sophisticated tasks, their generalization abilities are often deeply entangled with memorization, necessitating more precise evaluation. To address this challenge, we introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Scylla disentangles generalization from memorization by assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity. Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold, referred to as critical complexity, where reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs' generalization capabilities. As model size increases, the critical complexity shifts toward higher levels of task complexity, suggesting that larger models can handle more complex reasoning tasks before over-relying on memorization. Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs, including both open-source models such as the LLaMA and Qwen families and closed-source models like Claude and GPT, providing a more robust evaluation and establishing a clearer understanding of LLMs' generalization capabilities.
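To make the idea of critical complexity concrete, the following is a minimal sketch, not Scylla's actual implementation. It assumes the performance gap at each complexity level is simply ID accuracy minus OOD accuracy, and that critical complexity is the level where that gap is largest. The function names and the accuracy values are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from typing import List, Sequence


def generalization_gap(id_acc: Sequence[float], ood_acc: Sequence[float]) -> List[float]:
    """Per-level gap between ID and OOD accuracy (assumed definition: ID minus OOD)."""
    if len(id_acc) != len(ood_acc):
        raise ValueError("need one ID and one OOD score per complexity level")
    return [i - o for i, o in zip(id_acc, ood_acc)]


def critical_complexity(gaps: Sequence[float]) -> int:
    """Zero-based index of the complexity level at which the gap peaks."""
    return max(range(len(gaps)), key=lambda k: gaps[k])


# Hypothetical accuracies for 5 complexity levels (placeholders, not paper results).
id_acc = [0.95, 0.90, 0.80, 0.50, 0.20]
ood_acc = [0.93, 0.80, 0.55, 0.40, 0.18]

gaps = generalization_gap(id_acc, ood_acc)
level = critical_complexity(gaps)
print(f"gap peaks at complexity level index {level}")
```

With these placeholder numbers the gap rises and then falls across levels, which is the non-monotonic shape the abstract describes. Under the abstract's claim about model size, a larger model would be expected to show its peak at a higher level index.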
