Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
February 24, 2025
Authors: Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li
cs.AI
Abstract
Rapid advances in computing have dramatically increased the scale and cost
of training Large Language Models (LLMs). Accurately predicting downstream task
performance prior to model training is crucial for efficient resource
allocation, yet remains challenging due to two primary constraints: (1) the
"emergence phenomenon", wherein downstream performance metrics become
meaningful only after extensive training, which limits the ability to use
smaller models for prediction; and (2) uneven task difficulty distributions and the
absence of consistent scaling laws, resulting in substantial metric
variability. Existing performance prediction methods suffer from limited
accuracy and reliability, thereby impeding the assessment of potential LLM
capabilities. To address these challenges, we propose a
Clustering-On-Difficulty (COD) downstream performance prediction framework. COD
first constructs a predictable support subset by clustering tasks based on
difficulty features, strategically excluding non-emergent and non-scalable
clusters. The scores on the selected subset serve as effective intermediate
predictors of downstream performance on the full evaluation set. With
theoretical support, we derive a mapping function that transforms performance
metrics from the predictable subset to the full evaluation set, thereby
ensuring accurate extrapolation of LLM downstream performance. The proposed
method has been applied to predict performance scaling for a 70B LLM, providing
actionable insights for training resource allocation and assisting in
monitoring the training process. Notably, COD achieves remarkable predictive
accuracy on the 70B LLM by leveraging an ensemble of small models,
demonstrating an absolute mean deviation of 1.36% across eight important LLM
evaluation benchmarks.
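To make the pipeline concrete, below is a minimal sketch of the four steps the abstract describes (cluster, filter, extrapolate, map). It assumes per-task accuracies from a ladder of small models as the difficulty features, and substitutes k-means plus linear fits for the paper's actual clustering procedure and derived mapping function; all function names and thresholds are illustrative, not taken from the paper.

```python
# A minimal sketch of the COD pipeline, under assumptions stated above.
import numpy as np
from sklearn.cluster import KMeans

def cod_predict(task_acc, flops, target_flops, n_clusters=8):
    """Predict full-evaluation-set accuracy at `target_flops`.

    task_acc: (n_tasks, n_models) accuracy of each task across small models,
              ordered by increasing training compute `flops` (n_models,).
    """
    # Step 1: cluster tasks on their difficulty features, here taken to be
    # each task's accuracy trajectory across the small-model ladder.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(task_acc)

    # Step 2: keep clusters whose mean accuracy grows consistently with
    # compute (emergent and scalable); drop flat or erratic clusters.
    # The correlation/gain thresholds are arbitrary illustrative values.
    support = []
    for c in range(n_clusters):
        traj = task_acc[labels == c].mean(axis=0)
        if (np.corrcoef(np.log(flops), traj)[0, 1] > 0.9
                and traj[-1] - traj[0] > 0.05):
            support.append(c)
    if not support:
        raise ValueError("no predictable support subset found")
    mask = np.isin(labels, support)

    # Step 3: extrapolate the support subset's mean score; a linear fit in
    # log-compute is a placeholder for the paper's derived scaling form.
    sub = task_acc[mask].mean(axis=0)
    a, b = np.polyfit(np.log(flops), sub, 1)
    sub_pred = a * np.log(target_flops) + b

    # Step 4: map the subset score to the full evaluation set via the
    # empirical subset-to-full relation observed on the small models.
    full = task_acc.mean(axis=0)
    m, k = np.polyfit(sub, full, 1)
    return float(np.clip(m * sub_pred + k, 0.0, 1.0))
```

The design point this sketch tries to surface is the intermediate predictor: the support subset's trend is smooth enough to extrapolate from small models, while the learned subset-to-full mapping reintroduces the contribution of the excluded, harder-to-predict tasks.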