Large-Scale Data Selection for Instruction Tuning
March 3, 2025
Authors: Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi
cs.AI
Abstract
Selecting high-quality training data from a larger pool is a crucial step
when instruction-tuning language models, as carefully curated datasets often
produce models that outperform those trained on much larger, noisier datasets.
Automated data selection approaches for instruction-tuning are typically tested
by selecting small datasets (roughly 10k samples) from small pools (100-200k
samples). However, popular deployed instruction-tuned models often train on
hundreds of thousands to millions of samples, subsampled from even larger data
pools. We present a systematic study of how well data selection methods scale
to these settings, selecting up to 2.5M samples from pools of up to 5.8M
samples and evaluating across 7 diverse tasks. We show that many recently
proposed methods fall short of random selection in this setting (while using
more compute), and even decline in performance when given access to larger
pools of data to select over. However, we find that a variant of
representation-based data selection (RDS+), which uses weighted mean pooling of
pretrained LM hidden states, consistently outperforms more complex methods
across all settings tested -- all whilst being more compute-efficient. Our
findings highlight that the scaling properties of proposed automated selection
methods should be more closely examined. We release our code, data, and models
at https://github.com/hamishivi/automated-instruction-selection.
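To make the idea of weighted mean pooling over hidden states concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the weighting scheme, so the position-based weights (later tokens count more) are an assumption. The helper names `weighted_mean_pool`, `cosine`, and `select_top_k`, and the rule of ranking pool items by their best cosine similarity to a set of query embeddings, are hypothetical and serve only as illustration.

```python
import math


def weighted_mean_pool(hidden_states, mask):
    """Pool per-token vectors into a single embedding.

    hidden_states: list of token vectors (each a list of floats).
    mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    Assumption: the token at 0-based position i gets weight i + 1,
    normalized over the unmasked tokens.
    """
    weights = [(i + 1) * m for i, m in enumerate(mask)]
    total = sum(weights)
    if total == 0:
        raise ValueError("mask selects no tokens")
    dim = len(hidden_states[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, hidden_states)) / total
        for d in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_top_k(pool_embeddings, query_embeddings, k):
    """Return indices of the k pool items most similar to any query.

    Hypothetical selection rule: score each pool item by its best
    cosine similarity against the query embeddings, then keep the top k.
    """
    scores = [
        max(cosine(p, q) for q in query_embeddings) for p in pool_embeddings
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data: three tokens with 2-dimensional states; the last is padding.
pooled = weighted_mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
chosen = select_top_k([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[0.0, 2.0]], k=2)
```

In practice the token vectors would come from a pretrained language model's hidden states rather than hand-written lists, and scoring a pool of millions of samples would call for batched matrix operations instead of Python loops.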