Large-Scale Data Selection for Instruction Tuning

March 3, 2025
Authors: Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi
cs.AI

Abstract

Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested -- all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.
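
As a rough illustration of the representation-based idea mentioned in the abstract, the sketch below pools per-token hidden states into a single vector with a weighted mean, then ranks pool samples by cosine similarity to a set of query samples. It is a minimal toy under stated assumptions, not the authors' implementation: the abstract says only that RDS+ uses weighted mean pooling of pretrained LM hidden states, so the linearly increasing position weights, the max-over-queries scoring rule, and the function names `weighted_mean_pool`, `cosine`, and `select_top_k` are all hypothetical choices made here for illustration. The hidden states are hand-written toy vectors rather than real model outputs.

```python
import math


def weighted_mean_pool(hidden_states, weights=None):
    """Pool a sequence of per-token vectors into one vector.

    hidden_states: list of token vectors (each a list of floats).
    weights: one non-negative weight per token. The default, linearly
    increasing position weights, is an assumption for this sketch and
    is not taken from the abstract.
    """
    n = len(hidden_states)
    if n == 0:
        raise ValueError("need at least one token vector")
    if weights is None:
        weights = [i + 1 for i in range(n)]
    total = float(sum(weights))
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) / total
        for d in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_top_k(pool_embeddings, query_embeddings, k):
    """Return indices of the k pool items most similar to any query.

    Scoring each pool item by its maximum similarity over the queries
    is a hypothetical rule chosen for this sketch.
    """
    scores = [
        max(cosine(p, q) for q in query_embeddings) for p in pool_embeddings
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy "hidden states": three pool samples and one query sample, 2-dim vectors.
pool = [
    weighted_mean_pool([[1.0, 0.0], [1.0, 0.0]]),
    weighted_mean_pool([[0.0, 1.0], [0.0, 1.0]]),
    weighted_mean_pool([[1.0, 0.0], [0.0, 1.0]]),
]
queries = [weighted_mean_pool([[1.0, 0.0]])]
selected = select_top_k(pool, queries, k=2)
print(selected)
```

In a real setting the token vectors would come from a pretrained language model and the pool would hold millions of samples, so the similarity computation would need to be batched on accelerators; the repository linked in the abstract is the place to look for the actual method.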
