SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
March 3, 2025
Authors: Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
cs.AI
Abstract
Existing pre-training data mixing methods for large language models (LLMs) typically follow a domain-wise methodology: a top-down process that first determines domain weights and then performs uniform data sampling within each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Furthermore, uniform sampling within domains ignores fine-grained, sample-specific features, potentially leading to a suboptimal data distribution. To address these shortcomings, we propose SampleMix, a novel sample-wise data mixing approach based on a bottom-up paradigm. SampleMix performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix reaches the baselines' performance with 1.4x to 2.1x fewer training steps, highlighting its substantial potential for optimizing pre-training data.
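To make the bottom-up idea concrete, the minimal Python sketch below shows how per-sample quality and diversity scores could drive a single global sampling pass, with the domain mixture emerging as an output rather than being fixed in advance. The combination rule (a convex mix controlled by `alpha`), the function name `sample_wise_mixing`, and the toy scores are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def sample_wise_mixing(quality, diversity, domains, budget, alpha=0.5, seed=0):
    """Illustrative sketch of bottom-up, sample-wise data mixing.

    quality, diversity : per-sample scores in [0, 1] (assumed to come from
        upstream quality and diversity evaluators).
    domains            : per-sample domain labels (used only for reporting).
    budget             : number of samples to draw for the training set.
    alpha              : assumed trade-off between quality and diversity.
    """
    rng = np.random.default_rng(seed)

    # Combine the two per-sample signals into one sampling weight.
    # (The exact combination function is an assumption of this sketch.)
    weight = alpha * np.asarray(quality) + (1 - alpha) * np.asarray(diversity)
    prob = weight / weight.sum()

    # Global, cross-domain sampling: every document competes against all
    # others, regardless of which domain it belongs to. replace=True
    # crudely allows high-weight documents to be drawn more than once.
    chosen = rng.choice(len(prob), size=budget, replace=True, p=prob)

    # The domain distribution is an *output* of the procedure, not an input.
    labels, counts = np.unique(np.asarray(domains)[chosen], return_counts=True)
    domain_dist = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
    return chosen, domain_dist

# Toy usage with random scores for 10k documents from three domains.
n = 10_000
quality = np.random.rand(n)
diversity = np.random.rand(n)
domains = np.random.choice(["web", "code", "books"], size=n)
_, dist = sample_wise_mixing(quality, diversity, domains, budget=2_000)
print(dist)  # e.g. {'books': 0.33, 'code': 0.34, 'web': 0.33}
```

In this sketch the per-domain proportions are only reported after sampling, mirroring the abstract's claim that the optimal domain distribution is determined dynamically rather than specified up front.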