QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
April 23, 2025
Authors: Fengze Liu, Weidong Zhou, Binbin Liu, Zhimiao Yu, Yifan Zhang, Haobin Lin, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang, Yong Cao
cs.AI
Abstract
Quality and diversity are two critical metrics for the training data of large
language models (LLMs), positively impacting performance. Existing studies
often optimize these metrics separately, typically by first applying quality
filtering and then adjusting data proportions. However, these approaches
overlook the inherent trade-off between quality and diversity, which calls for
considering the two jointly. Given a fixed training quota, it is essential to
evaluate both the quality of each data point and its complementary effect on
the overall dataset. In this paper, we introduce a unified data selection
framework called QuaDMix, which automatically optimizes the data distribution
for LLM pretraining while balancing both quality and diversity. Specifically,
we first propose multiple criteria to measure data quality and employ domain
classification to distinguish data points, thereby measuring overall diversity.
QuaDMix then employs a unified parameterized data sampling function that
determines the sampling probability of each data point based on these
quality- and diversity-related labels. To accelerate the search for the optimal
parameters involved in the QuaDMix framework, we conduct simulated experiments
on smaller models and use LightGBM for parameter search, inspired by the
RegMix method. Our experiments across diverse models and datasets demonstrate
that QuaDMix achieves an average performance improvement of 7.2% across
multiple benchmarks. These results outperform strategies that optimize quality
and diversity independently, highlighting both the necessity and the
feasibility of balancing data quality and diversity.
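
The idea of a parameterized sampling function driven by quality scores and a domain label can be illustrated with a minimal sketch. The abstract does not give the function's actual form, so everything below is an assumption made for illustration: the `DomainParams` fields, the weighted-average merge of quality criteria, and the sigmoid shape are hypothetical and are not taken from the paper.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class DomainParams:
    """Hypothetical per-domain parameters (not the paper's parameterization)."""
    weights: Sequence[float]  # relative weight of each quality criterion
    threshold: float          # merged quality at which probability is max_rate / 2
    sharpness: float          # how steeply probability rises around the threshold
    max_rate: float           # upper bound on the sampling probability for the domain


def merged_quality(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine several quality scores into one value by weighted average."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def sampling_probability(scores: Sequence[float], domain: str,
                         params: Dict[str, DomainParams]) -> float:
    """Map a document's quality scores and domain label to a sampling probability.

    Assumed form: a sigmoid of the merged quality, scaled by a per-domain cap.
    The cap controls how much of each domain is kept (diversity); the sigmoid
    favors higher-quality documents within a domain (quality).
    """
    p = params[domain]
    q = merged_quality(scores, p.weights)
    return p.max_rate / (1.0 + math.exp(-p.sharpness * (q - p.threshold)))


def select(docs: List[dict], params: Dict[str, DomainParams], seed: int = 0) -> List[dict]:
    """Keep each document independently with its sampling probability."""
    rng = random.Random(seed)
    return [d for d in docs
            if rng.random() < sampling_probability(d["scores"], d["domain"], params)]


if __name__ == "__main__":
    params = {
        "code": DomainParams(weights=(1.0, 1.0), threshold=0.5, sharpness=10.0, max_rate=0.8),
        "news": DomainParams(weights=(2.0, 1.0), threshold=0.7, sharpness=5.0, max_rate=0.3),
    }
    docs = [
        {"id": 1, "domain": "code", "scores": (0.9, 0.8)},
        {"id": 2, "domain": "news", "scores": (0.4, 0.6)},
    ]
    for d in docs:
        print(d["id"], round(sampling_probability(d["scores"], d["domain"], params), 3))
```

In a setup like this, the per-domain parameters are what a search procedure would tune: each candidate setting yields a different training mixture, and a regressor fitted on small-model results could then predict which settings are worth trying at scale.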