
Evaluating Sample Utility for Data Selection by Mimicking Model Weights

January 12, 2025
Authors: Tzu-Heng Huang, Manjot Bilkhu, Frederic Sala, Javier Movellan
cs.AI

Abstract

Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook samples' utility in the training process. Instead, we propose a new approach, Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training results in consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
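The abstract describes the score only informally, so the following is a minimal sketch of how such a metric could be computed, assuming the score is the cosine alignment between a sample's negative gradient (the direction a gradient-descent step would move the weights) and the vector pointing from the current weights toward the reference weights. All names (mimic_scores, per_sample_grads, and so on) are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn.functional as F

def mimic_scores(per_sample_grads: torch.Tensor,
                 current_weights: torch.Tensor,
                 reference_weights: torch.Tensor) -> torch.Tensor:
    """Hypothetical Mimic-score sketch (names are illustrative).

    per_sample_grads:  (batch, num_params) flattened per-sample gradients
    current_weights:   (num_params,) flattened weights of the model being trained
    reference_weights: (num_params,) flattened weights of the pretrained reference
    """
    # Vector pointing from the current model toward the reference model
    # in weight space.
    direction = (reference_weights - current_weights).unsqueeze(0)
    # A gradient-descent step moves along the negative gradient, so the
    # cosine between -grad and `direction` measures how much a sample
    # pulls the model toward the reference weights.
    return F.cosine_similarity(-per_sample_grads, direction, dim=1)

# Toy usage: samples whose scores fall below a threshold would be filtered out.
torch.manual_seed(0)
w_new, w_ref = torch.randn(10), torch.randn(10)
grads = torch.randn(4, 10)   # stand-in per-sample gradients
scores = mimic_scores(grads, w_new, w_ref)
keep = scores > 0.0          # threshold is an assumption; the paper's
print(scores, keep)          # Grad-Mimic automates filter construction
```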
