Evaluating Sample Utility for Data Selection by Mimicking Model Weights

Abstract

Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook a sample's utility in the training process. Instead, we propose the Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model's parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic Score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic Scores to guide model training yields consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic Scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
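The alignment idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the score is taken to be the projection of the descent direction (the negative gradient) onto the unit vector pointing from the current weights to the reference weights, and the names `mimic_score`, `filter_samples`, and `threshold` are hypothetical rather than taken from the work itself.

```python
import math


def mimic_score(grad, weights, ref_weights):
    """Hypothetical alignment score (an assumption, not the authors' formula).

    Projects the descent direction (-grad) onto the unit vector that points
    from the current weights toward the reference model's weights.
    Positive: a gradient step on this sample moves toward the reference.
    Negative: the step moves away from it.
    """
    direction = [r - w for r, w in zip(ref_weights, weights)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        # The current weights already coincide with the reference weights.
        return 0.0
    return sum(-g * d for g, d in zip(grad, direction)) / norm


def filter_samples(per_sample_grads, weights, ref_weights, threshold=0.0):
    """Score every sample and keep the indices whose score exceeds threshold.

    The threshold rule is an illustrative choice, not one stated in the text.
    """
    scores = [mimic_score(g, weights, ref_weights) for g in per_sample_grads]
    keep = [i for i, s in enumerate(scores) if s > threshold]
    return scores, keep


# Toy example: two-dimensional weights with three per-sample gradients.
weights = [0.0, 0.0]
ref_weights = [1.0, 0.0]
per_sample_grads = [
    [-2.0, 0.0],  # descent step points toward the reference
    [3.0, 0.0],   # descent step points away from the reference
    [0.0, 5.0],   # descent step is orthogonal to the reference direction
]
scores, keep = filter_samples(per_sample_grads, weights, ref_weights)
```

In this toy setup only the first sample would be kept, since it is the only one whose descent direction has a positive component toward the reference weights. In a real training loop the per-sample gradients would come from the framework in use, and the score might be used to reweight samples rather than to discard them outright.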
