Evaluating Sample Utility for Data Selection by Mimicking Model Weights

Abstract

Foundation models are trained on large-scale web-crawled datasets, which often contain noise, biases, and irrelevant content. Existing data selection techniques typically rely on human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook a sample's utility in the training process. Instead, we propose the Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess how useful each data sample is for training a new model. It relies on the alignment between the gradient of the new model's parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic Score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic Scores to guide model training yields consistent performance gains across six image datasets and improves the performance of CLIP models. Moreover, Mimic Scores and their associated filters improve upon existing filtering methods and offer accurate estimates of dataset quality.
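The alignment idea can be illustrated with a small sketch. The abstract does not give the exact scoring formula, so the code below assumes one plausible form: the inner product between a sample's descent direction (the negative gradient) and the vector from the current weights to the reference weights. The linear model, the squared-error loss, and the names `per_sample_gradient` and `mimic_score` are all hypothetical choices made for illustration, not the paper's implementation.

```python
# Hypothetical sketch: the scoring rule here is an assumption,
# not the exact definition used by Mimic Score or Grad-Mimic.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def per_sample_gradient(theta, x, y):
    """Gradient of 0.5 * (theta . x - y)^2 with respect to theta (toy linear model)."""
    residual = dot(theta, x) - y
    return [residual * xi for xi in x]

def mimic_score(theta, theta_ref, x, y):
    """Alignment of the descent direction (-gradient) with the direction toward theta_ref."""
    grad = per_sample_gradient(theta, x, y)
    direction = [r - t for r, t in zip(theta_ref, theta)]
    return dot([-g for g in grad], direction)

theta = [0.0, 0.0]       # weights of the new model being trained
theta_ref = [1.0, 2.0]   # weights of the pretrained reference model

samples = [
    ([1.0, 0.0], 1.0),   # label agrees with the reference model
    ([0.0, 1.0], 2.0),   # label agrees with the reference model
    ([1.0, 1.0], -3.0),  # noisy label (the reference model predicts 3.0)
]

scores = [mimic_score(theta, theta_ref, x, y) for x, y in samples]

# Keep only samples whose update would move the weights toward the reference.
kept = [s for s, score in zip(samples, scores) if score > 0]
```

In this toy setup the two samples that agree with the reference model receive positive scores, while the mislabeled sample receives a negative score and is filtered out. A practical version would compute per-sample gradients of the actual training loss and might normalize or threshold the scores differently.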
