ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

Abstract

Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to effectively consider the target task distribution. Current approaches either ignore task-specific requirements entirely or rely on approximations that fail to capture the nuanced patterns needed for tasks like Autoformalization or code generation. Methods that do consider the target distribution often rely on simplistic representations, such as hashed n-gram features, which can suffer collisions and introduce noise. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution. In extensive evaluations on Autoformalization and Python code generation, ZIP-FIT significantly outperforms leading baselines like DSIR and D4. Models trained on ZIP-FIT-selected data achieve their lowest cross-entropy loss up to 85.1% faster than baselines, demonstrating that better task alignment leads to more efficient learning. In addition, ZIP-FIT performs selection up to 65.8% faster than DSIR and two orders of magnitude faster than D4. Notably, ZIP-FIT shows that smaller, well-aligned datasets often outperform larger but less targeted ones, demonstrating that a small amount of higher-quality data is superior to a large amount of lower-quality data. Our results imply that task-aware data selection is crucial for efficient domain adaptation, and that compression offers a principled way to measure task alignment. By showing that targeted data selection can dramatically improve task-specific performance, our work provides new insights into the relationship between data quality, task alignment, and model learning efficiency.
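
To make the idea of compression-based alignment concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the scoring formula, so the sketch assumes the normalized compression distance (NCD), a common compression-based similarity measure. The function names (`compressed_size`, `ncd`, `alignment_score`, `select_top_k`) and the averaging over target examples are hypothetical choices made for illustration.

```python
import gzip


def compressed_size(text: str) -> int:
    """Return the length in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance (assumed here, not taken from the abstract).

    Lower values mean the two strings share more compressible structure.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def alignment_score(candidate: str, target_examples: list[str]) -> float:
    """Average similarity (1 - NCD) between a candidate and the target examples."""
    if not target_examples:
        raise ValueError("target_examples must not be empty")
    total = sum(1.0 - ncd(candidate, t) for t in target_examples)
    return total / len(target_examples)


def select_top_k(candidates: list[str], target_examples: list[str], k: int) -> list[str]:
    """Keep the k candidates with the highest alignment score."""
    ranked = sorted(
        candidates,
        key=lambda c: alignment_score(c, target_examples),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy target task: small Python functions.
    targets = [
        "def add(a, b):\n    return a + b\n",
        "def mul(a, b):\n    return a * b\n",
    ]
    pool = [
        "def sub(a, b):\n    return a - b\n",
        "The weather in Paris was mild and rainy last week.",
    ]
    for text in select_top_k(pool, targets, k=1):
        print(repr(text))
```

No embeddings or model forward passes are involved: each score needs only a few gzip calls, which is consistent with the selection-speed advantage the abstract reports. On very short strings the fixed gzip header and trailer dominate the compressed size, so scores of this kind are more informative on longer examples.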
