Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Abstract

Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, by 14.6% for Llama-2-7B, and by 20.3% for CodeLlama-7B, all within 10B training tokens, making them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with a >100B corpus and models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
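
To make the idea of "generating and executing fine-grained operations for each individual example" concrete, the following is a minimal sketch rather than the ProX implementation. The operation names (`Normalize`, `RemoveLines`, `DropDoc`) and the `execute` helper are hypothetical names chosen for illustration. The per-example program is hard-coded here, whereas in the framework described above a small language model would generate it.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Normalize:
    """Hypothetical op: replace every occurrence of `source` with `target`."""
    source: str
    target: str


@dataclass
class RemoveLines:
    """Hypothetical op: delete lines start..end (0-based, inclusive)."""
    start: int
    end: int


@dataclass
class DropDoc:
    """Hypothetical op: discard the whole example."""


Operation = Union[Normalize, RemoveLines, DropDoc]


def execute(program: List[Operation], text: str) -> Optional[str]:
    """Apply a per-example program; return None if the example is dropped.

    Assumes the RemoveLines ranges do not overlap.
    """
    if any(isinstance(op, DropDoc) for op in program):
        return None

    lines = text.split("\n")
    # Remove line ranges from the bottom up so earlier indices stay valid.
    removals = sorted(
        (op for op in program if isinstance(op, RemoveLines)),
        key=lambda op: op.start,
        reverse=True,
    )
    for op in removals:
        del lines[op.start : op.end + 1]

    refined = "\n".join(lines)
    # String normalizations run after removals, in program order.
    for op in program:
        if isinstance(op, Normalize):
            refined = refined.replace(op.source, op.target)
    return refined


# A stand-in for a model-generated program tailored to one raw example.
raw = "Home | Login | Share\nProX refines   each example.\nPage 1 of 9"
program = [RemoveLines(0, 0), RemoveLines(2, 2), Normalize("   ", " ")]
refined = execute(program, raw)
print(refined)  # prints: ProX refines each example.
```

Representing the program as plain data objects instead of evaluating generated source text is a design choice of this sketch: it keeps execution restricted to a small, known set of operations.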

