Crawl4LLM: Efficient Web Crawling for LLM Pretraining

Abstract

Web crawls are a main source of pretraining data for large language models (LLMs), but most crawled webpages are discarded during pretraining because of low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph based on the preferences of LLM pretraining. Specifically, it uses the influence of a webpage on LLM pretraining as the priority score in the web crawler's scheduler, replacing the standard priority based on graph connectivity. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% of the URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performance as those pretrained on previous crawls, significantly reducing crawling waste and alleviating the burden on websites. Our code is publicly available at https://github.com/cxcscmu/Crawl4LLM.
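
The scheduling idea in the abstract can be illustrated with a small priority-queue crawler. This is only a minimal sketch under stated assumptions, not the authors' implementation: the toy graph, the page names, the `TOY_SCORES` values, and the helpers `crawl` and `indegree_scores` are all hypothetical, and the sketch assumes a score is available for every discovered URL, as it would be in an offline simulation over an already-indexed graph. Swapping the scoring function is the only difference between a connectivity-based crawl and a crawl prioritized by a pretraining-oriented score.

```python
import heapq
from typing import Callable, Dict, List, Set

# Hypothetical toy web graph: page -> outgoing links.
TOY_GRAPH: Dict[str, List[str]] = {
    "seed": ["hub", "essay"],
    "hub": ["spam1", "spam2"],
    "essay": ["hub", "tutorial"],
    "spam1": [],
    "spam2": [],
    "tutorial": [],
}

# Hypothetical stand-in for a pretraining-influence score per page.
TOY_SCORES: Dict[str, float] = {
    "seed": 0.5, "hub": 0.2, "essay": 0.9,
    "spam1": 0.1, "spam2": 0.05, "tutorial": 0.8,
}


def indegree_scores(graph: Dict[str, List[str]]) -> Dict[str, int]:
    """Connectivity baseline: score each page by its number of inlinks."""
    counts = {url: 0 for url in graph}
    for links in graph.values():
        for link in links:
            counts[link] = counts.get(link, 0) + 1
    return counts


def crawl(graph: Dict[str, List[str]], seeds: List[str],
          score: Callable[[str], float], budget: int) -> List[str]:
    """Crawl up to `budget` pages, always taking the highest-scoring known page."""
    visited: Set[str] = set(seeds)
    # heapq is a min-heap, so scores are negated to pop the largest first.
    # Ties are broken by URL string, which keeps the order deterministic.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    crawled: List[str] = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in graph.get(url, []):
            if link not in visited:
                visited.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return crawled


if __name__ == "__main__":
    # Same crawler, two schedulers: only the priority function changes.
    print(crawl(TOY_GRAPH, ["seed"], indegree_scores(TOY_GRAPH).get, budget=3))
    print(crawl(TOY_GRAPH, ["seed"], TOY_SCORES.get, budget=3))
```

With a budget of three pages, the inlink-based scheduler spends a fetch on the well-connected but low-value `hub`, while the score-based scheduler reaches `tutorial` instead. The toy numbers are illustrative only and say nothing about the 21% figure reported in the abstract.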
