Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM

Abstract

Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.
