YuLan-Mini: An Open Data-efficient Language Model

Abstract

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.
