SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
December 11, 2024
Author: Sultan Alrashed
cs.AI
Abstract
We present SmolTulu-1.7b-Instruct, referenced in this report as
SmolTulu-DPO-1130, an instruction-tuned language model that adapts AllenAI's
Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model.
Through comprehensive empirical analysis using a 135M parameter model, we
demonstrate that the relationship between learning rate and batch size
significantly impacts model performance in a task-dependent manner. Our
findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from
higher learning rate to batch size ratios, while pattern recognition tasks such
as HellaSwag and IFEval show optimal performance with lower ratios. These
insights informed the development of SmolTulu, which achieves state-of-the-art
performance among sub-2B parameter models on instruction following, scoring
67.7% on IFEval (Δ11%), and mathematical reasoning with 51.6% on GSM8K
(Δ3.4%), with an alternate version achieving 57.1% on ARC (Δ5.4%). We
release our model, training recipes, and ablation studies to
facilitate further research in efficient model alignment, demonstrating that
careful adaptation of optimization dynamics can help bridge the capability gap
between small and large language models.
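To make the quantity at the center of the abstract concrete, here is a minimal sketch of how a learning rate to batch size ratio might be compared across two training configurations. The numeric values and the `TrainConfig` helper are illustrative assumptions, not the paper's actual hyperparameters.

```python
# Minimal sketch: comparing learning-rate-to-batch-size ratios for two
# hypothetical configurations. All values are illustrative, not taken
# from the SmolTulu training recipe.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    name: str
    learning_rate: float
    batch_size: int

    @property
    def lr_to_bs_ratio(self) -> float:
        # The quantity the abstract highlights: learning rate / batch size.
        return self.learning_rate / self.batch_size


# Higher ratio: the abstract reports this benefits reasoning tasks (ARC, GSM8K).
reasoning_cfg = TrainConfig("reasoning-leaning", learning_rate=9e-5, batch_size=8)

# Lower ratio: reported to suit pattern-recognition tasks (HellaSwag, IFEval).
pattern_cfg = TrainConfig("pattern-leaning", learning_rate=2e-5, batch_size=32)

for cfg in (reasoning_cfg, pattern_cfg):
    print(
        f"{cfg.name}: lr={cfg.learning_rate:g}, bs={cfg.batch_size}, "
        f"lr/bs={cfg.lr_to_bs_ratio:.2e}"
    )
```

In this framing, the ratio (rather than the learning rate or batch size alone) is the knob that would be swept per task family during ablations.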