TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

November 22, 2024
作者: Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi
cs.AI

Abstract

Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the TÜLU 3 model weights and demo, we release the complete recipe, including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.
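
As a rough illustration of the RLVR idea named in the abstract: instead of scoring completions with a learned reward model, the policy receives a reward only when its output can be programmatically verified against ground truth. The sketch below shows this for final-answer matching; the "Answer:" convention and the extract_final_answer helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a verifiable reward in the spirit of RLVR: the reward
# is 1.0 when the completion's final answer checks out against a known
# ground truth, and 0.0 otherwise; no learned reward model is involved.

def extract_final_answer(completion: str) -> str | None:
    """Illustrative extractor: return whatever follows the last 'Answer:' marker."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    return completion[idx + len(marker):].strip() if idx != -1 else None


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the ground truth."""
    predicted = extract_final_answer(completion)
    return 1.0 if predicted == ground_truth.strip() else 0.0


# Example:
# verifiable_reward("Compute 6 * 7. Answer: 42", "42")  -> 1.0
# verifiable_reward("Compute 6 * 7. Answer: 41", "42")  -> 0.0
```

A binary signal like this can then stand in for a reward-model score inside a standard policy-gradient RL loop; the appeal is that a programmatic check cannot be gamed the way a learned reward model can.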
