WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
January 30, 2025
Authors: Benjamin Feuer, Chinmay Hegde
cs.AI
Abstract
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic-data-generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples, and code are available at https://github.com/penfever/wildchat-50m.
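As a rough illustration of how a chat dataset of this kind might be consumed, the sketch below loads one split with the Hugging Face `datasets` library and filters for responses from a single generating model. The Hub identifier, split name, field names, and model string are all assumptions for illustration, not details confirmed by this abstract; the repository linked above documents the actual data layout.

```python
# Minimal sketch: load WILDCHAT-50M-style data and filter by generating model.
# The dataset identifier and the field names ("model", etc.) are hypothetical;
# see https://github.com/penfever/wildchat-50m for the real formats.
from datasets import load_dataset

# Hypothetical Hugging Face Hub identifier and split name.
ds = load_dataset("penfever/wildchat-50m", split="train")

# Keep only conversations whose assistant turns came from one particular
# open-weight model (field name and model string assumed).
subset = ds.filter(lambda example: example["model"] == "Qwen2.5-72B-Instruct")

print(f"{len(subset)} conversations from this generating model")
```

Because the paper compares responses from more than 50 generating models, grouping or filtering on a per-model field like this is the natural first step for any comparative analysis over the corpus.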