DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
April 13, 2025
Authors: Zhenting Wang, Guofeng Cui, Kun Wan, Wentian Zhao
cs.AI
Abstract
Recent advances in reinforcement learning (RL)-based post-training have led
to notable improvements in large language models (LLMs), particularly in
enhancing their reasoning capabilities to handle complex tasks. However, most
existing methods treat the training data as a unified whole, overlooking the
fact that modern LLM training often involves a mixture of data from diverse
distributions, varying in both source and difficulty. This heterogeneity
introduces a key challenge: how to adaptively schedule training across
distributions to optimize learning efficiency. In this paper, we present a
principled curriculum learning framework grounded in the notion of
distribution-level learnability. Our core insight is that the magnitude of
policy advantages reflects how much a model can still benefit from further
training on a given distribution. Based on this, we propose a
distribution-level curriculum learning framework for RL-based LLM
post-training, which leverages the Upper Confidence Bound (UCB) principle to
dynamically adjust sampling probabilities for different distributions. This
approach prioritizes distributions with either high average advantage
(exploitation) or low sample count (exploration), yielding an adaptive and
theoretically grounded training schedule. We instantiate our curriculum
learning framework with GRPO as the underlying RL algorithm and demonstrate its
effectiveness on logic reasoning datasets spanning multiple difficulty levels and
sources. Our experiments show that our framework significantly improves
convergence speed and final performance, highlighting the value of
distribution-aware curriculum strategies in LLM post-training. Code:
https://github.com/ZhentingWang/DUMP
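The learnability signal described in the abstract is built from policy advantages produced by GRPO. As context, the following is a minimal sketch of GRPO's group-relative advantage computation, whose magnitude (averaged per distribution) serves as the abstract's exploitation signal; the function name, reward values, and epsilon are illustrative assumptions, not the authors' code.

```python
import statistics

def grpo_group_advantages(group_rewards, eps=1e-6):
    """Group-relative advantage for each of G responses sampled from the
    same prompt: reward normalized by the group's mean and std (GRPO)."""
    mean_r = statistics.fmean(group_rewards)
    std_r = statistics.pstdev(group_rewards)
    return [(r - mean_r) / (std_r + eps) for r in group_rewards]

# Example: four responses to one prompt with binary correctness rewards.
print(grpo_group_advantages([1.0, 0.0, 0.0, 1.0]))
```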
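Below is a hedged sketch of how UCB-style scheduling over data distributions could look: each distribution is scored by its mean absolute advantage (exploitation) plus a bonus that shrinks with its visit count (exploration), and scores are turned into sampling probabilities. The class name, exploration coefficient `c`, softmax temperature, and the training-loop placeholders are assumptions, not the paper's actual implementation (see the linked repository for that).

```python
import math
import random
from collections import defaultdict

class UCBDistributionScheduler:
    """Sketch of UCB-based sampling over data distributions.

    Each distribution keeps a running mean of the absolute policy
    advantages seen on its batches (exploitation signal) and a visit
    count (exploration signal). All names and hyperparameters here
    are illustrative assumptions.
    """

    def __init__(self, distribution_names, c=1.0, temperature=1.0):
        self.names = list(distribution_names)
        self.c = c                          # exploration coefficient
        self.temperature = temperature      # softmax temperature for probabilities
        self.counts = defaultdict(int)      # batches drawn from each distribution
        self.adv_sums = defaultdict(float)  # running sum of mean |advantage|
        self.total = 0                      # total batches drawn so far

    def update(self, name, advantages):
        """Record the advantage magnitudes of one batch drawn from `name`."""
        mean_abs = sum(abs(a) for a in advantages) / max(len(advantages), 1)
        self.counts[name] += 1
        self.adv_sums[name] += mean_abs
        self.total += 1

    def ucb_score(self, name):
        """High mean |advantage| (exploit) or low visit count (explore)."""
        mean_adv = self.adv_sums[name] / self.counts[name]
        bonus = self.c * math.sqrt(math.log(self.total + 1) / self.counts[name])
        return mean_adv + bonus

    def sampling_probs(self):
        """Turn UCB scores into sampling probabilities with a softmax."""
        unvisited = [n for n in self.names if self.counts[n] == 0]
        if unvisited:  # make sure every distribution is tried at least once
            return {n: (1.0 / len(unvisited) if n in unvisited else 0.0)
                    for n in self.names}
        scores = {n: self.ucb_score(n) for n in self.names}
        m = max(scores.values())
        exps = {n: math.exp((s - m) / self.temperature) for n, s in scores.items()}
        z = sum(exps.values())
        return {n: e / z for n, e in exps.items()}

    def sample_distribution(self):
        """Pick the distribution to draw the next training batch from."""
        probs = self.sampling_probs()
        return random.choices(self.names, weights=[probs[n] for n in self.names], k=1)[0]


if __name__ == "__main__":
    # Hypothetical distributions of increasing difficulty.
    sched = UCBDistributionScheduler(["dist_easy", "dist_medium", "dist_hard"])
    for _ in range(10):
        d = sched.sample_distribution()
        # In real training: sample prompts from d, run GRPO rollouts,
        # and feed the resulting advantages back to the scheduler.
        fake_advantages = [random.uniform(-1, 1) for _ in range(8)]
        sched.update(d, fake_advantages)
    print(sched.sampling_probs())
```

In an actual training pipeline, the `fake_advantages` placeholder would be replaced by the advantages returned by GRPO rollouts on prompts drawn from the selected distribution.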