
DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training

April 13, 2025
Authors: Zhenting Wang, Guofeng Cui, Kun Wan, Wentian Zhao
cs.AI

Abstract

Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions, varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distributions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulty levels and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.
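Below is a minimal sketch of the UCB-style scheduling idea the abstract describes, assuming the standard UCB1 bonus term. The class name `DistributionCurriculum`, its methods, and the use of the mean absolute policy advantage as the learnability signal are illustrative assumptions for this sketch, not the paper's actual implementation (see the linked repository for that).

```python
import math


class DistributionCurriculum:
    """Illustrative UCB-style scheduler over data distributions.

    Each distribution's "learnability" is tracked as a running mean of the
    absolute policy advantages observed on batches drawn from it; a UCB
    bonus additionally favors rarely sampled distributions (exploration).
    """

    def __init__(self, num_distributions, exploration_coef=1.0):
        self.n = num_distributions
        self.c = exploration_coef
        self.counts = [0] * num_distributions          # batches drawn per distribution
        self.mean_abs_adv = [0.0] * num_distributions  # running mean of |advantage|

    def select(self):
        """Return the index of the distribution with the highest UCB score."""
        total = sum(self.counts) + 1
        scores = []
        for i in range(self.n):
            if self.counts[i] == 0:
                scores.append(float("inf"))            # force initial exploration
            else:
                bonus = self.c * math.sqrt(math.log(total) / self.counts[i])
                scores.append(self.mean_abs_adv[i] + bonus)
        return max(range(self.n), key=lambda i: scores[i])

    def update(self, dist_idx, batch_advantages):
        """Fold the advantages of the latest RL (e.g. GRPO) batch into the running mean."""
        self.counts[dist_idx] += 1
        batch_mean = sum(abs(a) for a in batch_advantages) / len(batch_advantages)
        k = self.counts[dist_idx]
        self.mean_abs_adv[dist_idx] += (batch_mean - self.mean_abs_adv[dist_idx]) / k


# Hypothetical usage: three data distributions of increasing difficulty.
curriculum = DistributionCurriculum(num_distributions=3, exploration_coef=0.5)
for step in range(100):
    d = curriculum.select()
    # ... sample a batch from distribution d, run a GRPO update, collect its advantages ...
    advantages = [0.0]  # placeholder for the batch's per-sample advantages
    curriculum.update(d, advantages)
```

The design choice mirrored here is that a large average advantage magnitude signals a distribution the policy can still learn from (exploitation), while the square-root bonus keeps under-sampled distributions in rotation (exploration); a deterministic argmax is used in this sketch, whereas the paper adjusts sampling probabilities.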
