Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
April 7, 2025
Authors: Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, Jieyu Zhao
cs.AI
Abstract
Reinforcement finetuning (RFT) has shown great potential for enhancing the
mathematical reasoning capabilities of large language models (LLMs), but it is
often sample- and compute-inefficient, requiring extensive training. In this
work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a
method that significantly improves both the efficiency and final accuracy of
RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the
difficulty of training problems based on the model's recent reward signals,
ensuring that the model consistently trains on tasks that are challenging but
solvable. This adaptive sampling strategy accelerates learning by maintaining
an optimal difficulty range, avoiding wasted computation on problems that are
too easy or too hard. AdaRFT requires only a lightweight extension to standard
RFT algorithms like Proximal Policy Optimization (PPO), without modifying the
reward function or model architecture. Experiments on competition-level math
datasets, including AMC, AIME, and IMO-style problems, demonstrate that AdaRFT
significantly improves both training efficiency and reasoning performance. We
evaluate AdaRFT across multiple data distributions and model sizes, showing
that it reduces the number of training steps by up to 2x and improves accuracy
by a considerable margin, offering a more scalable and effective RFT framework.
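To make the adaptive sampling idea concrete, below is a minimal, illustrative Python sketch of a reward-driven curriculum sampler. It assumes each training problem carries a scalar difficulty score and that a target difficulty is nudged toward harder problems when recent rewards exceed a desired success rate; the names (`target_difficulty`, `target_reward`, `step_size`) and the exact update rule are assumptions for illustration, not the paper's precise formulation.

```python
import random


class AdaptiveCurriculumSampler:
    """Illustrative sketch of reward-driven curriculum sampling.

    Assumes each problem is a dict with a "difficulty" float in [0, 1] and
    that a target difficulty moves up or down depending on whether the
    model's recent average reward exceeds a desired success rate. The
    update rule and hyperparameter names are assumptions, not the paper's
    exact algorithm.
    """

    def __init__(self, problems, target_reward=0.5, step_size=0.05):
        self.problems = problems
        self.target_reward = target_reward   # desired success rate
        self.step_size = step_size           # how fast the target difficulty moves
        self.target_difficulty = 0.0         # start from the easiest problems

    def sample_batch(self, batch_size, window=0.1):
        # Prefer problems whose difficulty lies near the current target.
        candidates = [p for p in self.problems
                      if abs(p["difficulty"] - self.target_difficulty) <= window]
        pool = candidates if candidates else self.problems
        return random.sample(pool, min(batch_size, len(pool)))

    def update(self, recent_rewards):
        # If the model is beating the target success rate, raise the
        # difficulty; otherwise lower it, keeping it within [0, 1].
        avg_reward = sum(recent_rewards) / max(len(recent_rewards), 1)
        self.target_difficulty += self.step_size * (avg_reward - self.target_reward)
        self.target_difficulty = min(max(self.target_difficulty, 0.0), 1.0)
```

In a training loop, one would call `sample_batch` before each PPO update and `update` with the resulting batch rewards afterwards, leaving the PPO objective and reward function untouched, consistent with the abstract's description of AdaRFT as a lightweight extension to standard RFT algorithms.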