Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
April 7, 2025
Authors: Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, Jieyu Zhao
cs.AI
Abstract
Reinforcement finetuning (RFT) has shown great potential for enhancing the
mathematical reasoning capabilities of large language models (LLMs), but it is
often sample- and compute-inefficient, requiring extensive training. In this
work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a
method that significantly improves both the efficiency and final accuracy of
RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the
difficulty of training problems based on the model's recent reward signals,
ensuring that the model consistently trains on tasks that are challenging but
solvable. This adaptive sampling strategy accelerates learning by maintaining
an optimal difficulty range, avoiding wasted computation on problems that are
too easy or too hard. AdaRFT requires only a lightweight extension to standard
RFT algorithms like Proximal Policy Optimization (PPO), without modifying the
reward function or model architecture. Experiments on competition-level math
datasets, including AMC, AIME, and IMO-style problems, demonstrate that AdaRFT
significantly improves both training efficiency and reasoning performance. We
evaluate AdaRFT across multiple data distributions and model sizes, showing
that it reduces the number of training steps by up to 2x and improves accuracy
by a considerable margin, offering a more scalable and effective RFT framework.
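To make the adaptive sampling idea concrete, below is a minimal, illustrative Python sketch of a reward-driven curriculum sampler. It assumes each training problem carries a scalar difficulty score and that a target difficulty is nudged toward harder problems when recent rewards exceed a desired success rate; the names (`target_difficulty`, `target_reward`, `step_size`) and the exact update rule are assumptions for illustration, not the paper's precise formulation.

```python
import random


class AdaptiveCurriculumSampler:
    """Illustrative sketch of reward-driven curriculum sampling.

    Assumes each problem is a dict with a "difficulty" float in [0, 1] and
    that a target difficulty moves up or down depending on whether the
    model's recent average reward exceeds a desired success rate. The
    update rule and hyperparameter names are assumptions, not the paper's
    exact algorithm.
    """

    def __init__(self, problems, target_reward=0.5, step_size=0.05):
        self.problems = problems
        self.target_reward = target_reward   # desired success rate
        self.step_size = step_size           # how fast the target difficulty moves
        self.target_difficulty = 0.0         # start from the easiest problems

    def sample_batch(self, batch_size, window=0.1):
        # Prefer problems whose difficulty lies near the current target.
        candidates = [p for p in self.problems
                      if abs(p["difficulty"] - self.target_difficulty) <= window]
        pool = candidates if candidates else self.problems
        return random.sample(pool, min(batch_size, len(pool)))

    def update(self, recent_rewards):
        # If the model is beating the target success rate, raise the
        # difficulty; otherwise lower it, keeping it within [0, 1].
        avg_reward = sum(recent_rewards) / max(len(recent_rewards), 1)
        self.target_difficulty += self.step_size * (avg_reward - self.target_reward)
        self.target_difficulty = min(max(self.target_difficulty, 0.0), 1.0)
```

In a training loop, one would call `sample_batch` before each PPO update and `update` with the resulting batch rewards afterwards, leaving the PPO objective and reward function untouched, consistent with the abstract's description of AdaRFT as a lightweight extension to standard RFT algorithms.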