

Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

April 7, 2025
Authors: Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, Jieyu Zhao
cs.AI

Abstract

Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets, including AMC, AIME, and IMO-style problems, demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces the number of training steps by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
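The adaptive sampling idea described in the abstract can be illustrated with a minimal sketch: keep a running target difficulty, shift it upward when recent rewards exceed a target success rate and downward when they fall below it, and preferentially sample problems whose difficulty is near that target, while leaving the PPO update itself untouched. The linear update rule, the constants (ALPHA, TARGET_REWARD, BETA), and the helper names below are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

# Minimal sketch of reward-driven curriculum sampling for RFT.
# The linear target-update rule and the constants ALPHA, TARGET_REWARD,
# and BETA are illustrative assumptions, not taken from the paper.

ALPHA = 0.1          # how quickly the curriculum shifts (assumed)
TARGET_REWARD = 0.5  # desired success rate on sampled problems (assumed)
BETA = 1.0           # sharpness of the difficulty-matching weights (assumed)


def update_target_difficulty(target_difficulty, recent_reward):
    """Move toward harder problems when the model does better than the
    target reward, and toward easier ones when it does worse."""
    return target_difficulty + ALPHA * (recent_reward - TARGET_REWARD)


def sample_problem(problems, target_difficulty):
    """Sample a problem with weight that decays as its difficulty score
    moves away from the current target difficulty."""
    weights = [
        math.exp(-BETA * abs(p["difficulty"] - target_difficulty))
        for p in problems
    ]
    return random.choices(problems, weights=weights, k=1)[0]


def rft_with_adaptive_curriculum(problems, run_rft_step, num_steps,
                                 init_difficulty=0.0):
    """Outer loop: the reward from each (unmodified) RFT/PPO step drives
    which problems are sampled next."""
    target_difficulty = init_difficulty
    for _ in range(num_steps):
        problem = sample_problem(problems, target_difficulty)
        reward = run_rft_step(problem)  # stand-in for one PPO update; returns mean reward
        target_difficulty = update_target_difficulty(target_difficulty, reward)
    return target_difficulty
```

Here `problems` is assumed to be a list of dicts with a precomputed `difficulty` score, and `run_rft_step` is a placeholder for one standard PPO-based RFT update; only the sampling distribution changes over training.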
