Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
Abstract
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
AI-Generated Summary
Paper Overview
Core Contribution
- Investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs under strict computational constraints.
- Demonstrates significant reasoning gains with minimal resources, achieving competitive performance at a fraction of the cost of baseline models.
- Provides actionable insights into RL-based fine-tuning for small LLMs, bridging the gap between theoretical advancements and real-world applicability.
- Releases open-source code and datasets to foster reproducibility and further exploration.
Research Context
- Focuses on enhancing reasoning capabilities of small LLMs (1.5 billion parameters) in resource-constrained settings.
- Addresses the limitations of large-scale approaches that rely on massive computational resources and datasets.
- Builds on prior work in RL and reasoning enhancement, adapting the Group Relative Policy Optimization (GRPO) algorithm for small LLMs.
Keywords
- Reinforcement Learning (RL)
- Small LLMs
- Reasoning Enhancement
- Resource Constraints
- Group Relative Policy Optimization (GRPO)
- Mathematical Reasoning
- Cost-Effective Training
Background
Research Gap
- Limited accessibility of large-scale reasoning enhancement methods due to high computational costs.
- Lack of research on RL-based fine-tuning for small LLMs under strict resource constraints.
- Need for scalable, cost-effective approaches to democratize advanced AI technologies.
Technical Challenges
- Optimization instability with prolonged training.
- Length constraints affecting reasoning depth.
- Multilingual drift in base models.
- Balancing data efficiency and performance.
Prior Approaches
- Supervised fine-tuning (SFT) and RL for reasoning enhancement in large LLMs.
- Chain-of-Thought (CoT) prompting and verification mechanisms.
- Process-based reward models and search algorithms like Monte Carlo Tree Search (MCTS).
Methodology
Technical Architecture
- Utilizes the Group Relative Policy Optimization (GRPO) algorithm for RL-based fine-tuning, which scores each completion against the other completions sampled for the same prompt (see the sketch after this list).
- Employs a compact, high-quality mathematical reasoning dataset for efficient training.
- Trains on a cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each) within a 24-hour window.
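A minimal sketch of the group-relative advantage estimation that lets GRPO drop the separate critic model, as referenced in the list above. The function name, tensor shapes, and normalization constant are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Advantages for a group of completions sampled from the same prompt.

    rewards: shape (G,), one scalar reward per completion in the group.
    The group's own mean and standard deviation serve as the baseline, so no
    separate learned critic (value network) is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled solutions to one math prompt; two earned the accuracy reward.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.2])
advantages = group_relative_advantages(rewards)
# Above-average completions receive positive advantages (reinforced) and
# below-average ones negative advantages in the policy-gradient update.
```

Because the baseline comes from statistics of the sampled group itself, memory and compute scale only with the policy model, which is what makes the approach feasible on 4 A40 GPUs.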
Implementation Details
- Curates a pool of 39,659 mathematical reasoning questions from existing sources, from which compact training subsets are drawn.
- Adapts the GRPO algorithm to eliminate the need for a separate critic model, reducing computational overhead.
- Implements a rule-based reward system with accuracy, cosine, and format rewards.
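The rule-based reward system named above can be illustrated roughly as follows. The boxed-answer matching, the `<think>` template check, the character-based length measure, the maximum length, the cosine endpoint values, and the simple summation of components are all assumptions made for illustration, not the paper's exact rules.

```python
import math
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final \\boxed{...} answer matches the reference, else 0.0.
    (Simplified: ignores nested braces and equivalent answer forms.)"""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    """Bonus when reasoning is wrapped in <think>...</think> before the answer
    (the exact required template is an assumption)."""
    return 1.0 if re.fullmatch(r"<think>.*?</think>.*", completion, flags=re.DOTALL) else 0.0

def cosine_reward(is_correct: bool, length: int, max_len: int = 3584) -> float:
    """Length-scaled reward via cosine interpolation: shorter correct completions
    score higher than longer correct ones, and shorter incorrect completions are
    penalized harder, discouraging runaway output lengths."""
    progress = min(length / max_len, 1.0)
    if is_correct:
        start, end = 1.0, 0.5      # assumed values at length 0 and at max_len
    else:
        start, end = -1.0, -0.5    # assumed endpoints for incorrect completions
    return end + 0.5 * (start - end) * (1.0 + math.cos(progress * math.pi))

def total_reward(completion: str, gold_answer: str) -> float:
    """Combine the components; equal weighting is an assumption."""
    acc = accuracy_reward(completion, gold_answer)
    return acc + format_reward(completion) + cosine_reward(acc == 1.0, len(completion))
```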
Innovation Points
- Demonstrates rapid reasoning improvements with limited high-quality data.
- Introduces a mix of easy and hard problems to stabilize training and reduce completion lengths (see the sampling sketch after this list).
- Uses cosine rewards to control output length and improve training consistency.
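A hypothetical sketch of assembling the mixed-difficulty subset referenced above. The pool names, the 7,000-sample budget (taken from the abstract), and the easy/hard ratio are assumptions, not the paper's reported split.

```python
import random

def build_mixed_subset(easy_pool: list, hard_pool: list,
                       n_total: int = 7000, easy_fraction: float = 0.3,
                       seed: int = 42) -> list:
    """Sample a fixed fraction of easier problems alongside hard ones so early
    training sees solvable prompts (non-zero rewards) while still covering
    challenging items."""
    rng = random.Random(seed)
    n_easy = int(n_total * easy_fraction)
    n_hard = n_total - n_easy
    subset = rng.sample(easy_pool, n_easy) + rng.sample(hard_pool, n_hard)
    rng.shuffle(subset)
    return subset
```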
Results
Experimental Setup
- Conducts three experiments to explore model behavior and performance under resource constraints.
- Evaluates performance on mathematical reasoning benchmarks: AMC23, AIME24, MATH-500, Minerva, and OlympiadBench.
- Compares results against baseline models, including Llama-3.1-70B-Instruct, o1-preview, and DeepScaleR-1.5B-Preview.
Key Findings
- Small LLMs achieve rapid reasoning improvements within 50–100 steps, but performance degrades with prolonged training.
- Incorporating a mix of easy and hard problems enhances early performance and stabilizes reasoning behavior.
- Cosine rewards stabilize completion lengths but require extended length limits for extremely hard tasks.
- Open-RS variants outperform most baselines, achieving competitive reasoning performance with minimal resources.
Limitations
- Training duration and length constraints limit the exploration of long-term behavior and complex tasks.
- Multilingual drift in base models complicates monolingual optimization.
- Evaluation focused exclusively on mathematical reasoning, leaving generalizability to other domains unexplored.