Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
Abstract
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
AI-Generated Summary
Paper Overview
Core Contribution
- Investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs under strict computational constraints.
- Demonstrates significant reasoning gains with minimal resources, achieving competitive performance at a fraction of the cost of baseline models.
- Provides actionable insights into RL-based fine-tuning for small LLMs, bridging the gap between theoretical advancements and real-world applicability.
- Releases open-source code and datasets to foster reproducibility and further exploration.
Research Context
- Focuses on enhancing reasoning capabilities of small LLMs (1.5 billion parameters) in resource-constrained settings.
- Addresses the limitations of large-scale approaches that rely on massive computational resources and datasets.
- Builds on prior work in RL and reasoning enhancement, adapting the Group Relative Policy Optimization (GRPO) algorithm for small LLMs.
Keywords
- Reinforcement Learning (RL)
- Small LLMs
- Reasoning Enhancement
- Resource Constraints
- Group Relative Policy Optimization (GRPO)
- Mathematical Reasoning
- Cost-Effective Training
Background
Research Gap
- Limited accessibility of large-scale reasoning enhancement methods due to high computational costs.
- Lack of research on RL-based fine-tuning for small LLMs under strict resource constraints.
- Need for scalable, cost-effective approaches to democratize advanced AI technologies.
Technical Challenges
- Optimization instability with prolonged training.
- Length constraints affecting reasoning depth.
- Multilingual drift in base models.
- Balancing data efficiency and performance.
Prior Approaches
- Supervised fine-tuning (SFT) and RL for reasoning enhancement in large LLMs.
- Chain-of-Thought (CoT) prompting and verification mechanisms.
- Process-based reward models and search algorithms like Monte Carlo Tree Search (MCTS).
Methodology
Technical Architecture
- Utilizes the Group Relative Policy Optimization (GRPO) algorithm for RL-based fine-tuning, which scores each completion against the other completions sampled for the same prompt (see the sketch after this list).
- Employs a compact, high-quality mathematical reasoning dataset for efficient training.
- Trains on a cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each) within a 24-hour window.
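A minimal sketch of the group-relative advantage estimation that lets GRPO drop the separate critic model, as referenced in the list above. The function name, tensor shapes, and normalization constant are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Advantages for a group of completions sampled from the same prompt.

    rewards: shape (G,), one scalar reward per completion in the group.
    The group's own mean and standard deviation serve as the baseline, so no
    separate learned critic (value network) is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled solutions to one math prompt; two earned the accuracy reward.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.2])
advantages = group_relative_advantages(rewards)
# Above-average completions receive positive advantages (reinforced) and
# below-average ones negative advantages in the policy-gradient update.
```

Because the baseline comes from statistics of the sampled group itself, memory and compute scale only with the policy model, which is what makes the approach feasible on 4 A40 GPUs.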
Implementation Details
- Curates a pool of 39,659 mathematical reasoning questions from existing sources, from which compact training subsets are drawn.
- Adapts the GRPO algorithm to eliminate the need for a separate critic model, reducing computational overhead.
- Implements a rule-based reward system with accuracy, cosine, and format rewards.
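The rule-based reward system named above can be illustrated roughly as follows. The boxed-answer matching, the `<think>` template check, the character-based length measure, the maximum length, the cosine endpoint values, and the simple summation of components are all assumptions made for illustration, not the paper's exact rules.

```python
import math
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final \\boxed{...} answer matches the reference, else 0.0.
    (Simplified: ignores nested braces and equivalent answer forms.)"""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    """Bonus when reasoning is wrapped in <think>...</think> before the answer
    (the exact required template is an assumption)."""
    return 1.0 if re.fullmatch(r"<think>.*?</think>.*", completion, flags=re.DOTALL) else 0.0

def cosine_reward(is_correct: bool, length: int, max_len: int = 3584) -> float:
    """Length-scaled reward via cosine interpolation: shorter correct completions
    score higher than longer correct ones, and shorter incorrect completions are
    penalized harder, discouraging runaway output lengths."""
    progress = min(length / max_len, 1.0)
    if is_correct:
        start, end = 1.0, 0.5      # assumed values at length 0 and at max_len
    else:
        start, end = -1.0, -0.5    # assumed endpoints for incorrect completions
    return end + 0.5 * (start - end) * (1.0 + math.cos(progress * math.pi))

def total_reward(completion: str, gold_answer: str) -> float:
    """Combine the components; equal weighting is an assumption."""
    acc = accuracy_reward(completion, gold_answer)
    return acc + format_reward(completion) + cosine_reward(acc == 1.0, len(completion))
```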
Innovation Points
- Demonstrates rapid reasoning improvements with limited high-quality data.
- Introduces a mix of easy and hard problems to stabilize training and reduce completion lengths (see the sampling sketch after this list).
- Uses cosine rewards to control output length and improve training consistency.
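A hypothetical sketch of assembling the mixed-difficulty subset referenced above. The pool names, the 7,000-sample budget (taken from the abstract), and the easy/hard ratio are assumptions, not the paper's reported split.

```python
import random

def build_mixed_subset(easy_pool: list, hard_pool: list,
                       n_total: int = 7000, easy_fraction: float = 0.3,
                       seed: int = 42) -> list:
    """Sample a fixed fraction of easier problems alongside hard ones so early
    training sees solvable prompts (non-zero rewards) while still covering
    challenging items."""
    rng = random.Random(seed)
    n_easy = int(n_total * easy_fraction)
    n_hard = n_total - n_easy
    subset = rng.sample(easy_pool, n_easy) + rng.sample(hard_pool, n_hard)
    rng.shuffle(subset)
    return subset
```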
Results
Experimental Setup
- Conducts three experiments to explore model behavior and performance under resource constraints.
- Evaluates performance on mathematical reasoning benchmarks: AMC23, AIME24, MATH-500, Minerva, and OlympiadBench.
- Compares results against baseline models, including Llama-3.1-70B-Instruct, o1-preview, and DeepScaleR-1.5B-Preview.
Key Findings
- Small LLMs achieve rapid reasoning improvements within 50–100 steps, but performance degrades with prolonged training.
- Incorporating a mix of easy and hard problems enhances early performance and stabilizes reasoning behavior.
- Cosine rewards stabilize completion lengths but require extended length limits for extremely hard tasks.
- Open-RS variants outperform most baselines, achieving competitive reasoning performance with minimal resources.
Limitations
- Training duration and length constraints limit the exploration of long-term behavior and complex tasks.
- Multilingual drift in base models complicates monolingual optimization.
- Evaluation focused exclusively on mathematical reasoning, leaving generalizability to other domains unexplored.