Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Abstract

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval-augmentation and tool-use training approaches that treat a search engine as a tool either lack the flexibility for complex multi-turn retrieval or require large-scale supervised data. Prompting advanced, reasoning-capable LLMs to use search engines at inference time is also suboptimal, since the LLM never learns how to interact with the search engine optimally. This paper introduces Search-R1, an extension of the DeepSeek-R1 model in which the LLM learns, solely through reinforcement learning (RL), to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over state-of-the-art (SOTA) baselines. The paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
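
The two training ingredients named in the abstract, retrieved token masking and an outcome-based reward, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(tokens, is_retrieved)` segment representation, and the choice of normalized exact match as the outcome reward are all assumptions made here for illustration. The idea shown is that tokens copied in from the search engine are excluded from the training loss, so only tokens the LLM itself generated contribute, and that the reward depends only on the final answer.

```python
import string
from typing import List, Tuple


def build_loss_mask(segments: List[Tuple[List[str], bool]]) -> Tuple[List[str], List[int]]:
    """Flatten rollout segments and mark which tokens count toward the loss.

    Each segment is (tokens, is_retrieved). This representation is hypothetical.
    Mask is 1 for model-generated tokens and 0 for retrieved tokens.
    """
    tokens: List[str] = []
    mask: List[int] = []
    for seg_tokens, is_retrieved in segments:
        tokens.extend(seg_tokens)
        mask.extend([0 if is_retrieved else 1] * len(seg_tokens))
    return tokens, mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the per-token loss over unmasked (model-generated) tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def outcome_reward(predicted: str, gold: str) -> float:
    """Assumed outcome reward: 1.0 on normalized exact match, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


# A toy rollout: model query, retrieved passage, model answer.
segments = [
    (["who", "wrote"], False),
    (["doc", "text", "here"], True),
    (["Tolkien"], False),
]
tokens, mask = build_loss_mask(segments)
loss = masked_mean([1.0, 1.0, 9.0, 9.0, 9.0, 4.0], mask)
reward = outcome_reward("  Tolkien. ", "tolkien")
```

In this toy rollout the three retrieved tokens carry a large per-token loss, but the mask removes them from the average, so the loss reflects only the model's own query and answer tokens.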
