Reasoning Language Models: A Blueprint
January 20, 2025
作者: Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
cs.AI
Abstract
Reasoning language models (RLMs), also known as Large Reasoning Models
(LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have
redefined AI's problem-solving capabilities by extending large language models
(LLMs) with advanced reasoning mechanisms. Yet, their high costs, proprietary
nature, and complex architectures - uniquely combining Reinforcement Learning
(RL), search heuristics, and LLMs - present accessibility and scalability
challenges. To address these, we propose a comprehensive blueprint that
organizes RLM components into a modular framework, based on a survey and
analysis of all RLM works. This blueprint incorporates diverse reasoning
structures (chains, trees, graphs, and nested forms), reasoning strategies
(e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models
and others), and supervision schemes (Output-Based and Process-Based
Supervision). We also provide detailed mathematical formulations and
algorithmic specifications to simplify RLM implementation. By showing how
schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as
special cases, we demonstrate the blueprint's versatility and unifying
potential. To illustrate its utility, we introduce x1, a modular implementation
for rapid RLM prototyping and experimentation. Using x1 and a literature
review, we provide key insights, such as multi-phase training for policy and
value models, and the importance of familiar training distributions. Finally,
we outline how RLMs can integrate with a broader LLM ecosystem, including tools
and databases. Our work demystifies RLM construction, democratizes advanced
reasoning capabilities, and fosters innovation, aiming to mitigate the gap
between "rich AI" and "poor AI" by lowering barriers to RLM development and
experimentation.
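The modular decomposition the abstract describes — a reasoning structure (chain, tree, or graph), a policy model proposing steps, a value model scoring them, and a reasoning strategy such as beam search — can be illustrated with a minimal sketch. The code below is not the paper's x1 implementation; all names are hypothetical, and the policy and value functions are placeholder stand-ins for trained models.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One node of the reasoning structure (here, a tree of partial solutions)."""
    steps: list          # reasoning steps accumulated so far
    score: float = 0.0   # value-model estimate of this partial solution

def policy(node):
    """Stand-in policy model: propose candidate next reasoning steps."""
    return [node.steps + [f"step-{len(node.steps)}-{i}"] for i in range(3)]

def value(steps):
    """Stand-in value model: score a partial reasoning chain."""
    return float(len(steps))  # placeholder heuristic, not a learned model

def beam_search(root, width=2, depth=3):
    """Reasoning strategy: expand the tree level by level,
    keeping only the `width` best-scoring nodes at each level."""
    frontier = [root]
    for _ in range(depth):
        children = [Node(s, value(s)) for n in frontier for s in policy(n)]
        frontier = sorted(children, key=lambda n: n.score, reverse=True)[:width]
    return frontier

best = beam_search(Node([]))
```

Swapping `beam_search` for a Monte Carlo Tree Search loop, or the tree `Node` for a graph structure, changes only one module — which is the blueprint's point: structures, strategies, and models vary independently.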