ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

Abstract

We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform powerful LLMs such as OpenAI o1-preview and DeepSeek V3 in mathematical reasoning. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or related reasoning problems; (ii) hierarchical reinforcement learning on a sequence of thought templates instead of long chains of thought (CoTs), optimizing a base LLM to plan an optimal template trajectory for handling complex problems step by step; (iii) a new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the American Invitational Mathematics Examination (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux
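To make the notions of a thought template library and a template trajectory concrete, the following is a minimal illustrative sketch, not the ReasonFlux implementation. Every name in it (ThoughtTemplate, TemplateLibrary, plan_trajectory, the tag sets, and the three example templates) is a hypothetical assumption introduced here. In particular, simple tag overlap stands in for the planner, which the abstract describes as a base LLM optimized with hierarchical reinforcement learning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThoughtTemplate:
    """One high-level, reusable solution strategy (hypothetical schema)."""
    name: str
    tags: frozenset
    steps: tuple


class TemplateLibrary:
    """Toy stand-in for a structured library of thought templates."""

    def __init__(self, templates):
        self._templates = list(templates)

    def retrieve(self, query_tags):
        """Return the template sharing the most tags with the query, or None."""
        query = frozenset(query_tags)
        best = max(self._templates, key=lambda t: len(t.tags & query), default=None)
        if best is None or not (best.tags & query):
            return None
        return best


def plan_trajectory(library, subproblem_tags):
    """Map each sub-problem (given as a tag set) to a template, in order.

    Assumption: tag overlap replaces the learned planner purely for illustration.
    """
    trajectory = []
    for tags in subproblem_tags:
        template = library.retrieve(tags)
        if template is not None:
            trajectory.append(template)
    return trajectory


# Hypothetical example templates; the real library is not reproduced here.
LIBRARY = TemplateLibrary([
    ThoughtTemplate("trig_substitution", frozenset({"radical", "integral"}),
                    ("substitute x = sin(t)", "simplify", "back-substitute")),
    ThoughtTemplate("vieta", frozenset({"polynomial", "roots"}),
                    ("write symmetric sums", "relate sums to coefficients")),
    ThoughtTemplate("am_gm", frozenset({"inequality", "positive"}),
                    ("group terms", "apply AM-GM", "check equality case")),
])

# A problem decomposed into two sub-problems yields a two-template trajectory.
trajectory = plan_trajectory(LIBRARY, [{"polynomial", "roots"}, {"inequality", "positive"}])
print([t.name for t in trajectory])
```

The sketch only shows the data flow suggested by the abstract: a problem is broken into sub-problems, each sub-problem is matched to a high-level template, and the ordered matches form the trajectory that guides step-by-step solving.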
