M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
April 14, 2025
Authors: Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
cs.AI
Abstract
Effective reasoning is crucial to solving complex mathematical problems.
Recent large language models (LLMs) have boosted performance by scaling
test-time computation through long chain-of-thought reasoning. However,
transformer-based models are inherently limited in extending context length due
to their quadratic computational complexity and linear memory requirements. In
this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built
on the Mamba architecture, which allows memory-efficient inference. Our
approach leverages a distillation process from existing reasoning models and is
further enhanced through RL training. Experimental results on the AIME and MATH
benchmarks show that M1 not only outperforms previous linear RNN models but
also matches the performance of state-of-the-art DeepSeek R1 distilled
reasoning models at a similar scale. We also compare our generation speed with
a high-performance, general-purpose inference engine, vLLM, and observe more
than a 3x speedup over a same-size transformer. With this throughput speedup,
we are able to achieve higher accuracy compared to DeepSeek R1 distilled
transformer reasoning models under a fixed generation time budget using
self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model
and provide a more effective approach to scaling test-time generation using
self-consistency or long chain-of-thought reasoning.
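The self-consistency voting mentioned above can be sketched in a few lines: sample several reasoning chains, extract a final answer from each, and return the majority answer. The sketch below is illustrative and not the paper's code; the model-sampling and answer-parsing steps are omitted, and the `answers` list stands in for final answers already parsed from generated chains.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Majority vote over final answers extracted from sampled reasoning chains.

    `answers` is assumed to hold one parsed final answer per sampled chain;
    ties are broken by Counter's insertion order (first answer seen wins).
    """
    counts = Counter(answers)
    best, _ = counts.most_common(1)[0]
    return best

# Example: 5 sampled chains, 3 of which agree on "42"
print(self_consistency_vote(["42", "17", "42", "42", "9"]))  # -> 42
```

Under a fixed generation-time budget, a higher-throughput model can sample more chains, which is why the paper's throughput gains translate into higher voting accuracy.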