M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

April 14, 2025
作者: Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
cs.AI

Abstract

Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed with vLLM, a high-performance general-purpose inference engine, and observe more than a 3x speedup over a same-size transformer. With this throughput speedup, we achieve higher accuracy than DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
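The self-consistency argument in the abstract can be sketched concretely: under a fixed wall-clock budget, higher throughput buys more sampled chains, and the final answer is chosen by majority vote over those chains. The sketch below is illustrative only; the function names, latencies, and sampled answers are invented for the example and are not from the paper.

```python
from collections import Counter

def self_consistency_vote(samples):
    """Majority-vote over the final answers of independently sampled chains."""
    return Counter(samples).most_common(1)[0][0]

def samples_within_budget(budget_s, latency_per_sample_s):
    """How many full generations fit in a fixed wall-clock budget."""
    return int(budget_s // latency_per_sample_s)

# With a ~3x throughput speedup, the same budget buys ~3x as many votes
# (hypothetical latencies, chosen only to illustrate the ratio).
budget = 60.0
n_transformer = samples_within_budget(budget, 12.0)  # 5 samples
n_m1 = samples_within_budget(budget, 4.0)            # 15 samples

# Hypothetical final answers parsed from 5 sampled chains of thought.
answers = ["42", "41", "42", "42", "43"]
print(self_consistency_vote(answers))  # "42"
```

More votes generally improve accuracy because independent sampling errors are averaged out, which is why the throughput advantage translates into an accuracy advantage at a fixed time budget.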

