Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
February 24, 2025
Authors: Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
cs.AI
Abstract
Scaling pre-training compute has proven effective for achieving
multilinguality, but does the same hold for test-time scaling? In this work, we
introduce MCLM, a multilingual math benchmark featuring competition-level
problems in 55 languages. We test three test-time scaling methods on both
Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended
reasoning: Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and
Budget Forcing (BF). Our experiments show that using Qwen2.5-1.5B Math with ORM
achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although
"thinking LLMs" have recently garnered significant attention, we find that
their performance is comparable to traditional scaling methods such as
best-of-N once constrained to similar levels of inference FLOPs. Moreover,
while BF yields a 20-point improvement on English AIME, it provides only a
1.94-point average gain across other languages, a pattern consistent across the
other test-time scaling methods we studied, highlighting that test-time scaling
may not generalize as effectively to multilingual tasks. To foster further
research, we release MCLM, MR1-1.5B, and evaluation results.
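To make the comparison concrete, below is a minimal Python sketch of the two
strategies the abstract contrasts: best-of-N selection scored by an Outcome
Reward Model, and budget forcing applied to a single longer trace. The
generate and orm_score stubs are hypothetical placeholders for the paper's
actual models (Qwen2.5-1.5B Math / MR1-1.5B and a trained reward model), and
the "Wait" continuation cue follows the commonly used budget-forcing recipe;
this is an illustrative sketch under those assumptions, not the authors'
implementation.

    import random

    def generate(prompt: str, max_new_tokens: int) -> str:
        # Placeholder for autoregressive decoding with a policy model.
        return prompt + f" [model continues for up to {max_new_tokens} tokens]"

    def orm_score(problem: str, solution: str) -> float:
        # Placeholder Outcome Reward Model: rates a completed solution.
        return random.random()

    def best_of_n(problem: str, n: int = 8, max_new_tokens: int = 1024) -> str:
        # Best-of-N with ORM: sample n independent solutions, then return
        # the candidate the reward model scores highest.
        candidates = [generate(problem, max_new_tokens) for _ in range(n)]
        return max(candidates, key=lambda c: orm_score(problem, c))

    def budget_forcing(problem: str, budget: int = 8192, rounds: int = 3) -> str:
        # Budget forcing: spend a comparable token budget on ONE longer trace
        # by appending a continuation cue (e.g. "Wait") so the model keeps
        # reasoning instead of stopping early.
        per_round = budget // (rounds + 1)
        trace = generate(problem, per_round)
        for _ in range(rounds):
            trace = generate(trace + " Wait,", per_round)
        return trace

    if __name__ == "__main__":
        problem = "Find all integers n such that n^2 + n + 41 is prime."
        print(best_of_n(problem, n=4))
        print(budget_forcing(problem, budget=4096, rounds=2))

Both routines consume a similar number of decoded tokens (n short samples
versus one long forced trace), which is the FLOPs-matched setting under which
the paper finds the two approaches perform comparably.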