Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
February 24, 2025
Authors: Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
cs.AI
Abstract
Scaling pre-training compute has proven effective for achieving
multilinguality, but does the same hold for test-time scaling? In this work, we
introduce MCLM, a multilingual math benchmark featuring competition-level
problems in 55 languages. We test three test-time scaling methods on both
Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended
reasoning: Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and
Budget Forcing (BF). Our experiments show that using Qwen2.5-1.5B Math with ORM
achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although
"thinking LLMs" have recently garnered significant attention, we find that
their performance is comparable to traditional scaling methods such as
best-of-N once constrained to similar levels of inference FLOPs. Moreover,
while BF yields a 20-point improvement on English AIME, it provides only a
1.94-point average gain across other languages, a pattern consistent across the
other test-time scaling methods we studied, highlighting that test-time scaling
may not generalize as effectively to multilingual tasks. To foster further
research, we release MCLM, MR1-1.5B, and evaluation results.
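To make the comparison concrete, below is a minimal Python sketch of the two
strategies the abstract contrasts: best-of-N selection scored by an Outcome
Reward Model, and budget forcing applied to a single longer trace. The
generate and orm_score stubs are hypothetical placeholders for the paper's
actual models (Qwen2.5-1.5B Math / MR1-1.5B and a trained reward model), and
the "Wait" continuation cue follows the commonly used budget-forcing recipe;
this is an illustrative sketch under those assumptions, not the authors'
implementation.

    import random

    def generate(prompt: str, max_new_tokens: int) -> str:
        # Placeholder for autoregressive decoding with a policy model.
        return prompt + f" [model continues for up to {max_new_tokens} tokens]"

    def orm_score(problem: str, solution: str) -> float:
        # Placeholder Outcome Reward Model: rates a completed solution.
        return random.random()

    def best_of_n(problem: str, n: int = 8, max_new_tokens: int = 1024) -> str:
        # Best-of-N with ORM: sample n independent solutions, then return
        # the candidate the reward model scores highest.
        candidates = [generate(problem, max_new_tokens) for _ in range(n)]
        return max(candidates, key=lambda c: orm_score(problem, c))

    def budget_forcing(problem: str, budget: int = 8192, rounds: int = 3) -> str:
        # Budget forcing: spend a comparable token budget on ONE longer trace
        # by appending a continuation cue (e.g. "Wait") so the model keeps
        # reasoning instead of stopping early.
        per_round = budget // (rounds + 1)
        trace = generate(problem, per_round)
        for _ in range(rounds):
            trace = generate(trace + " Wait,", per_round)
        return trace

    if __name__ == "__main__":
        problem = "Find all integers n such that n^2 + n + 41 is prime."
        print(best_of_n(problem, n=4))
        print(budget_forcing(problem, budget=4096, rounds=2))

Both routines consume a similar number of decoded tokens (n short samples
versus one long forced trace), which is the FLOPs-matched setting under which
the paper finds the two approaches perform comparably.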