
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

February 24, 2025
Authors: Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
cs.AI

Abstract

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods: Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), applying each to both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to that of traditional scaling methods such as best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages, a pattern consistent across the other test-time scaling methods we studied, highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
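The methods compared above spend inference compute in different shapes: best-of-N with an ORM allocates the budget to many parallel reasoning chains and keeps the highest-scored one, while budget forcing allocates it to a single, deliberately prolonged chain. The sketch below is a minimal illustration of that control-flow contrast, not the authors' released code; sample_completion and orm_score are hypothetical stubs standing in for actual LLM decoding and reward-model scoring.

# Illustrative sketch only (not the paper's implementation): contrasts
# best-of-N selection guided by an outcome reward model (ORM) with
# budget forcing (BF). sample_completion and orm_score are hypothetical
# stubs replacing real LLM sampling and reward-model scoring.

import random

def sample_completion(problem: str, max_tokens: int) -> str:
    """Hypothetical stand-in for sampling one reasoning chain from an LLM."""
    return f"candidate solution to {problem!r} ({max_tokens}-token budget)"

def orm_score(problem: str, completion: str) -> float:
    """Hypothetical stand-in for an outcome reward model's scalar score."""
    return random.random()

def best_of_n(problem: str, n: int = 8, max_tokens: int = 512) -> str:
    # Parallel scaling: sample N independent chains and keep the one the
    # ORM ranks highest. Inference cost grows linearly with N.
    candidates = [sample_completion(problem, max_tokens) for _ in range(n)]
    return max(candidates, key=lambda c: orm_score(problem, c))

def budget_forcing(problem: str, budget_tokens: int = 4096) -> str:
    # Sequential scaling: spend the whole budget on ONE long chain. When
    # the model would stop early, append a continuation cue (e.g. "Wait")
    # and keep decoding until the token budget is exhausted.
    chain, spent, step = "", 0, 512
    while spent < budget_tokens:
        chain += sample_completion(problem, step) + "\nWait, "
        spent += step
    return chain

if __name__ == "__main__":
    problem = "AIME-style competition problem"
    print(best_of_n(problem))       # parallel scaling, reward-guided selection
    print(budget_forcing(problem))  # sequential scaling, extended reasoning

Under this framing, N chains of length L and one forced chain of roughly N x L tokens consume roughly comparable inference FLOPs, which is the equalized-compute comparison the abstract draws between "thinking LLMs" and best-of-N.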
