
NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning

April 15, 2025
Authors: Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakhturi, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
cs.AI

Abstract

Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). While prior work has successfully applied RL to mathematical reasoning -- where rules and correctness are well-defined -- generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, etc.; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilize data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency -- using 28% fewer tokens for correct answers -- highlighting more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.

