NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning

April 15, 2025
作者: Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakhturi, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
cs.AI

Abstract

Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). While prior work has successfully applied RL to mathematical reasoning -- where rules and correctness are well-defined -- generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, etc.; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilize data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency -- using 28% fewer tokens for correct answers -- highlighting more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.
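The abstract describes a four-step data-curation pipeline: multi-domain sourcing, structured templating, verifiable-answer filtering, and data blending. The Python sketch below illustrates what such a pipeline could look like; every function name, template string, and blending weight here is a hypothetical illustration for clarity, not the paper's actual implementation.

```python
import random

# Hypothetical prompt templates for the two answer formats the abstract
# mentions (multiple-choice and open-ended).
MCQ_TEMPLATE = (
    "Question: {question}\n"
    "Choices: {choices}\n"
    "Answer with the letter of the correct choice."
)
OPEN_TEMPLATE = "Question: {question}\nGive a short final answer."


def apply_template(sample):
    """Render a QA pair with a structured template to control
    answer-space complexity (step 2 in the abstract)."""
    if sample.get("choices"):
        labeled = ", ".join(
            f"({chr(65 + i)}) {c}" for i, c in enumerate(sample["choices"])
        )
        return MCQ_TEMPLATE.format(question=sample["question"], choices=labeled)
    return OPEN_TEMPLATE.format(question=sample["question"])


def is_verifiable(sample):
    """Keep only samples a rule-based checker can score (step 3):
    the answer is a listed choice, or a short exact-match string."""
    answer = sample.get("answer", "")
    if sample.get("choices"):
        return answer in sample["choices"]
    return 0 < len(answer.split()) <= 5


def blend(domain_pools, weights, n):
    """Draw a training mix of about n samples from several domain pools
    according to blending weights (step 4). The paper tunes such ratios;
    the values used below are made up."""
    total = sum(weights.values())
    mix = []
    for domain, pool in domain_pools.items():
        k = round(n * weights[domain] / total)
        mix.extend(random.choices(pool, k=k))
    random.shuffle(mix)
    return mix


if __name__ == "__main__":
    stem = [{"question": "What is 2 + 2?", "answer": "4",
             "choices": ["3", "4", "5"]}]
    humanities = [{"question": "Who wrote Hamlet?", "answer": "Shakespeare",
                   "choices": None}]
    kept = [s for s in stem + humanities if is_verifiable(s)]
    batch = blend({"stem": stem, "humanities": humanities},
                  {"stem": 0.6, "humanities": 0.4}, n=4)
    print(apply_template(kept[0]))
```

In an RL loop of the kind the abstract describes, the verifiable answers retained by such a filter would back a rule-based reward (e.g., exact match against the reference) rather than a learned reward model, which is what makes the reward signal scalable beyond math.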
