

LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

October 3, 2024
作者: Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou
cs.AI

Abstract

This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. A Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly on complex Olympiad-level benchmarks, including GPQA, AIME24, and AMC23.
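The preference-aggregation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a hypothetical `prefers(a, b)` predicate standing in for the learned PPRM, and uses a plain Borda count (each solution scored by its number of pairwise wins) rather than the paper's Enhanced Borda Count.

```python
from itertools import permutations

def borda_rank(solutions, prefers):
    """Rank solutions by pairwise wins (simple Borda count).

    `prefers(a, b)` is a stand-in for a learned pairwise reward
    model: it returns True if solution `a` is preferred over `b`.
    """
    wins = {s: 0 for s in solutions}
    # Compare every ordered pair of distinct solutions once.
    for a, b in permutations(solutions, 2):
        if prefers(a, b):
            wins[a] += 1
    # More pairwise wins -> higher global rank.
    return sorted(solutions, key=lambda s: wins[s], reverse=True)

# Toy usage: prefer longer strings, as a placeholder preference.
sols = ["a", "abc", "ab"]
print(borda_rank(sols, lambda x, y: len(x) > len(y)))  # ['abc', 'ab', 'a']
```

In the actual framework the pairwise comparisons come from the PPRM, and the EBC method additionally handles intransitive or inconsistent preferences when synthesizing the global score.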

