

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

January 18, 2025
Authors: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
cs.AI

Abstract

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
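The abstract describes Step-KTO as combining process-level and outcome-level binary feedback during training. As a rough illustration of that idea only (not the paper's actual implementation), the sketch below applies a KTO-style binary loss to per-step judgments and to the final-answer judgment and blends the two. The function names, the `beta` and `w_step` hyperparameters, and the simplified reference-point handling are all assumptions for the sake of a runnable example.

```python
# Minimal sketch of stepwise + outcome binary feedback in a KTO-style objective.
# This is an illustrative assumption-based example, not the authors' code.
import torch


def kto_style_loss(policy_logratio: torch.Tensor,
                   label: torch.Tensor,
                   beta: float = 0.1,
                   ref_point: float = 0.0) -> torch.Tensor:
    """KTO-flavoured loss over binary (desirable=1 / undesirable=0) labels.

    policy_logratio: log pi_theta(y|x) - log pi_ref(y|x) for each sample.
    label: 1.0 if the sample (a reasoning step or the final answer) was judged
           correct by the binary feedback signal, else 0.0.
    ref_point: simplified stand-in for the reference term in the KTO objective.
    """
    value = torch.sigmoid(beta * (policy_logratio - ref_point))
    # Push desirable samples toward high value, undesirable ones toward low value.
    return torch.where(label > 0.5, 1.0 - value, value).mean()


def step_kto_loss(step_logratios: torch.Tensor,
                  step_labels: torch.Tensor,
                  outcome_logratio: torch.Tensor,
                  outcome_label: torch.Tensor,
                  w_step: float = 0.5) -> torch.Tensor:
    """Blend process-level (per-step) and outcome-level binary feedback."""
    process_loss = kto_style_loss(step_logratios, step_labels)
    outcome_loss = kto_style_loss(outcome_logratio, outcome_label)
    return w_step * process_loss + (1.0 - w_step) * outcome_loss


if __name__ == "__main__":
    # Toy example: four reasoning steps with binary judgments, plus a
    # final-answer judgment, for a single solution trace.
    step_logratios = torch.randn(4)
    step_labels = torch.tensor([1.0, 1.0, 0.0, 1.0])
    outcome_logratio = torch.randn(1)
    outcome_label = torch.tensor([1.0])
    print(step_kto_loss(step_logratios, step_labels,
                        outcome_logratio, outcome_label))
```

In this sketch, steps judged incorrect contribute a penalty even when the final answer is right, which mirrors the abstract's point that outcome-only feedback can reward superficial shortcuts.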
