Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Abstract

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. However, methods such as chain-of-thought prompting and self-consistency sampling often focus on final-answer correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to follow a logical progression rather than rely on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final-answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
