Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Abstract

Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They use autoregressive token generation to construct reasoning trajectories, which enables the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach, cDPO, designed to automatically recognize critical tokens and apply token-level rewards to them during the alignment process. We first develop a contrastive estimation approach that identifies critical tokens automatically by comparing the generation likelihoods of a positive model and a negative model. To this end, we fine-tune the positive and negative models separately on different reasoning trajectories, so that together they can identify the critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithm to token-level DPO and use the differential likelihood from the positive and negative models as importance weights for token-level DPO learning. Experimental results on the GSM8K and MATH500 benchmarks with two widely used model families, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed cDPO approach.
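
The contrastive estimation idea can be illustrated with a minimal sketch. It assumes that per-token log-probabilities from the positive and negative models are already available as plain lists. The scoring rule (a simple log-likelihood difference), the sigmoid weighting, and all function names are illustrative assumptions, not the exact formulation used by cDPO.

```python
import math


def contrastive_token_scores(pos_logprobs, neg_logprobs):
    """Score each token as log p_pos(token) - log p_neg(token).

    Assumption: a plain difference of log-likelihoods. A low score means
    the negative model prefers the token much more than the positive one.
    """
    if len(pos_logprobs) != len(neg_logprobs):
        raise ValueError("both models must score the same token sequence")
    return [p - n for p, n in zip(pos_logprobs, neg_logprobs)]


def most_critical_index(scores):
    """Return the position of the lowest-scoring token."""
    return min(range(len(scores)), key=scores.__getitem__)


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def token_weights(scores):
    """Map scores to weights in (0, 1); lower score gives higher weight.

    Assumption: sigmoid(-score) is one possible way to turn the
    differential likelihood into a per-token weight.
    """
    return [_sigmoid(-s) for s in scores]


# Toy values (hypothetical), one log-probability per token of an incorrect trajectory.
tokens = ["She", "owes", "5", "dollars"]
pos_lp = [-0.2, -3.0, -0.5, -0.1]
neg_lp = [-0.3, -0.4, -0.6, -0.2]

scores = contrastive_token_scores(pos_lp, neg_lp)
idx = most_critical_index(scores)
print(tokens[idx], token_weights(scores)[idx])
```

In this toy trajectory the second token receives the lowest score, so it is flagged as the candidate critical token and gets the largest weight. In a token-level DPO setup, such weights could scale each token's contribution to the loss on the rejected trajectory.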
