Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

November 29, 2024
Authors: Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu
cs.AI

Abstract

Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach, cDPO, designed to automatically recognize and conduct token-level rewards for the critical tokens during the alignment process. Specifically, we develop a contrastive estimation approach to automatically identify critical tokens by comparing the generation likelihood of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories; consequently, they are capable of identifying critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical-token information during the alignment process, we extend the conventional DPO algorithm to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative models as importance weights for token-level DPO learning. Experimental results on the GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed approach, cDPO.
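
The abstract describes two components: contrastive estimation of critical tokens from separately fine-tuned positive and negative models, and a token-level extension of DPO that uses the resulting differential likelihood as importance weights. Below is a minimal PyTorch sketch of these two ideas; it is not the authors' implementation, and the helper names (`per_token_logps`, `critical_token_weights`, `weighted_token_dpo_loss`), the softmax normalization of the weights, and the exact way the weights enter the loss are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def per_token_logps(model, input_ids, attention_mask):
    """Log-probability of each realized token under `model` (teacher forcing)."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Position t predicts token t+1, so shift logits and labels by one.
    logps = F.log_softmax(logits[:, :-1], dim=-1)
    return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)


def critical_token_weights(pos_model, neg_model, input_ids, attention_mask):
    """Contrastive estimation (sketch): tokens that the negative model
    (fine-tuned on incorrect trajectories) assigns much higher likelihood
    than the positive model are treated as likely critical tokens and
    receive larger weights."""
    with torch.no_grad():
        diff = (per_token_logps(neg_model, input_ids, attention_mask)
                - per_token_logps(pos_model, input_ids, attention_mask))
    return torch.softmax(diff, dim=-1)  # assumed normalization over the sequence


def weighted_token_dpo_loss(policy_logps_w, ref_logps_w,
                            policy_logps_l, ref_logps_l,
                            weights_l, beta=0.1):
    """Token-level DPO-style loss (sketch): per-token log-ratios of the
    rejected trajectory are reweighted by the contrastive weights before
    entering the standard DPO logistic objective."""
    margin_w = (policy_logps_w - ref_logps_w).sum(-1)
    margin_l = ((policy_logps_l - ref_logps_l) * weights_l).sum(-1)
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()
```

In training, `policy_logps_*` and `ref_logps_*` would be per-token log-probabilities of the chosen (correct) and rejected (incorrect) trajectories under the policy and a frozen reference model, while `weights_l` is computed once from the positive and negative models for the rejected trajectory.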
