Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
November 29, 2024
Authors: Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu
cs.AI
Abstract
Large Language Models (LLMs) have exhibited remarkable performance on
reasoning tasks. They utilize autoregressive token generation to construct
reasoning trajectories, enabling the development of a coherent chain of
thought. In this work, we explore the impact of individual tokens on the final
outcomes of reasoning tasks. We identify the existence of "critical tokens"
that lead to incorrect reasoning trajectories in LLMs. Specifically, we find
that LLMs tend to produce positive outcomes when forced to decode other tokens
instead of critical tokens. Motivated by this observation, we propose a novel
approach, cDPO, designed to automatically recognize critical tokens and apply
token-level rewards to them during the alignment process. Specifically, we
develop a contrastive estimation approach to automatically identify critical
tokens by comparing the generation likelihoods of positive and negative models.
To this end, we separately fine-tune the positive and negative models on various
reasoning trajectories; consequently, they are capable of identifying critical
tokens within incorrect trajectories
that contribute to erroneous outcomes. Moreover, to further align the model
with the critical token information during the alignment process, we extend the
conventional DPO algorithms to token-level DPO and utilize the differential
likelihood from the aforementioned positive and negative models as the
importance weight for token-level DPO learning. Experimental results on the
GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B)
and deepseek-math (7B), demonstrate the effectiveness of the proposed cDPO
approach.
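The abstract describes two components: contrastive estimation of critical tokens from the likelihood gap between a positive and a negative model, and a token-level extension of DPO that uses that gap as a per-token weight. The snippet below is a minimal sketch of this idea, not the authors' released implementation; the function names, the softmax normalization of the likelihood gap, the beta value, and the choice to weight only the rejected (incorrect) trajectory are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def token_logprobs(model, input_ids, attention_mask):
    """Per-token log-likelihood of each target token under a causal LM.
    (For the trainable policy model, compute this without no_grad.)"""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Logits at position i predict token i+1, so shift targets by one.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    return torch.gather(logprobs, -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def contrastive_token_weights(pos_model, neg_model, input_ids, attention_mask):
    """Contrastive estimation (assumed form): tokens that the negative model
    (fine-tuned on incorrect trajectories) likes more than the positive model
    are treated as more likely to be critical; the gap is normalized into
    per-token weights. Padding handling is omitted for brevity."""
    lp_pos = token_logprobs(pos_model, input_ids, attention_mask)
    lp_neg = token_logprobs(neg_model, input_ids, attention_mask)
    return torch.softmax(lp_neg - lp_pos, dim=-1)  # shape: [batch, seq_len - 1]

def token_level_dpo_loss(policy_lp_w, ref_lp_w, policy_lp_l, ref_lp_l,
                         weights_l, beta=0.1):
    """Token-level DPO (assumed placement of the weights): per-token log-ratios
    of the rejected trajectory are reweighted so that critical tokens
    contribute more to the preference margin."""
    margin_w = (policy_lp_w - ref_lp_w).sum(-1)
    margin_l = ((policy_lp_l - ref_lp_l) * weights_l).sum(-1)
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()
```

How the gap is normalized and whether both trajectories or only the incorrect one are reweighted follow the paper's actual formulation, which may differ from this sketch.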