

LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

February 19, 2025
作者: Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, LLMs that excel in short-context settings may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging because human annotation for extended contexts is impractical and because short- and long-context performance are difficult to balance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve and excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions conditioned on long-context inputs and on their compressed short-context counterparts, respectively. This preference reveals capabilities and potential of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate the decline in short-context performance during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 at context lengths from 128K to 512K, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO on both long- and short-context tasks. Specifically, LongPO-trained models achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that rely on extensive long-context annotation and larger parameter scales.
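
To make the described objective concrete, below is a minimal PyTorch sketch of a DPO-style preference loss augmented with a short-to-long constraint, in the spirit of the abstract. The function name longpo_loss, the hyperparameters beta and lambda_kl, and the sequence-level approximation of the KL term are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def longpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                policy_short_logps, ref_short_logps,
                beta=0.1, lambda_kl=0.1):
    """Sketch of a DPO-style loss on self-generated short-to-long pairs,
    plus a penalty that keeps the policy close to the reference model on
    short-context inputs (hypothetical form of the short-to-long constraint)."""
    # Preference term: "chosen" is the response produced from the compressed
    # short-context input, "rejected" is the response from the raw long-context
    # input, both scored under the long-context prompt.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    pref_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

    # Short-to-long constraint, approximated here at the sequence level:
    # penalize the policy when it assigns lower probability than the reference
    # to responses on short-context inputs.
    short_gap = (ref_short_logps - policy_short_logps).clamp(min=0).mean()

    return pref_loss + lambda_kl * short_gap

# Example usage with dummy per-sequence log-probabilities (batch of 2):
dummy = lambda *vals: torch.tensor(vals)
loss = longpo_loss(dummy(-10.0, -12.0), dummy(-14.0, -15.0),
                   dummy(-11.0, -12.5), dummy(-13.0, -14.0),
                   dummy(-9.0, -9.5), dummy(-8.5, -9.0))
```

In this sketch the constraint only penalizes downward drift relative to the reference model on short-context inputs, which is one simple way to preserve short-context behavior while the preference term transfers short-context capabilities to long-context prompts.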

