LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, strong short-context LLMs may underperform in long-context scenarios because they have not been sufficiently aligned for long contexts. This alignment remains challenging because human annotation of extended contexts is impractical and because short- and long-context performance are difficult to balance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring their short-context capabilities. LongPO has an LLM learn from self-generated short-to-long preference data: paired responses generated for the same instruction, one given the long-context input and the other given its compressed short-context counterpart. This preference reveals the capabilities and potential that the LLM acquired during short-context alignment and that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate the decline in short-context performance during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and substantially outperforms naive SFT and DPO on both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales.
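The short-to-long preference data described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `PreferencePair`, `build_pair`, and `generate` are hypothetical, the choice of the short-context response as "chosen" is one reading of the abstract, and the loss shown is the standard DPO loss rather than the LongPO objective, whose exact form the abstract does not give.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    """One short-to-long preference example (field names are illustrative)."""
    prompt: str    # long-context input plus instruction (assumed training prompt)
    chosen: str    # response generated from the compressed short context
    rejected: str  # response generated from the full long context


def build_pair(generate: Callable[[str], str], long_context: str,
               short_context: str, instruction: str) -> PreferencePair:
    """Query the same model twice with the same instruction.

    `generate` is a hypothetical stand-in for the LLM's generation call.
    How the short context is compressed is not specified here; it is an input.
    """
    chosen = generate(f"{short_context}\n\n{instruction}")
    rejected = generate(f"{long_context}\n\n{instruction}")
    return PreferencePair(prompt=f"{long_context}\n\n{instruction}",
                          chosen=chosen, rejected=rejected)


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written so that exp() never overflows
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The sketch does not model the short-to-long KL constraint; it only shows how a self-generated pair could feed a preference loss.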

