A Controlled Study on Long Context Extension and Generalization in LLMs

Abstract

Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
