A Controlled Study on Long Context Extension and Generalization in LLMs
September 18, 2024
Authors: Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
cs.AI
Abstract
Broad textual understanding and in-context learning require language models
that utilize full document contexts. Due to the implementation challenges
associated with directly training long-context models, many methods have been
proposed for extending models to handle long contexts. However, owing to
differences in data and model classes, it has been challenging to compare these
approaches, leading to uncertainty as to how to evaluate long-context
performance and whether it differs from standard evaluation. We implement a
controlled protocol for extension methods with a standardized evaluation,
utilizing consistent base models and extension data. Our study yields several
insights into long-context behavior. First, we reaffirm the critical role of
perplexity as a general-purpose performance indicator even in longer-context
tasks. Second, we find that current approximate attention methods
systematically underperform across long-context tasks. Finally, we confirm that
exact fine-tuning based methods are generally effective within the range of
their extension, whereas extrapolation remains challenging. All codebases,
models, and checkpoints will be made available open-source, promoting
transparency and facilitating further research in this critical area of AI
development.
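As a rough illustration of the metric at the center of the first finding, the sketch below computes document-level perplexity with a causal language model via the `transformers` library. The model name, context length, and single-forward-pass setup are placeholder assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Minimal sketch (illustrative, not the paper's protocol): document-level
# perplexity of a causal LM over a long input, the general-purpose metric
# the study re-examines for long-context evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def long_context_perplexity(text: str, max_len: int = 32768) -> float:
    """Perplexity of `text`, truncated to `max_len` tokens, in one forward pass."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len]
    with torch.no_grad():
        # With labels == inputs, the model returns the mean per-token
        # negative log-likelihood; exponentiating gives perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```

In practice, long-context evaluations often chunk or slide a window over documents longer than the model's context; the single truncated pass above is kept only to show how the metric is derived from the model's token-level loss.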