Efficient Pretraining Length Scaling
April 21, 2025
Authors: Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou
cs.AI
Abstract
Recent advances in large language models have demonstrated the effectiveness
of length scaling during post-training, yet its potential in pre-training
remains underexplored. We present the Parallel Hidden Decoding Transformer
(PHD-Transformer), a novel framework that enables efficient length
scaling during pre-training while maintaining inference efficiency.
PHD-Transformer achieves this through an innovative KV cache
management strategy that distinguishes between original tokens and hidden
decoding tokens. By retaining only the KV cache of original tokens for
long-range dependencies while immediately discarding hidden decoding tokens
after use, our approach maintains the same KV cache size as the vanilla
transformer while enabling effective length scaling. To further enhance
performance, we introduce two optimized variants: PHD-SWA employs
sliding window attention to preserve local dependencies, while
PHD-CSWA implements chunk-wise sliding window attention to eliminate
linear growth in pre-filling time. Extensive experiments demonstrate consistent
improvements across multiple benchmarks.
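To make the KV-cache policy described above concrete, the following is a minimal sketch, not the authors' implementation: each original token is expanded into several parallel hidden decoding tokens that attend to the cache, but only the original token's key/value pair is appended, so the cache stays the same size as in a vanilla transformer. The expansion by random perturbation, the identity key/value projections, and the mean aggregation are illustrative assumptions.

```python
import torch

def scaled_dot_attention(q, keys, values):
    """Single-query attention over stacked keys/values (illustrative)."""
    scores = keys @ q / q.shape[-1] ** 0.5        # (n,)
    weights = torch.softmax(scores, dim=-1)       # (n,)
    return weights @ values                       # (d,)

def phd_step(x_t, kv_cache, num_hidden=3):
    """One decoding step: expand x_t into hidden decoding tokens, attend,
    then retain only x_t's KV entry in the cache."""
    d = x_t.shape[-1]
    # Hypothetical expansion: perturbed copies stand in for learned hidden decoding tokens.
    hidden_tokens = x_t.unsqueeze(0) + 0.01 * torch.randn(num_hidden, d)

    # Only the original token's key/value pair is cached (identity projections as
    # placeholders; a real model would use learned W_k / W_v projections).
    kv_cache.append((x_t, x_t))
    keys = torch.stack([k for k, _ in kv_cache])
    values = torch.stack([v for _, v in kv_cache])

    # Hidden decoding tokens attend to the cache of original tokens only; their own
    # keys/values are never appended, so the cache size matches a vanilla transformer.
    outputs = torch.stack([scaled_dot_attention(h, keys, values) for h in hidden_tokens])
    # Aggregate the parallel hidden outputs into this step's output (assumed: mean).
    return outputs.mean(dim=0), kv_cache
```

Under this sketch, the per-step compute grows with the number of hidden decoding tokens (the length-scaling effect during pre-training), while memory for the cache does not; the PHD-SWA and PHD-CSWA variants would further restrict which cached entries each hidden token may attend to.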