Efficient Pretraining Length Scaling
April 21, 2025
Authors: Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou
cs.AI
Abstract
Recent advances in large language models have demonstrated the effectiveness
of length scaling during post-training, yet its potential in pre-training
remains underexplored. We present the Parallel Hidden Decoding Transformer
(PHD-Transformer), a novel framework that enables efficient length
scaling during pre-training while maintaining inference efficiency.
PHD-Transformer achieves this through an innovative KV cache
management strategy that distinguishes between original tokens and hidden
decoding tokens. By retaining only the KV cache of original tokens for
long-range dependencies while immediately discarding hidden decoding tokens
after use, our approach maintains the same KV cache size as the vanilla
transformer while enabling effective length scaling. To further enhance
performance, we introduce two optimized variants: PHD-SWA employs
sliding window attention to preserve local dependencies, while
PHD-CSWA implements chunk-wise sliding window attention to eliminate
linear growth in pre-filling time. Extensive experiments demonstrate consistent
improvements across multiple benchmarks.
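To make the described KV-cache policy concrete, below is a minimal sketch, assuming single-head attention and toy numpy tensors; it is not the authors' implementation, and names such as `kv_of`, `decode_step`, and `attend` are hypothetical. It illustrates the core idea from the abstract: each original token is accompanied by transient hidden decoding tokens whose keys and values are visible within the current step but never appended to the persistent cache, so the cache grows exactly as in a vanilla transformer.

```python
# Illustrative sketch (not the paper's code) of the PHD-style KV-cache policy:
# hidden decoding tokens attend alongside the original token, but only the
# original token's KV pair is retained for future steps.
import numpy as np

def attend(q, keys, values):
    """Single-head scaled dot-product attention of one query over a KV list."""
    k = np.stack(keys)                      # (n, d)
    v = np.stack(values)                    # (n, d)
    scores = k @ q / np.sqrt(q.shape[-1])   # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v                      # (d,)

def decode_step(q_orig, q_hidden, kv_cache, kv_of):
    """One step for an original token plus its hidden decoding tokens.

    kv_cache : list of (k, v) pairs for previously seen *original* tokens only.
    kv_of    : stand-in for the model's key/value projections, mapping a
               query vector to its (k, v) pair (hypothetical helper).
    """
    k_orig, v_orig = kv_of(q_orig)
    # Transient KVs for this step's hidden decoding tokens.
    hidden_kvs = [kv_of(q) for q in q_hidden]

    keys   = [k for k, _ in kv_cache] + [k_orig] + [k for k, _ in hidden_kvs]
    values = [v for _, v in kv_cache] + [v_orig] + [v for _, v in hidden_kvs]
    out = attend(q_hidden[-1] if q_hidden else q_orig, keys, values)

    # Only the original token's KV is appended; hidden-token KVs are discarded,
    # so the persistent cache size matches a vanilla transformer.
    kv_cache.append((k_orig, v_orig))
    return out, kv_cache

# Toy usage: three hidden decoding tokens per original token.
d = 8
rng = np.random.default_rng(0)
kv_of = lambda q: (rng.standard_normal(d), rng.standard_normal(d))
cache = []
out, cache = decode_step(rng.standard_normal(d),
                         [rng.standard_normal(d) for _ in range(3)],
                         cache, kv_of)
assert len(cache) == 1  # cache grows by one entry per original token only
```

The sliding-window and chunk-wise variants (PHD-SWA, PHD-CSWA) would further restrict which cached entries each query may attend to; the sketch above only covers the base cache-retention rule stated in the abstract.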