Qwen2.5-1M Technical Report

January 26, 2025
Authors: An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang
cs.AI

Abstract

We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long-data synthesis, progressive pre-training, and multi-stage supervised fine-tuning effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand model context lengths by at least four times without additional training. To reduce inference costs, we implement a sparse attention method, along with a chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling, which significantly enhance overall inference performance. Leveraging this framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. The framework provides an efficient and powerful solution for developing applications that require long-context processing with open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that the Qwen2.5-1M models are greatly improved on long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer.
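The two deployment ideas named in the abstract, chunked prefill and training-free length extrapolation, can be illustrated schematically. The sketch below is a hypothetical simplification, not the report's implementation: `chunked_prefill` shows how a long prompt can be processed in fixed-size chunks while carrying the key/value cache forward, and `capped_distance` shows the core intuition behind training-free extrapolation, keeping relative positions within the window the model was trained on. The function names and the `forward` callback are illustrative assumptions.

```python
def chunked_prefill(tokens, chunk_size, forward):
    """Prefill a long prompt chunk by chunk, reusing the growing KV cache,
    so peak activation memory scales with chunk_size, not prompt length."""
    cache = []  # stands in for the key/value cache
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        cache = forward(chunk, cache)  # each chunk attends to cache + itself
    return cache


def capped_distance(q_pos, k_pos, trained_len):
    """Clamp a query-key relative distance so the model never sees a
    positional offset larger than those encountered during training."""
    return min(q_pos - k_pos, trained_len - 1)
```

For example, with a trained window of 128K tokens, a query at position 1,000,000 attending to the first token would be presented a clamped offset of 131,071 rather than an unseen offset of 1,000,000.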
