ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

January 26, 2025
作者: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
cs.AI

Abstract

As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. To further investigate expressiveness, we introduce a series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make the RNN more expressive and demonstrates state-tracking ability beyond Transformers. We work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours on 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
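The core idea of the abstract, replacing a Transformer's softmax attention with an RNN-style time-mixing block and training the replacement to imitate the frozen teacher, can be sketched in a few lines. The snippet below is a toy illustration only: `RecurrentAttention` is a hypothetical, heavily simplified per-channel decayed-state recurrence standing in for the actual RWKV-7 time-mixing block, and `distill_step` shows one block-output alignment update against a frozen teacher attention module; neither is the paper's actual code or architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentAttention(nn.Module):
    """Toy linear-recurrent stand-in for an RWKV-style time-mixing block.

    Hypothetical simplification: a learned per-channel decaying state
    replaces quadratic softmax attention, giving O(T) sequential cost
    and a constant-size recurrent state.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.receptance = nn.Linear(d_model, d_model, bias=False)
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.decay = nn.Parameter(torch.zeros(d_model))  # learned per-channel decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, t, d = x.shape
        r = torch.sigmoid(self.receptance(x))   # gate on the recurrent state
        k, v = self.key(x), self.value(x)
        w = torch.sigmoid(self.decay)            # decay factor in (0, 1)
        state = torch.zeros(b, d, device=x.device)
        outs = []
        for i in range(t):                       # causal recurrence over time
            state = w * state + k[:, i] * v[:, i]
            outs.append(r[:, i] * state)
        return torch.stack(outs, dim=1)

def distill_step(student_attn, teacher_fn, x, optimizer):
    """One alignment update: pull the student block's output toward the
    frozen teacher attention's output on the same hidden states."""
    with torch.no_grad():
        target = teacher_fn(x)                   # teacher is frozen
    loss = F.mse_loss(student_attn(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full pipeline this per-block alignment would typically be followed by end-to-end distillation on the teacher's logits (e.g. a KL term), but that stage is omitted here for brevity.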


January 28, 2025