ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

Abstract

It is known that hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce a series of models distilled from Qwen 2.5 and based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond that of transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can use any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
