V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Abstract

The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially for inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. Even so, RISC-V hardware for LLM workloads and the corresponding software ecosystem are not yet fully mature and streamlined, because they require domain-specific tuning. This paper aims to fill this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s for token generation and 6.54/3.68 token/s for prompt processing, with a speedup of up to 2.9x/3.0x compared to our baseline.
