IberBench: LLM Evaluation on Iberian Languages

April 23, 2025
Authors: José Ángel González, Ian Borrego Obrador, Álvaro Romo Herrero, Areg Mikael Sarvazyan, Mara Chinea-Ríos, Angelo Basile, Marc Franco-Salvador
cs.AI

Abstract

Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.
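
The abstract describes an open-source pipeline with incremental evaluation and per-language leaderboard views. The snippet below is a minimal Python sketch of what such incremental aggregation could look like conceptually; the class and method names (TaskResult, LeaderboardEntry, add_results) are hypothetical illustrations and do not correspond to IberBench's actual API or schema.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical records for illustration only; IberBench's real schema may differ.
@dataclass
class TaskResult:
    task_category: str   # e.g. "sentiment analysis", "toxicity detection"
    language: str        # e.g. "es", "gl", "eu", "ca", "pt"
    score: float         # normalized metric in [0, 1]

@dataclass
class LeaderboardEntry:
    model_name: str
    results: list[TaskResult] = field(default_factory=list)

    def add_results(self, new_results: list[TaskResult]) -> None:
        """Incremental evaluation: append scores for newly accepted datasets
        without recomputing results already on the leaderboard."""
        self.results.extend(new_results)

    def average_by_language(self) -> dict[str, float]:
        """Aggregate per-language averages for a leaderboard-style view."""
        by_lang: dict[str, list[float]] = {}
        for r in self.results:
            by_lang.setdefault(r.language, []).append(r.score)
        return {lang: mean(scores) for lang, scores in by_lang.items()}

# Usage sketch: a model evaluated on two task categories, then updated later
# when a new community-submitted dataset is accepted.
entry = LeaderboardEntry("example-llm-7b")
entry.add_results([
    TaskResult("sentiment analysis", "es", 0.71),
    TaskResult("toxicity detection", "gl", 0.48),
])
entry.add_results([TaskResult("summarization", "eu", 0.52)])
print(entry.average_by_language())  # {'es': 0.71, 'gl': 0.48, 'eu': 0.52}
```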
