Reliable, Reproducible, and Really Fast Leaderboards with Evalica
December 15, 2024
Author: Dmitry Ustalov
cs.AI
Abstract
The rapid advancement of natural language processing (NLP) technologies, such
as instruction-tuned large language models (LLMs), calls for the development of
modern evaluation protocols with human and machine feedback. We introduce
Evalica, an open-source toolkit that facilitates the creation of reliable and
reproducible model leaderboards. This paper presents its design, evaluates its
performance, and demonstrates its usability through its Web interface,
command-line interface, and Python API.