Reliable, Reproducible, and Really Fast Leaderboards with Evalica

December 15, 2024
Author: Dmitry Ustalov
cs.AI

Abstract

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.

