
JuStRank: Benchmarking LLM Judges for System Ranking

December 12, 2024
Authors: Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
cs.AI

Abstract

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while remaining agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and a judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including decisiveness and bias.
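
The core procedure in the abstract, aggregating per-instance judgment scores into system scores and then comparing the judge-induced system ranking to a human-based ranking, can be illustrated in a few lines of Python. This is a minimal sketch, not the authors' implementation: the mean aggregation, the toy scores, and the choice of Kendall's tau as the ranking-agreement metric are all assumptions made for illustration.

```python
# Minimal sketch of system-level judge evaluation: aggregate instance-level
# judge scores per system, then compare the judge-induced system ranking to
# a human-based ranking. The mean aggregation and the Kendall's tau metric
# are illustrative assumptions, not necessarily the paper's exact choices.
from scipy.stats import kendalltau

# Hypothetical judge scores (e.g., 1-5 quality ratings) for outputs of each system.
judge_scores = {
    "system_a": [4, 5, 4, 3, 5],
    "system_b": [3, 3, 4, 2, 3],
    "system_c": [5, 4, 5, 5, 4],
}

# Hypothetical gold system-level scores derived from human judgments.
human_scores = {"system_a": 0.71, "system_b": 0.52, "system_c": 0.85}

# 1. Aggregate judgment scores over multiple system outputs into one score per system.
system_scores = {s: sum(v) / len(v) for s, v in judge_scores.items()}

# 2. Assess the judge by rank-correlating its system scores with the human scores.
systems = sorted(judge_scores)
tau, _ = kendalltau(
    [system_scores[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"Judge-human ranking agreement (Kendall tau): {tau:.2f}")
```

This framing makes clear why instance-based evaluation can miss system-level failures: a judge with a systematic bias toward a particular system shifts that system's aggregate score, and hence its rank, even if its per-instance judgments look reasonable in isolation.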

