JuStRank: Benchmarking LLM Judges for System Ranking
December 12, 2024
Authors: Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
cs.AI
Abstract
Given the rapid progress of generative AI, there is a pressing need to
systematically compare and choose between the numerous models and
configurations available. The scale and versatility of such evaluations make
the use of LLM-based judges a compelling solution for this challenge.
Crucially, this approach first requires validating the quality of the LLM
judge itself. Previous work has focused on instance-based assessment of LLM
judges, where a judge is evaluated over a set of responses, or response pairs,
while being agnostic to their source systems. We argue that this setting
overlooks critical factors affecting system-level ranking, such as a judge's
positive or negative bias towards certain systems. To address this gap, we
conduct the first large-scale study of LLM judges as system rankers. System
scores are generated by aggregating judgment scores over multiple system
outputs, and the judge's quality is assessed by comparing the resulting system
ranking to a human-based ranking. Beyond overall judge assessment, our analysis
provides a fine-grained characterization of judge behavior, including their
decisiveness and bias.
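The evaluation pipeline described above can be sketched in a few lines: per-output judge scores are aggregated into a single score per system, systems are ranked by that score, and the judge-induced ranking is compared to a human-based ranking via a rank correlation. This is a minimal illustration, not the paper's exact method; the mean aggregation, the toy scores, and the system names are all assumptions for the example.

```python
from statistics import mean

def system_scores(judge_scores):
    """Aggregate per-output judge scores into one score per system
    (simple mean aggregation, assumed here for illustration)."""
    return {sys: mean(scores) for sys, scores in judge_scores.items()}

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings over the same systems
    (basic tau without tie correction)."""
    systems = list(rank_a)
    concordant = discordant = 0
    for i in range(len(systems)):
        for j in range(i + 1, len(systems)):
            a = rank_a[systems[i]] - rank_a[systems[j]]
            b = rank_b[systems[i]] - rank_b[systems[j]]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical judge scores for three systems, four judged outputs each
judge_scores = {
    "model_a": [0.9, 0.8, 0.85, 0.95],
    "model_b": [0.6, 0.7, 0.65, 0.6],
    "model_c": [0.75, 0.7, 0.8, 0.7],
}
scores = system_scores(judge_scores)

# Rank systems by aggregated score (1 = best)
judge_rank = {s: r + 1 for r, (s, _) in
              enumerate(sorted(scores.items(), key=lambda kv: -kv[1]))}

# Hypothetical human-based ranking to compare against
human_rank = {"model_a": 1, "model_b": 3, "model_c": 2}

tau = kendall_tau(judge_rank, human_rank)
```

A tau of 1.0 means the judge's ranking fully agrees with the human-based ranking; a judge with a strong positive bias toward one system would inflate that system's aggregated score and lower the correlation.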