

JudgeBench: A Benchmark for Evaluating LLM-based Judges

October 16, 2024
Authors: Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica
cs.AI

Abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench .
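To make the evaluation setup concrete, below is a minimal sketch of how a pairwise judge could be scored against JudgeBench-style response pairs with objective correctness labels. The record schema, the swap-both-orderings protocol, and the half-credit rule for inconsistent verdicts are assumptions for illustration, not the paper's exact pipeline; the real data and evaluation code are at https://github.com/ScalerLab/JudgeBench. The half-credit rule keeps the random-guessing baseline at 0.5, which is the reference point the abstract compares strong models against.

```python
import random
from typing import Callable, Dict, List

# Hypothetical record schema for illustration only; the actual JudgeBench
# data format is defined in https://github.com/ScalerLab/JudgeBench.
Pair = Dict[str, str]  # keys: "question", "response_A", "response_B", "label" ("A" or "B")


def evaluate_judge(judge_fn: Callable[[str, str, str], str], pairs: List[Pair]) -> float:
    """Score a pairwise judge against objective correctness labels.

    Each pair is judged in both orderings to mitigate position bias:
    full credit if both verdicts pick the labeled response, half credit
    if the two verdicts disagree, zero otherwise. A random judge scores
    0.5 in expectation under this rule.
    """
    score = 0.0
    for p in pairs:
        label = p["label"]
        v1 = judge_fn(p["question"], p["response_A"], p["response_B"])  # returns "A" or "B"
        v2 = judge_fn(p["question"], p["response_B"], p["response_A"])  # swapped order
        v2 = "A" if v2 == "B" else "B"  # map the swapped verdict back to original labels
        if v1 == label and v2 == label:
            score += 1.0
        elif v1 != v2:
            score += 0.5
    return score / len(pairs)


def random_judge(question: str, first: str, second: str) -> str:
    """Baseline judge: picks a response uniformly at random."""
    return random.choice(["A", "B"])


if __name__ == "__main__":
    toy_pairs = [
        {"question": "What is 2 + 2?", "response_A": "4", "response_B": "5", "label": "A"},
        {"question": "Is 17 prime?", "response_A": "No, it is composite.",
         "response_B": "Yes, it is prime.", "label": "B"},
    ]
    print(f"Random-judge score: {evaluate_judge(random_judge, toy_pairs):.2f}")
```

A prompted LLM judge would slot in as another `judge_fn` that sends the question and both responses to a model and parses its verdict; the abstract's finding is that even strong models such as GPT-4o land only slightly above the 0.5 baseline on this benchmark.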

