JudgeBench: A Benchmark for Evaluating LLM-based Judges

Abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.

