JuStRank: Benchmarking LLM Judges for System Ranking

Abstract

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make LLM-based judges a compelling solution to this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
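The system-level setup described in the abstract (aggregate a judge's per-response scores into one score per system, rank the systems, then compare that ranking with a human-based one) can be illustrated with a minimal sketch. The abstract does not specify the aggregation method or the agreement measure, so the mean aggregation and the Kendall's tau comparison below are assumptions chosen for illustration, and the system names and all scores are hypothetical toy data.

```python
from itertools import combinations
from statistics import mean


def system_scores(judgments):
    """Aggregate per-response judge scores into one score per system.

    Assumption: simple mean aggregation; other aggregations are possible.
    """
    return {system: mean(scores) for system, scores in judgments.items()}


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems.

    Counts system pairs ordered the same way (concordant) or the opposite
    way (discordant) by the two scorers. Ties count as neither.
    """
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical judge scores for three responses from each system.
judgments = {
    "system_a": [0.9, 0.8, 0.7],
    "system_b": [0.6, 0.5, 0.7],
    "system_c": [0.2, 0.4, 0.3],
}
# Hypothetical human-based system scores (higher is better).
human_scores = {"system_a": 1200, "system_b": 1100, "system_c": 1150}

judge_scores = system_scores(judgments)
judge_ranking = sorted(judge_scores, key=judge_scores.get, reverse=True)
tau = kendall_tau(judge_scores, human_scores)

print("judge ranking:", judge_ranking)
print("agreement with human ranking (Kendall's tau):", round(tau, 3))
```

In this toy data the judge and the human scores agree on two of the three system pairs and disagree on one, so the agreement is well below the maximum of 1. A judge with a consistent bias towards or against one system would lower this agreement even if its instance-level judgments looked reasonable.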

