Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
February 18, 2025
Authors: Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
cs.AI
Abstract
LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become
a widely adopted auto-evaluation method. However, its reliability is
compromised by the CoT reasoning's inability to capture comprehensive and
deeper details, often leading to incomplete outcomes. Existing methods mainly
rely on majority voting or criteria expansion, which are insufficient to address
this limitation of CoT. We propose Crowd-based Comparative Evaluation, which
introduces additional crowd responses to compare with the candidate responses,
thereby exposing deeper and more comprehensive details within the candidate
responses. This process effectively guides LLM-as-a-Judge to provide a more
detailed CoT judgment. Extensive experiments demonstrate that our approach
enhances evaluation reliability, achieving an average accuracy gain of 6.7%
across five benchmarks. Moreover, our method produces higher-quality CoTs that
facilitate judge distillation and exhibit superior performance in rejection
sampling for supervised fine-tuning (SFT), referred to as crowd rejection
sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs
generated by our method are more comprehensive and of higher quality, and evaluation
accuracy improves as inference scales.
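To make the crowd-comparison idea concrete, below is a minimal sketch of how such a judging step might be set up. It assumes an OpenAI-compatible chat client; the prompt wording, model name, and function names are illustrative placeholders, not the authors' actual implementation.

```python
# Illustrative sketch of crowd-comparative judging (not the paper's exact prompts or pipeline).
# Assumes an OpenAI-compatible client and API key; model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()

def crowd_comparative_judge(instruction, response_a, response_b, crowd_responses,
                            model="gpt-4o-mini"):
    """Ask an LLM judge to compare two candidate responses, using additional
    crowd responses as reference points so the CoT critique surfaces details
    that a pairwise comparison alone might miss."""
    crowd_block = "\n\n".join(
        f"[Crowd response {i + 1}]\n{r}" for i, r in enumerate(crowd_responses)
    )
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"[Response A]\n{response_a}\n\n[Response B]\n{response_b}\n\n"
        f"Additional crowd responses for comparison:\n{crowd_block}\n\n"
        "First contrast each candidate with the crowd responses to expose strengths "
        "and weaknesses, then give a step-by-step judgment and end with "
        "'Winner: A' or 'Winner: B'."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content
```

The same detailed CoT judgments could then be reused downstream, e.g. to filter candidate responses for SFT in the crowd rejection sampling setting described in the abstract.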