Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
February 18, 2025
Authors: Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
cs.AI
Abstract
LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become
a widely adopted auto-evaluation method. However, its reliability is
compromised by the CoT reasoning's inability to capture comprehensive and
deeper details, often leading to incomplete outcomes. Existing methods mainly
rely on majority voting or criteria expansion, which are insufficient to address
this limitation of CoT. We propose Crowd-based Comparative Evaluation, which
introduces additional crowd responses to compare with the candidate responses,
thereby exposing deeper and more comprehensive details within the candidate
responses. This process effectively guides LLM-as-a-Judge to provide a more
detailed CoT judgment. Extensive experiments demonstrate that our approach
enhances evaluation reliability, achieving an average accuracy gain of 6.7%
across five benchmarks. Moreover, our method produces higher-quality CoTs that
facilitate judge distillation and exhibit superior performance in rejection
sampling for supervised fine-tuning (SFT), referred to as crowd rejection
sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs
generated by our method are more comprehensive and of higher quality, and evaluation
accuracy improves as inference scales.
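To make the crowd-comparison idea concrete, below is a minimal sketch of how such a judging step might be set up. It assumes an OpenAI-compatible chat client; the prompt wording, model name, and function names are illustrative placeholders, not the authors' actual implementation.

```python
# Illustrative sketch of crowd-comparative judging (not the paper's exact prompts or pipeline).
# Assumes an OpenAI-compatible client and API key; model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()

def crowd_comparative_judge(instruction, response_a, response_b, crowd_responses,
                            model="gpt-4o-mini"):
    """Ask an LLM judge to compare two candidate responses, using additional
    crowd responses as reference points so the CoT critique surfaces details
    that a pairwise comparison alone might miss."""
    crowd_block = "\n\n".join(
        f"[Crowd response {i + 1}]\n{r}" for i, r in enumerate(crowd_responses)
    )
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"[Response A]\n{response_a}\n\n[Response B]\n{response_b}\n\n"
        f"Additional crowd responses for comparison:\n{crowd_block}\n\n"
        "First contrast each candidate with the crowd responses to expose strengths "
        "and weaknesses, then give a step-by-step judgment and end with "
        "'Winner: A' or 'Winner: B'."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content
```

The same detailed CoT judgments could then be reused downstream, e.g. to filter candidate responses for SFT in the crowd rejection sampling setting described in the abstract.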