From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

Abstract

Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short in judging subtle attributes and delivering satisfactory results. Recent advances in Large Language Models (LLMs) have inspired the "LLM-as-a-judge" paradigm, in which LLMs are used to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. We then introduce a comprehensive taxonomy that explores LLM-as-a-judge along three dimensions: what to judge, how to judge, and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this area. A paper list and additional resources on LLM-as-a-judge are available at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.
