From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

November 25, 2024
Authors: Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
cs.AI

Abstract

Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising research area. Paper list and more resources about LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.
